An antimicrobial peptide recognition method based on BERT and Text-CNN
Author:
Funding: National Key Research and Development Program of China (2021YFA1301603)

    Abstract:

    Antimicrobial peptides (AMPs) are small-molecule peptides that are widely found in living organisms and exhibit broad-spectrum antibacterial activity and immunomodulatory effects. Because resistance to AMPs emerges slowly and they have excellent clinical potential and a wide range of applications, AMPs are strong alternatives to conventional antibiotics. AMP recognition is an important direction in AMP research. The high cost, low efficiency, and long turnaround of wet-lab methods prevent them from meeting the need for large-scale AMP recognition; computer-aided identification methods are therefore an important complement, and a key issue is how to improve their accuracy. A protein sequence can be viewed approximately as text in a language whose alphabet is the amino acids, so natural language processing (NLP) techniques may extract rich features from it. In this paper, we combine the pre-trained NLP model BERT with a Text-CNN fine-tuning head to model the protein language, develop an open-source antimicrobial peptide recognition tool, and compare it with five previously published tools. The experimental results show that optimizing the two-phase pre-train/fine-tune strategy yields an overall improvement in accuracy, sensitivity, specificity, and the Matthews correlation coefficient, offering a new direction for further research on AMP recognition algorithms.
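The abstract couples a pre-trained BERT encoder with a Text-CNN fine-tuning head but gives no architectural details here. As an illustration only, the following is a minimal NumPy sketch of a Text-CNN forward pass over per-residue embeddings; the random, untrained weights stand in for both BERT's output vectors and the learned filters, and all dimensions and the example sequence are hypothetical:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

rng = np.random.default_rng(0)
EMBED_DIM = 8                 # stand-in for BERT's per-residue vector size
FILTER_WIDTHS = (2, 3, 4)     # n-gram windows over the sequence
N_FILTERS = 4                 # filters per window width

embedding = rng.normal(size=(len(AMINO_ACIDS), EMBED_DIM))
filters = {w: rng.normal(size=(N_FILTERS, w, EMBED_DIM)) for w in FILTER_WIDTHS}
w_out = rng.normal(size=(N_FILTERS * len(FILTER_WIDTHS),))
b_out = 0.0

def text_cnn_score(seq: str) -> float:
    """Forward pass: embed residues, convolve with several window widths,
    apply ReLU and max-over-time pooling, then a sigmoid-activated linear layer."""
    x = embedding[[AA_INDEX[aa] for aa in seq]]            # (L, EMBED_DIM)
    pooled = []
    for w, bank in filters.items():
        # "valid" convolution: one activation per window position per filter
        acts = np.array([[np.sum(bank[f] * x[i:i + w])
                          for i in range(len(seq) - w + 1)]
                         for f in range(N_FILTERS)])       # (N_FILTERS, L-w+1)
        pooled.append(np.maximum(acts, 0).max(axis=1))     # ReLU + max-over-time
    feat = np.concatenate(pooled)                          # fixed-length feature
    return float(1.0 / (1.0 + np.exp(-(feat @ w_out + b_out))))
```

Fine-tuning would learn `filters`, `w_out`, and `b_out` (and typically adjust the BERT weights) from labeled AMP/non-AMP sequences; the sketch only shows the data flow that makes the classifier length-independent: embed, convolve at several widths, max-pool, classify.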

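The four reported evaluation measures are standard functions of the binary confusion matrix. A small helper, assuming AMPs are the positive class (the counts in the usage line below are made up for illustration):

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and the Matthews correlation
    coefficient (MCC) from a binary confusion matrix."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sens = tp / (tp + fn)   # recall on the positive (AMP) class
    spec = tn / (tn + fp)   # recall on the negative (non-AMP) class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sens, spec, mcc
```

For example, `classification_metrics(tp=40, fp=15, tn=35, fn=10)` gives accuracy 0.75, sensitivity 0.8, specificity 0.7, and MCC ≈ 0.503. Unlike accuracy, MCC stays informative when the positive and negative classes are imbalanced, which is common in AMP benchmark sets.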
Cite this article:

XU Xiaofang, YANG Chunde, SHU Kunxian, YUAN Xinpu, LI Mocheng, ZHU Yunping, CHEN Tao. An antimicrobial peptide recognition method based on BERT and Text-CNN[J]. Chinese Journal of Biotechnology, 2023, 39(4): 1815-1824.

History
  • Received: 2022-11-04
  • Accepted: 2023-02-17
  • Published online: 2023-04-14
  • Published: 2023-04-25