AcidBasePred: a protein acid-base tolerance prediction platform based on deep learning
Funding: National Key Research and Development Program of China (2021YFC2103500)

    Abstract:

    The structure and activity of enzymes are influenced by environmental pH. Understanding and distinguishing how enzymes adapt to extreme pH values is important both for elucidating their molecular mechanisms and for their industrial application. In this study, the ESM-2 protein language model was used to encode secreted proteins from microorganisms whose optimal pH is 9 or higher, or 5 or lower, which yielded 47 725 high-pH protein sequences and 66 079 low-pH protein sequences, respectively. On this basis, a deep learning model was constructed to discriminate protein acid-base tolerance from amino acid sequence alone. The model was significantly more accurate than other methods, with an overall accuracy of 94.8%, a precision of 91.8%, and a recall of 93.4% on the test set. A web prediction platform (https://enzymepred.biodesign.ac.cn) was also built, where users can submit the protein sequence of an enzyme directly and obtain a prediction of its acid-base tolerance. This study accelerates the application of enzymes in fields including biotechnology, pharmaceuticals, and the chemical industry, and provides a powerful tool for the rapid screening and optimization of industrial enzymes.
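The accuracy, precision, and recall quoted in the abstract are standard binary classification metrics. The following is a minimal sketch of how such values are computed from true and predicted labels. The function name `binary_metrics`, the toy labels, and the choice of which class counts as positive (here 1 = alkaline-tolerant, 0 = acid-tolerant) are assumptions made for illustration only; they are not taken from the paper and do not reproduce its data or results.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Return accuracy, precision, and recall for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    return {
        # Fraction of all predictions that are correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Fraction of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Fraction of true positives that the classifier recovers.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical toy labels: 1 = alkaline-tolerant, 0 = acid-tolerant (assumed encoding).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Precision and recall depend on which class is treated as positive, so figures like those in the abstract are only comparable across methods when the same labeling convention is used.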

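Cite this article

黄蓉, 张鹤渐, 吴敏, 门志月, 初环宇, 白杰, 常宏, 程健, 廖小平, 刘玉万, 宋亚囝, 江会锋. AcidBasePred: a protein acid-base tolerance prediction platform based on deep learning[J]. Chinese Journal of Biotechnology, 2024, 40(12): 4670-4681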

History
  • Received: 2024-03-24
  • Published online: 2024-12-25
  • Published: 2024-12-25