NCU Institutional Repository, National Central University: Item 987654321/93326


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/93326


    Title: A Deep Learning Approach with Language Model Embeddings and Imbalance Adjustment for Identifying Multi-Functional Antimicrobial Peptides
    Authors: Lin, Jun-Shen (林駿燊)
    Contributors: Department of Computer Science and Information Engineering
    Keywords: Imbalanced preprocessing; Language model embeddings; Multi-label classifier; Antimicrobial peptides; Imbalanced loss function
    Date: 2023-07-27
    Issue Date: 2024-09-19 16:53:52 (UTC+8)
    Publisher: National Central University
    Abstract: Antibiotic resistance is a serious problem faced by the world today, making the treatment of bacterial infections increasingly challenging. To address it, alternative therapeutic strategies need to be explored, and ongoing research and development of antimicrobial peptides (AMPs) hold tremendous potential for future antimicrobial therapies. However, most recent multi-label deep learning studies on AMPs focus on differentiating multiple functional activities without discussing preprocessing methods or how loss functions behave on imbalanced multi-label data; few studies use protein language model features, and those that do rarely compare model architectures. In this study, we employ algorithm adaptation methods and compare approaches to handling data imbalance, including loss functions, preprocessing techniques, language model embeddings, model architectures, and feature types, to find a deep learning classifier with superior overall performance for predicting five functional peptide activities. We adopt an asymmetric loss function and examine the impact of preprocessing. The results show that our proposed architecture achieves an Absolute True score of 0.625 and an Absolute False score of 0.118 in the overall evaluation for classifying activity against bacteria, mammalian cells, fungi, viruses, and cancer cells. In the per-label evaluation, multi-label undersampling raises the macro Balanced Accuracy from 0.780 to 0.801, and the preprocessing experiments underline how much data quantity matters for deep learning: preprocessing that filters instances by importance helps balance the data while removing only a small number of samples. Using the asymmetric loss function (ASL) during training also improves the predictive ability for minority labels, raising the overall Absolute True score from 0.601 to 0.625. As for the model, protein language model embeddings perform best with a Convolutional Neural Network (CNN) combined with a Bidirectional Long Short-Term Memory (BiLSTM) network, rather than with a simpler CNN-only architecture or a more complex Multi-Head Self-Attention mechanism; the combined architecture raises the overall Absolute True score from 0.614 and 0.592, respectively, to 0.625.
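    The architecture named in the abstract (protein language model embeddings fed through a CNN and a BiLSTM, ending in five sigmoid outputs) can be illustrated with a minimal PyTorch sketch. The embedding dimension, channel count, kernel size, hidden size, and mean pooling below are illustrative assumptions, not the thesis's reported hyperparameters.

        # Minimal sketch of a CNN + BiLSTM multi-label classifier over
        # per-residue protein language model embeddings (assumed shapes).
        import torch
        import torch.nn as nn

        class CnnBiLstmAmpClassifier(nn.Module):
            def __init__(self, emb_dim=1024, channels=128, lstm_hidden=64, n_labels=5):
                super().__init__()
                self.conv = nn.Conv1d(emb_dim, channels, kernel_size=5, padding=2)
                self.lstm = nn.LSTM(channels, lstm_hidden, batch_first=True,
                                    bidirectional=True)
                self.fc = nn.Linear(2 * lstm_hidden, n_labels)

            def forward(self, x):
                # x: (batch, seq_len, emb_dim) residue-level embeddings
                h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
                out, _ = self.lstm(h)            # (batch, seq_len, 2 * lstm_hidden)
                return self.fc(out.mean(dim=1))  # logits for the five activities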
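    The asymmetric loss function credited with lifting Absolute True from 0.601 to 0.625 matches, by name and keywords, the asymmetric loss (ASL) of Ridnik et al. for multi-label classification; the sketch below assumes that formulation, with illustrative values for the focusing exponents and the probability margin.

        # Sketch of an asymmetric loss (ASL); gamma_neg, gamma_pos and clip
        # are illustrative defaults, not the thesis's settings.
        import torch

        def asymmetric_loss(logits, targets, gamma_neg=4.0, gamma_pos=0.0,
                            clip=0.05, eps=1e-8):
            p = torch.sigmoid(logits)
            p_neg = (p - clip).clamp(min=0)  # shift negative probabilities by a margin
            pos = targets * (1 - p) ** gamma_pos * torch.log(p.clamp(min=eps))
            neg = (1 - targets) * p_neg ** gamma_neg * torch.log((1 - p_neg).clamp(min=eps))
            return -(pos + neg).sum(dim=1).mean()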
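    The abstract does not specify the multi-label undersampling procedure that raised macro Balanced Accuracy from 0.780 to 0.801. One common heuristic in the spirit of ML-RUS (Charte et al.) drops a fraction of the samples whose positive labels are all majority labels; the function below, including the keep_frac parameter, is a hypothetical illustration rather than the thesis's method.

        # Hypothetical multi-label undersampling: remove some samples that
        # carry only majority labels, leaving minority-label samples intact.
        import numpy as np

        def multilabel_undersample(Y, keep_frac=0.7, seed=0):
            # Y: (n_samples, n_labels) binary indicator matrix
            rng = np.random.default_rng(seed)
            majority = Y.sum(axis=0) > Y.sum(axis=0).mean()
            only_majority = (Y[:, ~majority].sum(axis=1) == 0) & (Y.sum(axis=1) > 0)
            candidates = np.flatnonzero(only_majority)
            drop = rng.choice(candidates, size=int(len(candidates) * (1 - keep_frac)),
                              replace=False)
            return np.setdiff1d(np.arange(len(Y)), drop)  # indices of rows to keep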
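    The Absolute True and Absolute False scores quoted above are standard multi-label metrics: subset accuracy, and the normalized size of the symmetric difference between predicted and true label sets (the definitions assumed here follow Chou-style multi-label evaluation). A sketch under that assumption:

        # Assumed definitions of the two overall metrics in the abstract.
        import numpy as np

        def absolute_true(y_true, y_pred):
            # fraction of samples whose predicted label set is exactly correct
            return float(np.mean(np.all(y_true == y_pred, axis=1)))

        def absolute_false(y_true, y_pred):
            # mean of (|union| - |intersection|) / n_labels over samples
            union = np.logical_or(y_true, y_pred).sum(axis=1)
            inter = np.logical_and(y_true, y_pred).sum(axis=1)
            return float(np.mean((union - inter) / y_true.shape[1]))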
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

