dc.description.abstract | Antibiotic resistance is a serious global problem that makes the treatment of bacterial infections increasingly challenging, so alternative therapeutic strategies need to be explored. Ongoing research and development of antimicrobial peptides (AMPs) hold tremendous potential for future antimicrobial therapies. However, most recent studies on multi-label deep learning for AMPs focus on differentiating multi-functional classes without discussing preprocessing methods or the differences among loss functions for imbalanced multi-label data. Moreover, there is little research utilizing protein language model features, and existing studies lack comparisons of model architectures built on language model embedding features. To analyze these differences and identify stronger approaches, this study employs algorithm adaptation methods and compares various ways of handling data imbalance, including loss functions, preprocessing techniques, language model embeddings, model architectures, and feature types. The goal is to find a deep learning classifier with superior overall performance for predicting five different functional activities of peptides. We utilize asymmetric loss functions and observe the impact of preprocessing. The results show that our proposed model architecture achieves an absolute true score of 0.625 and an absolute false score of 0.118 in the overall evaluation for classifying activity against bacteria, mammalian cells, fungi, viruses, and cancer cells. Regarding per-label evaluation, employing multi-label undersampling improves the macro balanced accuracy (BA) from 0.780 to 0.801. We also observe the influence of data quantity on deep learning through the preprocessing methods: preprocessing that selects instances by their importance can achieve data balance while removing only a small amount of data.
Additionally, using an asymmetric loss function (ASL) during training improves predictive ability on minority labels, raising the overall absolute true score from 0.601 to 0.625. In terms of model architecture, the protein language model embedding layer performs best when combined with a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory (BiLSTM) network, rather than with a simple CNN architecture or a more complex Multi-Head Self-Attention mechanism; this architecture improves the overall evaluation accuracy from 0.614 and 0.592 to 0.625. | en_US |