| Abstract: | Protein phosphorylation is a fundamental post-translational modification that regulates nearly all cellular signaling pathways and plays critical roles in diverse biological processes such as cell proliferation, metabolism, differentiation, and apoptosis. Despite extensive high-throughput phosphoproteomics research, fewer than 5% of experimentally validated phosphorylation sites have been assigned a specific upstream kinase, a knowledge gap that limits our understanding of phosphorylation-mediated signaling pathways and related disease mechanisms. Traditional computational approaches rely primarily on sequence motifs or local sequence features; they generalize poorly and ignore broader biological context and network-level relationships. 
In this study, we develop an inductive computational framework that integrates heterogeneous graph attention networks (GAT) with a pretrained protein language model (Evolutionary Scale Modeling version 2, ESM2) to systematically predict kinase-substrate phosphorylation relationships. The proposed model constructs a heterogeneous graph in which kinases and phosphosites are distinct node types, connected by experimentally validated kinase-substrate interactions and by similarity-based edges derived from biologically informed embeddings. ESM2 provides rich, high-dimensional embeddings that capture evolutionary, biochemical, and structural properties of proteins and phosphopeptides. The GAT then dynamically aggregates these embeddings, learning complex kinase-substrate interaction patterns within local and global graph contexts and enabling inductive inference for novel kinase-substrate pairs. Evaluated on curated benchmark datasets with negative sampling strategies, the model achieved an area under the receiver operating characteristic curve (AUC) of 0.9635 on an independent test set, exceeding state-of-the-art tools such as Phosformer and KinasePhos 3.0. Further analyses showed that the model generalizes well across diverse kinase families, including the CDK and MAPK groups. In biological case studies of MAP3K10-mediated SMAD5 phosphorylation and CDC7-mediated AAAS phosphorylation, multi-layered validation (pathway analyses, cross-cohort transcriptomic correlations, clinical outcome assessments, and peptide-protein structural docking) supports these computationally predicted kinase-substrate interactions as biologically reasonable, experimentally testable hypotheses. 
In conclusion, by integrating deep learning, sequence-informed protein embeddings, and graph-based inductive reasoning, the proposed framework improves kinase-substrate phosphorylation site prediction. It helps bridge the gap between phosphoproteomic discoveries and their biological interpretation, facilitating the identification of novel signaling mechanisms and therapeutic targets. |
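
The abstract describes two building blocks: similarity-based edges derived from embeddings, and attention-weighted aggregation over kinase-substrate edges. The following is a minimal NumPy sketch of those two ideas only, not the implementation used in this work. Random vectors stand in for ESM2 embeddings, the k-nearest-neighbour rule for similarity edges is an assumption, and all function names, dimensions, and the toy edge list are hypothetical.

```python
import numpy as np

def knn_similarity_edges(emb, k):
    """Link each row i to its k most cosine-similar rows j (j != i)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)           # exclude self-matches
    nbrs = np.argsort(-sim, axis=1)[:, :k]   # top-k columns per row
    return [(i, int(j)) for i in range(emb.shape[0]) for j in nbrs[i]]

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_aggregate(h_dst, h_src, edges, w_dst, w_src, a):
    """One single-head graph-attention step over (dst_index, src_index) edges.

    Each node type has its own projection matrix, as in a heterogeneous graph.
    Returns updated destination features and the attention weight per edge.
    """
    z_dst, z_src = h_dst @ w_dst, h_src @ w_src
    out = np.zeros_like(z_dst)
    alphas = {}
    for d in range(h_dst.shape[0]):
        srcs = [s for (dd, s) in edges if dd == d]
        if not srcs:
            continue  # destinations without neighbours keep a zero vector
        scores = np.array(
            [leaky_relu(a @ np.concatenate([z_dst[d], z_src[s]])) for s in srcs]
        )
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()              # softmax over the neighbours of d
        out[d] = sum(w * z_src[s] for w, s in zip(weights, srcs))
        alphas.update({(d, s): float(w) for w, s in zip(weights, srcs)})
    return out, alphas

# Toy data: random stand-ins for ESM2 embeddings (assumption, not real data).
rng = np.random.default_rng(0)
kinase_emb = rng.normal(size=(4, 16))
site_emb = rng.normal(size=(6, 16))

# Hypothetical known kinase-substrate pairs as (site_index, kinase_index).
known_edges = [(0, 0), (0, 1), (1, 1), (2, 2), (3, 2), (3, 3), (4, 0), (5, 3)]
site_site_edges = knn_similarity_edges(site_emb, k=2)

hidden = 8
w_site = rng.normal(size=(16, hidden))
w_kinase = rng.normal(size=(16, hidden))
a = rng.normal(size=2 * hidden)

site_out, alpha = attention_aggregate(
    site_emb, kinase_emb, known_edges, w_site, w_kinase, a
)
```

Here each phosphosite node aggregates projected kinase features, weighted by a softmax over its own incoming edges. A trained model would learn `w_site`, `w_kinase`, and `a` rather than sampling them, and would typically stack several such layers across every edge type, including the site-site similarity edges built above.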