<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
  <channel>
    <title>DSpace community: 系統生物與生物資訊研究所</title>
    <link>https://ir.lib.ncu.edu.tw/handle/987654321/194</link>
    <description />
    <items>
      <rdf:Seq>
        <rdf:li resource="https://ir.lib.ncu.edu.tw/handle/987654321/99285" />
        <rdf:li resource="https://ir.lib.ncu.edu.tw/handle/987654321/97685" />
        <rdf:li resource="https://ir.lib.ncu.edu.tw/handle/987654321/97681" />
        <rdf:li resource="https://ir.lib.ncu.edu.tw/handle/987654321/97677" />
      </rdf:Seq>
    </items>
  </channel>
  <textInput>
    <title>The community's search engine</title>
    <description>Search the Channel</description>
    <name>s</name>
    <link>https://ir.lib.ncu.edu.tw/simple-search</link>
  </textInput>
  <item rdf:about="https://ir.lib.ncu.edu.tw/handle/987654321/99285">
    <title>不同資料結構條件下之基因體分析流程行為研究;Pipeline Behavior of Genomic Analysis under Different Data Structural Conditions</title>
    <link>https://ir.lib.ncu.edu.tw/handle/987654321/99285</link>
    <description>title: 不同資料結構條件下之基因體分析流程行為研究;Pipeline Behavior of Genomic Analysis under Different Data Structural Conditions abstract: 本論文聚焦於高維基因體分析中之方法學規範與分析流程行為，研究目的並非探討特定疾病之致病機制，亦非鑑別個別遺傳變異，而是系統性檢視在不同資料尺度與結構條件下，基因體分析流程如何維持推論穩定性、方法一致性與驗證嚴謹性。研究中將疾病視為具不同統計特徵的分析載體，藉以觀察分析流程在異質資料環境中的行為調節與適用邊界。
本研究整合三類資料尺度之分析情境，包括系統層級的臨床表型共病網絡、小樣本高維基因體預測模型，以及大型族群基因體資料庫中的全基因組關聯分析與多基因風險評估。透過明確區分資料來源之結構特性與研究目的，並將其對應至不同分析模組，本論文避免將單一分析策略過度外推至不相容的資料條件，並清楚界定各模組在整體分析架構中的角色。
在大樣本且高度類別不平衡的資料情境中，本研究檢視分析流程於特徵蒸餾、不平衡處理與決策閾值設定下的行為穩定性；而在樣本數受限且高維特徵並存的情境中，則以模型行為的一致性與過度擬合風險作為主要評估焦點。研究結果顯示，分析流程的行為表現高度受到資料結構條件所調節，不同設計元件在各情境中呈現出功能角色的轉換，反映其在穩定推論與抑制偏誤上的方法學作用。
基於上述觀察，本論文強調模型效能指標應被理解為特定資料結構與決策設定下的行為描述，而非可跨情境直接比較的預測能力衡量。整體而言，本研究建立了一套以流程行為為核心的高維基因體分析方法學架構，明確區分描述性結構分析、模型行為評估與生物學詮釋之間的界線，並為後續在不同資料條件下進行基因體分析提供一個可重現、可審視且具調節彈性的研究基礎。
;This dissertation focuses on the methodological norms and behavioral properties of analytical pipelines in high-dimensional genomic analysis. Rather than aiming to elucidate disease-specific pathogenic mechanisms or to identify individual causal genetic variants, the primary objective is to systematically examine how genomic analysis pipelines maintain inferential stability, methodological consistency, and validation rigor across heterogeneous data scales and structural conditions. In this framework, diseases are treated as analytical carriers characterized by distinct statistical properties, allowing pipeline behavior and applicability boundaries to be investigated under varying data environments.
The study integrates three analytical scenarios spanning different data scales: system-level clinical phenotype comorbidity networks, small-sample high-dimensional genomic prediction models, and large population-based biobank data supporting genome-wide association analyses and polygenic risk assessment. By explicitly distinguishing the structural characteristics and analytical purposes of each data source and mapping them to corresponding analytical modules, this work avoids overextending single analytical strategies to incompatible data conditions and clearly delineates the functional roles of individual modules within the overall framework.
Under large-sample settings with pronounced class imbalance, the dissertation evaluates pipeline stability with respect to feature distillation, imbalance handling, and decision threshold configuration. In contrast, under sample-limited conditions with high-dimensional feature spaces, the primary focus shifts to the consistency of model behavior and the mitigation of overfitting risk. The results demonstrate that pipeline behavior is strongly modulated by data structure, with individual design components exhibiting role shifts across scenarios, reflecting their methodological functions in stabilizing inference and controlling structural bias.
Based on these observations, this dissertation emphasizes that model performance metrics should be interpreted as descriptive indicators of pipeline behavior under specific data structures and decision settings, rather than as measures of predictive capability that are directly comparable across contexts. Overall, this work establishes a pipeline-centered methodological framework for high-dimensional genomic analysis, clearly separating descriptive structural analysis, model behavior evaluation, and biological interpretation. The proposed framework provides a reproducible, inspectable, and adaptable foundation for genomic analyses conducted under diverse data conditions.
</description>
  </item>
  <item rdf:about="https://ir.lib.ncu.edu.tw/handle/987654321/97685">
    <title>AutoGNN：以遺傳演算法驅動的圖神經網路，用於大規模人口之發病預測;AutoGNN: Genetic-Algorithm-Optimized Graph Neural Networks for Population-Scale Disease Onset Prediction</title>
    <link>https://ir.lib.ncu.edu.tw/handle/987654321/97685</link>
    <description>title: AutoGNN：以遺傳演算法驅動的圖神經網路，用於大規模人口之發病預測;AutoGNN: Genetic-Algorithm-Optimized Graph Neural Networks for Population-Scale Disease Onset Prediction abstract: 人口層級的第二型糖尿病（T2DM）「新發」風險預測，常受限於類別不平衡、特徵異質性，以及忽略關係結構的 i.i.d. 式流程。本文提出 AutoGNN——一個以中介資料（metadata）驅動的框架，將受試者隊列表徵為「人口圖」，並在固定運算預算下，對 GCN／GAT／GIN 進行「雙目標」遺傳式搜尋，同步選擇超參數與特徵區塊。研究族群來自台灣生物銀行（TWB；2012–2024）。新發風險隊列僅納入基線為非糖尿病者（N＝35,016；陽性＝1,187）；評估採嚴格分割與「病例錨定」的年齡—性別配對，並在 0→1（轉陽）與 0→0（持續陰性）上進行魯棒測試。

首次訓練的 AUROC 接近 SOTA：GCN 0.890、GIN 0.887、GAT 0.881，與多數納入近診斷等級檢驗之生物銀行／EHR 模型相當。相同組態重訓結果相近（典型 |∆|≈0.02–0.05，標準差適中）；隨機打散標籤的控制組則趨近機率水準（AUROC 約 0.45–0.60），顯示訊號為真，非種子運氣或洩漏。在相同預算下，MLP 偶有跑次在 AUROC 略勝，但 GCN 的整體準確度更佳，凸顯「關係歸納偏置」的價值。對 AUROC&gt;0.78 的模型進行 persona 分析（性別×年齡；教育分群）顯示：中年女性表現最佳；年長男性表現可接受且可監測；同時以 macro-/worst-F1 指標防止子群失效。GNNExplainer 強調 HBA1C 與空腹血糖在各 persona 中皆具關鍵性；而體態（腰臀比、BMI）與血脂（TG）在年輕男性與年長女性的權重更高——對臨床閾值設定與校準具有參考價值。

AutoGNN 在接近 SOTA 的區辨力之上，納入可重現性「衛生學」：固定分割／預算、重複實驗之平均與標準差、負向控制；並提供透明的次族群報告與「結構感知」的可稽核解釋，適合實務上線與審核。該框架亦可自然延伸至存活分析目標、聯邦式訓練，以及知識引導的拓樸學習。

關鍵詞： 第二型糖尿病；風險預測；生物銀行；圖神經網路；遺傳演算法；超參數優化；穩健性；公平性；校準；GNNExplainer；台灣生物銀行;Population-scale prediction of incident type 2 diabetes mellitus (T2DM) is challenged by class imbalance, feature heterogeneity, and i.i.d. pipelines that ignore relational structure. We present AutoGNN, a metadata-driven framework that casts the cohort as a population graph and runs fixed-budget, dual-objective genetic search over GCN/GAT/GIN, jointly selecting hyperparameters and feature blocks. The study population derives from Taiwan Biobank (TWB; 2012–2024). The incident-risk cohort includes baseline non-diabetics (N = 35,016; positives = 1,187); evaluation uses strict splits with case-anchored age–sex matching and a robust test on 0→1 (positive) vs. 0→0 (negative).
First-run AUROC approaches state-of-the-art (SOTA): 0.890 (GCN), 0.887 (GIN), 0.881 (GAT), comparable to widely cited biobank/EHR models that often include near-diagnostic labs [49, 71]. Same-config retrains stay close (typical |∆| ≈ 0.02–0.05; modest SD), while shuffled-label controls collapse toward chance (AUROC ∼ 0.45–0.60), indicating genuine signal rather than lucky seeds or leakage. Under equal budgets, MLP may edge AUROC in some runs, but GCN yields better accuracy, underscoring the value of relational inductive bias. Persona analyses (sex × age; education clusters) for models with AUROC &gt; 0.78 show strongest performance in middle-aged females and acceptable, monitorable performance in older males; macro-/worst-F1 guard against subgroup failure. GNNExplainer highlights HBA1C and fasting glucose across personas, with anthropometry (WHR, BMI) and lipids (TG) weighing more in younger males and older females, which is useful for thresholding and calibration.
AutoGNN pairs near-SOTA discrimination with reproducibility hygiene (fixed splits/budgets, repeat means/SDs, negative controls), transparent subgroup reporting, and structure-aware explanations suitable for audit and deployment; it readily extends to survival objectives, federated training, and knowledge-guided topology learning.
Keywords: type 2 diabetes; risk prediction; biobank; graph neural networks; genetic algorithm; hyperparameter optimization; robustness; fairness; calibration; GNNExplainer; Taiwan Biobank.
</description>
  </item>
  <item rdf:about="https://ir.lib.ncu.edu.tw/handle/987654321/97681">
    <title>MS-Ion: Unveiling Ion Associations in PTM-Enriched Proteomic Data</title>
    <link>https://ir.lib.ncu.edu.tw/handle/987654321/97681</link>
    <description>title: MS-Ion: Unveiling Ion Associations in PTM-Enriched Proteomic Data abstract: 診斷離子是指修飾側鏈的獨特碎片或中性損失，其特徵可用來區分修飾肽段與其未修飾的對應物及其他蛋白質修飾（PTMs）。然而，現有的診斷離子挖掘工具多侷限於偵測單一離子，往往忽略在更廣泛斷裂模式中可能存在的診斷離子關聯性。為了解決此問題，我們開發了 MS-Ion，一款應用 FP-Growth 演算法的軟體工具。MS-Ion 可接受來自 MaxQuant 及 MSFragger/FragPipe 等傳統蛋白質資料庫搜尋工具所產生的 PSM（肽段與光譜匹配）以及對應的 mzML 檔作為輸入，並根據所指派的修飾類型對 PSM 進行分類。透過建立 FP-tree 並追蹤共現離子路徑，MS-Ion 能揭示與修飾類型相關的獨特離子關聯。本研究使用了四組資料集：一組乙醯化富集的 LUAD 資料集（PDC000224）、一組磷酸化富集的 LUAD 資料集（PDC000149）、一組賴胺酸乙醯化的合成肽段資料集（PXD009449），以及一組酪胺酸磷酸化的合成肽段資料集（PXD009449）。針對乙醯化，MS-Ion 共辨識出六種關鍵離子模式，其中最具代表性的是乙醯化賴胺酸的診斷離子（m/z 126.0913 與 143.1179），其信心值在 LUAD 與合成資料集中皆高於 95%。此外也偵測到與之共現的乙醯化 y 離子（LUAD 資料集中的 m/z 189.1234 及合成資料集中的 m/z 317.2183）。在 LUAD 資料集中結合 m/z 126.0913、143.1179 和 189.1234 三個診斷離子，或在合成資料集中結合 m/z 126.0913、143.1179 和 317.2183，可達到 100% 的特異性，有效區分真正的修飾 PSM 與誤分類的未修飾 PSM。針對磷酸化，MS-Ion 驗證 m/z 216.0426 為酪胺酸磷酸化診斷離子，與過去研究結果相符。該離子與八個磷酸化的 b/y 離子在 LUAD 與合成資料集中共同出現。另有一個特徵離子（m/z 439.1701）頻繁出現在絲胺酸磷酸化的 PSM 中；而對於蘇胺酸的磷酸化則發現新的組合模式（m/z 122.0288 與 195.0815），在複合磷酸化位點中也觀察到 m/z 216.0426 與 195.0815 的共現。所有與磷酸化相關的離子模式皆展現出極高的特異性（90–100%）。統計分析（包含卡方獨立性檢定，p &lt; .01）及 UMAP 降維方法皆進一步驗證所偵測離子模式的顯著性與區辨性。MS-Ion 具備互動式介面，使用者可自訂支持度與信心門檻，是一款可協助研究人員進行 PTM 探索的高效能工具。;Diagnostic ions are characterized by unique fragments of modified side chains or neutral losses, distinguishing modified peptides from their unmodified counterparts and other PTMs. However, current diagnostic ion mining tools are limited to detecting single ions, often ignoring potential associations between diagnostic ions across broader fragmentation patterns. To address this issue, we developed MS-Ion, a software tool that applies the FP-Growth algorithm. MS-Ion takes PSMs from conventional protein database search tools, such as MaxQuant and MSFragger/FragPipe, along with the corresponding mzML files as input. It first separates PSMs based on their assigned modifications.
By constructing an FP-tree and tracing co-occurring ion paths, MS-Ion uncovers unique associations linked to modifications. Four datasets were used, including an acetylation-enriched LUAD dataset (PDC000224), a phosphorylation-enriched LUAD dataset (PDC000149), a lysine-acetylated synthetic peptide dataset (PXD009449), and a tyrosine-phosphorylated synthetic peptide dataset (PXD009449). For acetylation, MS-Ion identified six key ion patterns. The most significant involved acetylated lysine diagnostic ions (m/z 126.0913 and 143.1179), with confidence values above 95% across the LUAD and synthetic datasets. Co-occurring acetylated y ions (m/z 189.1234 in the LUAD dataset and 317.2183 in the synthetic dataset) were also detected. By combining the three diagnostic ions (m/z 126.0913, 143.1179, and 189.1234 in LUAD datasets; m/z 126.0913, 143.1179, and 317.2183 in synthetic datasets), MS-Ion achieved 100% specificity, effectively distinguishing true positives from misclassified unmodified PSMs. For phosphorylation, MS-Ion confirmed that m/z 216.0426 represented tyrosine phosphorylation, corresponding with previous studies. This ion co-occurred with eight phosphorylated b/y ions across LUAD and synthetic peptide datasets. Additionally, a unique ion (m/z 439.1701) was frequently observed in serine phosphorylation PSMs, while novel phosphorylation patterns for threonine (m/z 122.0288 with 195.0815) and combined phosphorylation sites (m/z 216.0426 with 195.0815) were discovered. All phosphorylation-related patterns showed exceptional specificity (90-100%). Statistical analyses, including Chi-square independence tests (p &lt; .01) and the UMAP algorithm, confirmed the significance and distinctiveness of the detected ion patterns. With an interactive interface allowing customizable support and confidence thresholds, MS-Ion provides researchers with a high-performance tool for PTM discovery.
</description>
  </item>
  <item rdf:about="https://ir.lib.ncu.edu.tw/handle/987654321/97677">
    <title>Interpretation and Knowledge Extraction of Traditional Chinese Medicine Classics in Text Mining</title>
    <link>https://ir.lib.ncu.edu.tw/handle/987654321/97677</link>
    <description>title: Interpretation and Knowledge Extraction of Traditional Chinese Medicine Classics in Text Mining abstract: 本研究結合古代中醫知識與現代計算方法，運用資料與文本探勘、Apriori 演算法與網絡分析，挖掘《普濟方》等文獻中的草藥組合與應用模式。透過關鍵字提取、命名實體識別及 PubMed 基因資料交叉分析，探索中醫在抗菌與糖尿病等疾病治療的潛力。研究顯示：（1）歷史方劑強調藥效與風味；（2）地榆-澤瀉、苦參-生薑等組合具抗菌活性；（3）「消渴門」草藥與代謝途徑高度相關。另應用資料探勘技術提出潛在新配方，結合分子預測工具分析其化學成分與活性，展現中醫與現代生物資訊整合之可能。研究提供新穎資料驅動框架，助攻個人化醫療與永續藥物發現。;This study explores the potential of Traditional Chinese Medicine (TCM) through computational methods, integrating ancient wisdom with modern drug discovery and sustainability advancements. TCM's historical literature provides a valuable resource for analyzing classical texts like the Pu-Ji Fang through data mining, text mining, and network analysis. The main objective is to explore new therapeutic drug candidates, analyze herb usage patterns, and generate novel herbal formulations. One aspect investigates TCM's role in combating microbial infections by applying the Apriori algorithm and case studies to explore traditional remedies, while another examines its potential for treating widespread diseases like diabetes. The methods included a novel iterative keyword extraction method and association rules to identify key herb pairs from historical TCM texts, which were cross-referenced with pharmacogenomic data from PubMed. Named Entity Recognition (NER) and external knowledge graphs were used to analyze herbal formulas related to specific organs and diseases, such as &quot;XiaoKe&quot; (diabetes). The Apriori algorithm identified frequent herb combinations, while tools like DAVID analyzed herb-to-gene networks, revealing biological functions and therapeutic potentials. Apriori-based learning revealed novel herbal formulations from frequent textual patterns. An antimicrobial molecular prediction tool analyzed the chemical composition of these herbs to identify antimicrobial effects.
The integrated methods revealed insights into TCM: (1) Analysis of Pu-Ji Fang indicated that historical prescriptions emphasized medicinal value and flavor; (2) Herb combinations like DiYu → ZeXie and KuShen → ShengJiang demonstrated potential antimicrobial activity; (3) Network analysis of &quot;XiaoKe&quot; herbs highlighted associations with metabolic pathways, suggesting roles in regulation and metabolism. Additionally, the Apriori algorithm rapidly explored novel herbal combinations in ancient literature. Extensive data, including PubMed gene-herb entries, highlighted the potential of linking historical herbal knowledge with modern genetics. In conclusion, this study underscores the value of combining ancient TCM with modern science. Techniques such as data mining and network analysis deepen TCM insights and support new drug discovery. These methods may aid in personalized medicine and the development of sustainable treatments for infections and metabolic diseases.
</description>
  </item>
</rdf:RDF>

