| Abstract: | This dissertation focuses on the methodological norms and behavioral properties of analytical pipelines in high-dimensional genomic analysis. Its primary objective is not to elucidate disease-specific pathogenic mechanisms or to identify individual causal genetic variants, but to examine systematically how genomic analysis pipelines maintain inferential stability, methodological consistency, and validation rigor across heterogeneous data scales and structural conditions. Diseases are treated as analytical vehicles with distinct statistical properties, so that pipeline behavior and the boundaries of applicability can be observed under heterogeneous data environments. The study integrates three analytical scenarios at different data scales: system-level clinical phenotype comorbidity networks, small-sample high-dimensional genomic prediction models, and large population-based biobank data supporting genome-wide association analyses and polygenic risk assessment. 
By explicitly distinguishing the structural characteristics and analytical purpose of each data source and mapping each to a corresponding analytical module, this work avoids extending any single analytical strategy to data conditions it does not fit, and it clearly delineates the role of each module within the overall framework. In large-sample settings with pronounced class imbalance, the dissertation evaluates pipeline stability with respect to feature distillation, imbalance handling, and decision threshold configuration. In sample-limited settings with high-dimensional feature spaces, the focus shifts to the consistency of model behavior and the risk of overfitting. The results show that pipeline behavior is strongly modulated by data structure: individual design components change functional role from one scenario to another, reflecting their methodological function in stabilizing inference and suppressing bias. Based on these observations, the dissertation argues that model performance metrics should be interpreted as descriptions of pipeline behavior under a specific data structure and decision setting, not as measures of predictive capability that can be compared directly across contexts. Overall, this work establishes a pipeline-centered methodological framework for high-dimensional genomic analysis that clearly separates descriptive structural analysis, model behavior evaluation, and biological interpretation. The framework provides a reproducible, inspectable, and adaptable foundation for genomic analyses conducted under diverse data conditions. |