樣本選取及其集成式方法於醫療資料集之研究;Instance Selection and Its Ensemble on Medical Datasets

Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98333

Title:	樣本選取及其集成式方法於醫療資料集之研究;Instance Selection and Its Ensemble on Medical Datasets
Author:	吳文心;Wu, Wen-Hsin
Contributor:	Department of Information Management (資訊管理學系)
Keywords:	Instance Selection;Data Preprocessing;Ensemble Instance Selection;Data Mining;Medical Data
Date:	2025-07-22
Uploaded:	2025-10-17 12:38:29 (UTC+8)
Publisher:	National Central University (國立中央大學)
Abstract:	With the advancement of healthcare digitalization, the scale of medical data has grown rapidly, offering new opportunities for data analysis and decision-making. However, the increasing volume also introduces substantial noise and redundancy, which degrade model performance and computational efficiency. As a result, data preprocessing has become increasingly important, especially instance selection techniques, which can improve classification performance and reduce computational burden by retaining representative samples and removing uninformative data. This study focuses on medical datasets and evaluates four commonly used instance selection methods: Edited Nearest Neighbors (ENN), Condensed Nearest Neighbors (CNN), Iterative Partitioning Filter (IPF), and Genetic Algorithm (GA). It also designs two ensemble strategies, sequential and parallel, that integrate the strengths of different methods to improve the overall performance and stability of instance selection. The experiments are divided into two phases: the first compares the four methods in combination with three classifiers (KNN, SVM, and RF); the second applies the proposed ensemble strategies to the three better-performing methods. Multiple public medical datasets are used, and results are compared under five-fold cross-validation. Evaluation metrics include accuracy, AUC, data reduction rate, and time efficiency.
The results show that (1) ENN performs consistently well, while IPF and GA show advantages under specific conditions; (2) the parallel ensemble strategy generally outperforms the sequential one, especially union or intersection combinations of ENN and IPF, which balance classification performance and data reduction; and (3) the suitable strategy varies with dataset size: intersection-based strategies suit large datasets because they filter more aggressively, whereas small datasets should avoid over-reduction to prevent information loss. Through a systematic comparison of instance selection methods and the design of ensemble strategies, this study presents a scalable instance filtering framework and selection guidelines, aiming to provide methodological support for preprocessing in medical big data analysis and to improve model-building efficiency and predictive accuracy.
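The sequential and parallel ensemble strategies named in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation, which this record does not include. It assumes that "union" and "intersection" are taken over the sets of samples each selector keeps, and that "sequential" means each selector runs on the survivors of the previous one. `enn_keep_mask` is a plain NumPy version of Edited Nearest Neighbors with `k=3` chosen arbitrarily; `centroid_keep_mask` is a hypothetical stand-in for a second selector and is NOT IPF. The toy data and all function names are illustrative.

```python
import numpy as np

def enn_keep_mask(X, y, k=3):
    """ENN: keep a sample only if the majority label of its k nearest
    neighbors (excluding itself) equals its own label.
    Ties in the vote go to the smallest label (np.unique ordering)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances, shape (n, n)
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is never its own neighbor
    keep = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        nn = np.argsort(d[i])[:k]
        labels, counts = np.unique(y[nn], return_counts=True)
        keep[i] = labels[np.argmax(counts)] == y[i]
    return keep

def centroid_keep_mask(X, y):
    """Hypothetical stand-in selector (not IPF): keep a sample only if
    the nearest class centroid carries the sample's own label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)] == y

def parallel(X, y, selectors, mode="intersection"):
    """Run every selector on the full data, then combine the keep-masks.
    Assumption: union/intersection apply to the kept samples."""
    masks = [s(X, y) for s in selectors]
    combine = np.logical_and if mode == "intersection" else np.logical_or
    return combine.reduce(masks)

def sequential(X, y, selectors):
    """Run selectors one after another, each on the previous survivors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    idx = np.arange(len(X))
    for s in selectors:
        idx = idx[s(X[idx], y[idx])]
    keep = np.zeros(len(X), dtype=bool)
    keep[idx] = True
    return keep

def reduction_rate(keep):
    """Fraction of samples removed by a selection."""
    return float(1.0 - np.mean(keep))

# Toy 1-D data: two clean clusters, one mislabeled point (1.5),
# and one borderline point (5.0). Values are made up for illustration.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0], [1.5], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

selectors = [enn_keep_mask, centroid_keep_mask]
enn_mask = enn_keep_mask(X, y)
cen_mask = centroid_keep_mask(X, y)
inter = parallel(X, y, selectors, mode="intersection")
union = parallel(X, y, selectors, mode="union")
seq = sequential(X, y, selectors)

print("intersection reduction:", reduction_rate(inter))
print("union reduction:", reduction_rate(union))
print("sequential reduction:", reduction_rate(seq))
```

On this toy data the two selectors disagree only on the borderline point, so the intersection removes it while the union keeps it. That is the trade-off the abstract describes between stronger filtering and the risk of over-reduction on small datasets.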
Appears in Collections:	[Graduate Institute of Information Management] Master's and Doctoral Theses

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	47	View/Open

All items in NCUIR are protected by original copyright.
