Name 陳毅寰 (Yi-Huan Chen)   Department Department of Information Management
Thesis Title A Study of Divide-and-Conquer Instance Selection for Big Data Mining
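Abstract (Chinese) With the arrival of the big data era, the volume of data to be processed grows by leaps and bounds, and the noise in the data grows with it. We therefore perform instance selection before data mining, removing noise and retaining representative data, to safeguard the quality of the subsequent mining.
This study proposes DCIS, a distributed instance selection process framework. Using the divide-and-conquer concept, it reduces the problem to several subproblems, performs instance selection on each of them in turn, and finally performs one more round of pooled filtering to improve selection quality, so that different instance selection algorithms can all achieve better selection results within the DCIS framework.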
Abstract (English) In the big data era, data grows rapidly, and so does the noise it contains. Instance selection is therefore needed as a pre-processing step to pick out representative data before mining, so that the mining results remain reliable.
As the amount of data grows, the computational complexity of instance selection increases, which in turn affects the results of data selection and data mining. Additionally, no single instance selection algorithm provides the best result for every data set.
In this work, we propose a divide-and-conquer-based instance selection framework, namely DCIS. First, it breaks the original data set into smaller sub-datasets and arranges them into several groups. Second, it applies an instance selection algorithm to each group sequentially to obtain representative data. Last, it combines the selected parts into one set as the final result of instance selection.
We use small data sets to examine the performance of DCIS with different numbers of sub-datasets in the first step and different ways of combining the parts in the final step. Moreover, large-scale datasets are also used to assess the applicability of DCIS. The experimental results show that DCIS is a suitable framework for enhancing the performance of instance selection on both small and large-scale datasets.
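The three steps described in the abstract (divide, select per group, combine) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names `enn_select` and `dcis` are hypothetical, the random partitioning is an assumption, and the selector is Wilson's edited nearest neighbor rule swapped in as a simple stand-in for the DROP3 and IB3 algorithms listed in the table of contents. The optional final filtering pass follows the Chinese abstract's description of one more round of filtering after pooling.

```python
import random
from collections import Counter


def enn_select(points, labels, k=3):
    """Stand-in selector (edited nearest neighbor, NOT DROP3/IB3).

    Returns the indices of instances whose label agrees with the majority
    label of their k nearest neighbors (squared Euclidean distance).
    """
    kept = []
    for i, p in enumerate(points):
        # Distance from instance i to every other instance.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        votes = Counter(labels[j] for _, j in dists[:k])
        # An instance with no neighbors at all is dropped.
        if votes and votes.most_common(1)[0][0] == labels[i]:
            kept.append(i)
    return kept


def dcis(points, labels, n_parts=2, k=3, final_pass=True, seed=0):
    """Hypothetical divide-and-conquer wrapper around a selector.

    Returns the sorted original indices of the selected instances.
    """
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)  # assumption: random partitioning
    # Divide: deal the shuffled indices into n_parts groups.
    groups = [idx[g::n_parts] for g in range(n_parts)]
    merged = []
    # Conquer: run the selector on each group in turn.
    for group in groups:
        gp = [points[i] for i in group]
        gl = [labels[i] for i in group]
        merged.extend(group[j] for j in enn_select(gp, gl, k))
    # Combine: optionally filter the pooled selection once more.
    if final_pass:
        mp = [points[i] for i in merged]
        ml = [labels[i] for i in merged]
        merged = [merged[j] for j in enn_select(mp, ml, k)]
    return sorted(merged)
```

Calling `dcis(points, labels, n_parts=4)` on a list of feature tuples and a parallel list of class labels returns the indices to keep. Varying `n_parts` and `final_pass` loosely mirrors the two factors the abstract says were examined: the number of sub-datasets and the way the parts are combined.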
Keywords (Chinese) ★ big data
★ Divide and Conquer
★ data pre-processing (instance selection)
★ classification mining
Keywords (English) ★ big data
★ divide and conquer
★ instance selection
★ classification
Table of Contents Chinese Abstract
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Research Structure
Chapter 2 Literature Review
2.1 Big data
2.2 Divide and Conquer
2.3 Instance Selection
2.3.1 DROP3
2.3.2 IB3
2.4 Classification Technique
2.4.1 KNN (k-nearest neighbor classification)
2.4.2 SVM (Support Vector Machine)
2.5 Related Works
Chapter 3 Experimental Methods
3.1 Study 1
3.2 Study 2
3.3 Study 3
Chapter 4 Experimental Results
4.1 Experimental Setup
4.1.1 Datasets
4.1.2 Experimental Environment
4.1.3 Model Validation Criteria
4.2 Experimental Results
4.2.1 Study 1 Results
4.2.2 Study 2 Results
4.2.3 Study 3 Results
4.3 Discussion and Recommendations
Chapter 5 Conclusion
5.1 Results and Contributions
5.2 Limitations and Future Work
Advisor 蔡志豐 (Chih-Fong Tsai)   Approval Date 2016-8-5
