References
[1] Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt. ISBN 0-544-00269-5.
[2] Laney, D. (2001). 3D data management: Controlling data volume, velocity and variety. META Group Research Note, 6.
[3] Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Addison-Wesley.
[4] Pyle, D. (1999). Data Preparation for Data Mining. Morgan Kaufmann.
[5] Kotsiantis, S. B., Kanellopoulos, D., & Pintelas, P. E. (2006). Data preprocessing for supervised learning. International Journal of Computer Science, 1 (ISSN 1306-4428).
[6] Cano, J. R., Herrera, F., & Lozano, M. (2003). Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study. IEEE Transactions on Evolutionary Computation, 7(6), 561–575.
[7] Olvera-López, J. A., Carrasco-Ochoa, J. A., Martínez-Trinidad, J. F., & Kittler, J. (2010). A review of instance selection methods. Artificial Intelligence Review, 34(2), 133–143.
[8] Haro-García, A., & García-Pedrajas, N. (2009). A divide-and-conquer recursive approach for scaling up instance selection algorithms. Data Mining and Knowledge Discovery, 18(3), 392–418.
[9] Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1, 67–82.
[10] Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Ullah Khan, S. (2015). The rise of "big data" on cloud computing: Review and open research issues. Information Systems, 47, 98–115. doi:10.1016/j.is.2014.07.006
[11] Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.
[12] Zikopoulos, P., Parasuraman, K., Deutsch, T., Giles, J., & Corrigan, D. (2012). Harness the Power of Big Data: The IBM Big Data Platform. McGraw-Hill Professional.
[13] Berman, J. J. (2013). Introduction. In Principles of Big Data (pp. xix–xxvi). Morgan Kaufmann, Boston.
[14] Jie, L., Zheng, X., Yayun, J., & Rui, Z. (2014, August). The overview of big data storage and management. Paper presented at the 2014 IEEE 13th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC).
[15] Dijcks, J. P. (2012). Oracle: Big data for the enterprise. Oracle White Paper.
[16] Rugina, R., & Rinard, M. (2000, August). Recursion unrolling for divide and conquer programs. In Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2000), NY, USA, pp. 34–48.
[17] Domingos, P. (1996). Unifying instance-based and rule-based induction. Machine Learning, 24(2), 141-168.
[18] Derrac, J., García, S., & Herrera, F. (2010). A survey on evolutionary instance selection and generation.
[19] 譚磊 (2013). 大數據挖掘－從巨量資料發現別人看不到的秘密 [Big data mining: Discovering the secrets others cannot see in massive data]. 台北: 上奇時代.
[20] Leyva, E., González, A., & Pérez, R. (2015). Three new instance selection methods based on local sets: A comparative study with several approaches from a biobjective perspective. Pattern Recognition, 48, 1523–1537. doi:10.1016/j.patcog.2014.10.001
[21] Wilson, D. R., & Martinez, T. R. (2000). Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3), 257-286.
[22] Nikolaidis, K., Goulermas, J. Y., & Wu, Q. H. (2011). A class boundary preserving algorithm for data condensation. Pattern Recognition, 44(3), 704-715.
[23] Kuncheva, L. I., & Sánchez, J. S. (2008). Nearest neighbour classifiers for streaming data with delayed labelling. In Proceedings of the Eighth IEEE International Conference on Data Mining.
[24] Shmueli, G., Patel, N. R., & Bruce, P. C. (2010). Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner (2nd ed.). John Wiley & Sons.
[25] García, S., Derrac, J., Cano, J. R., & Herrera, F. (2012). Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3).
[26] Gou, J., Du, L., Zhang, Y., & Xiong, T. (2012). A new distance-weighted k-nearest neighbor classifier. Journal of Information & Computational Science, 9(6), 1429–1436.
[27] Ajmani, S., Jadhav, K., & Kulkarni, S. A. (2006). Three-dimensional QSAR using the k-nearest neighbor method and its interpretation. Journal of Chemical Information and Modeling, 46, 24–31.
[28] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York; Williams, B. K., Nichols, J. D., & Conroy, M. J. (2002). Analysis and Management of Animal Populations. London: Academic Press.
[29] Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999.
[30] Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.
[31] Dong, J.-X., Devroye, L., & Suen, C. Y. (2005). Fast SVM training algorithm with decomposition on very large data sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4), 603–618.
[32] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297. doi: 10.1007/BF00994018
[33] Kuhn, H. W. (2014). Nonlinear programming: A historical view. In Traces and Emergence of Nonlinear Programming (pp. 393–414). Springer.
[34] Chen, Z. Y., Tsai, C. F., Eberle, W., Lin, W. C., & Ke, W.-C. (2014). Instance selection by genetic-based biological algorithm. Soft Computing. doi:10.1007/s00500-014-1339-0
[35] Li, J., & Wang, Y. (2015). A new fast reduction technique based on binary nearest neighbor tree. Neurocomputing, 149(Part C), 1647–1657.
[36] Wang, S., Li, Z., Liu, C., Zhang, X., & Zhang, H. (2014). Training data reduction to speed up SVM training. Springer Science+Business Media, New York.
[37] Hamidzadeh, J., Monsefi, R., & Yazdi, H. S. (2015). IRAHC: Instance reduction algorithm using hyperrectangle clustering. Pattern Recognition, 48, 1878–1889.
[38] Triguero, I., Peralta, D., Bacardit, J., García, S., & Herrera, F. (2015). MRPR: A MapReduce solution for prototype reduction in big data classification. Neurocomputing, 150(20), 331–345.