Thesis Record 102423014: Detailed Information




Name	Chen-lin Yeh (葉貞麟)    Department	Information Management
Thesis Title	平行運算架構下之巨量資料探勘:分散式與雲端方法之比較
(Big Data Mining with Parallel Computing: A Comparison of Distributed and MapReduce Methodologies)
Related Theses
★ Building a sales forecasting model for commercial multifunction printers using data mining techniques
★ Applying data mining techniques to resource allocation forecasting: a case study of a computer OEM support unit
★ Applying data mining techniques to flight delay analysis in the airline industry: a case study of Company C
★ Security control of new products in the global supply chain: a case study of Company C
★ Data mining in the semiconductor laser industry: a case study of Company A
★ Applying data mining techniques to predicting warehouse storage time of air export cargo: a case study of Company A
★ Optimizing YouBike redistribution operations with data mining classification techniques
★ The impact of feature selection on different data types
★ Data mining for B2B corporate websites: a case study of Company T
★ Customer investment analysis and recommendations for financial derivatives: integrating clustering and association rule techniques
★ Building a computer-aided diagnosis model for liver ultrasound images using convolutional neural networks
★ An identity recognition system based on convolutional neural networks
★ A comparative error analysis of power-consumption imputation methods for energy management systems
★ Development of an employee sentiment analysis and management system
★ Data cleaning for the class imbalance problem: a machine learning perspective
★ Applying data mining techniques to passenger self-service check-in analysis: a case study of Airline C
  1. The full text of this electronic thesis is approved for immediate open access.
  2. The open-access full text is licensed to users only for personal, non-commercial retrieval, reading, and printing for academic research purposes.
  3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.

Abstract (Chinese)	The big data wave has arrived, and readily available information is now part of everyday life. Data is growing faster than Moore's Law, which poses a major challenge to our ability to harness it; the greatest challenge is identifying which techniques can make better use of big data. Big data is in essence a concept extended from data mining: the goal is to extract the value hidden in massive data in a timely manner. The spread of the Internet and the development of cloud computing break through the limitations of traditional data mining and allow large-scale mining to be computed efficiently, shortening computation time. Data scientist John Rauser defines big data as any amount of data that exceeds the processing capacity of a single computer; with today's standalone hardware, computation speed falls short of requirements and storage capacity is too small. This study therefore improves on the traditional data mining environment and workflow. Its purpose is to analyze two computing approaches, a distributed architecture and a cloud MapReduce architecture, which pool computing resources to classify large datasets, expanding storage capacity and coupling it with strong computing power to speed up mining. In addition, instance selection is applied to filter out noisy data and reduce data volume, in order to examine whether such preprocessing is a necessary step for big data. The goal is to determine which architecture and workflow achieve the shortest execution time without sacrificing accuracy. The experimental results show that a cloud architecture built on a single large host, with 1 to 20 machines and no data preprocessing, classifying directly with an SVM classifier, handles large datasets most efficiently. Four large datasets of up to 500,000 records from the UCI repository and the KDD Cup are used to demonstrate the effectiveness of the proposed architectures and workflows.
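The workflow comparison described above, classifying the data directly versus first reducing it with instance selection, can be illustrated with a minimal single-machine sketch. Wilson's edited nearest neighbour (ENN) rule is used below as a simple stand-in for the IB3, DROP3, and GA selectors examined in the thesis, and the synthetic dataset and scikit-learn classifiers are illustrative assumptions rather than the actual experimental setup.

```python
# Minimal sketch (assumptions noted above): direct SVM classification versus
# instance selection followed by SVM, comparing accuracy, runtime, and the
# resulting training-set size.
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC


def enn_select(X, y, k=3):
    """Keep only instances whose class matches the majority of their k neighbours."""
    # query k+1 neighbours and drop column 0, which is the point itself
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X, return_distance=False)[:, 1:]
    agree = (np.take(y, idx) == y[:, None]).sum(axis=1) > k / 2
    return X[agree], y[agree]


def run(X_tr, y_tr, X_te, y_te, preprocess):
    start = time.time()
    if preprocess:                      # optional instance-selection step
        X_tr, y_tr = enn_select(X_tr, y_tr)
    acc = LinearSVC(dual=False).fit(X_tr, y_tr).score(X_te, y_te)
    return acc, time.time() - start, len(y_tr)


if __name__ == "__main__":
    X, y = make_classification(n_samples=50_000, n_features=40, random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

    for preprocess in (False, True):
        acc, secs, n = run(X_tr, y_tr, X_te, y_te, preprocess)
        label = "instance selection + SVM" if preprocess else "direct SVM"
        print(f"{label:26s} accuracy={acc:.4f}  time={secs:.1f}s  training size={n}")
```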
Abstract (English)	The dataset size is growing faster than Moore's Law, and the big data frenzy is sweeping through our daily life. The challenge of managing massive amounts of big data lies in the techniques that make the data usable. The big data concept encapsulates much of the essence of data mining, which aims to discover the most important and relevant knowledge and turn it into valuable information. Advances in Internet technology and the popularity of cloud computing can break through the time-efficiency limitations of traditional data mining methods on very large-scale datasets. Big data mining technology should create the conditions for efficiently mining massive amounts of data with the aim of producing useful information in real time. The data scientist John Rauser defines big data as "any amount of data that's too big to be handled by one computer." A standalone machine has neither enough memory nor enough storage capacity to handle big data efficiently. Therefore, big data mining can instead be performed via conventional distributed and MapReduce methodologies. This raises two research questions: do the distributed and MapReduce methodologies differ in mining accuracy and efficiency over large-scale datasets, and does big data mining require data preprocessing? The experimental results on four large-scale datasets show that using MapReduce without data preprocessing requires the least processing time and allows the classifier to achieve the highest classification accuracy regardless of the number of computing nodes, except on a class-imbalanced dataset.
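The MapReduce methodology referred to in both abstracts partitions the training data, trains in parallel, and then combines the partial results. The sketch below imitates that map (train per partition) and reduce (majority vote) pattern on a single machine; the process pool, scikit-learn SVMs, and voting combiner are illustrative assumptions, not the cluster implementation used in the experiments.

```python
# Minimal sketch (assumptions noted above) of a MapReduce-style classification:
# "map" trains one SVM per data partition in parallel, "reduce" combines the
# sub-models by majority vote on the test set.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def train_partition(part):
    """Map phase: fit an SVM on one partition of the training data."""
    X_part, y_part = part
    return LinearSVC(dual=False).fit(X_part, y_part)


def vote(models, X):
    """Reduce phase: majority vote over the per-partition sub-models."""
    votes = np.stack([m.predict(X) for m in models])          # (n_models, n_samples)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)


if __name__ == "__main__":
    X, y = make_classification(n_samples=50_000, n_features=40, random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

    n_nodes = 4                                                # simulated worker nodes
    parts = list(zip(np.array_split(X_tr, n_nodes), np.array_split(y_tr, n_nodes)))

    with ProcessPoolExecutor(max_workers=n_nodes) as pool:     # parallel "map"
        models = list(pool.map(train_partition, parts))

    y_pred = vote(models, X_te)                                # "reduce"
    print("voted accuracy:", (y_pred == y_te).mean())
```

In this sketch, increasing n_nodes plays the role of adding computing nodes; on a real cluster the map tasks would run on separate machines rather than local processes.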
Keywords (Chinese)	★ 巨量資料 (Big Data)
★ 資料探勘 (Data Mining)
★ 分散式運算 (Distributed Computing)
★ 雲端運算 (Cloud Computing)
★ 樣本選取 (Instance Selection)
Keywords (English)	★ Big Data
★ Data Mining
★ Distributed Computing
★ Cloud Computing
★ Instance Selection
Table of Contents	Abstract (Chinese) i
Abstract (English) ii
Acknowledgments iii
Table of Contents iv
List of Figures vi
List of Tables viii
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivation 2
1.3 Research Objectives 4
1.4 Research Framework 6
Chapter 2 Literature Review 8
2.1 Big Data 8
2.2 Distributed Computing 10
2.2.1 Overview of Distributed Computing 10
2.2.2 Distributed Data Mining 11
2.3 Cloud Computing 13
2.3.1 Overview of Cloud Computing 13
2.3.2 System Architecture 15
2.3.3 Cloud-Based Data Mining 20
2.4 Data Classification 25
2.5 Data Preprocessing 28
2.5.1 IB3 30
2.5.2 DROP3 31
2.5.3 GA 33
Chapter 3 Experimental Methodology 35
3.1 Experiment 1 36
3.1.1 Baseline 36
3.1.2 Distributed Architecture 37
3.1.3 Cloud Architecture 38
3.1.3.1 Single-Host Cloud 38
3.1.3.2 Cluster Cloud 39
3.2 Experiment 2 41
3.2.1 Standalone Instance Selection 41
3.2.2 Distributed Architecture 42
3.2.3 Cloud Architecture 43
3.2.3.1 Single-Host Cloud 43
3.2.3.2 Cluster Cloud 44
Chapter 4 Experimental Results 45
4.1 Experimental Setup 46
4.1.1 Datasets 46
4.1.2 Computing Environment 47
4.1.3 Model Validation Criteria 48
4.2 Experimental Results 49
4.2.1 Results of Experiment 1 49
4.2.2 Results of Experiment 2 59
4.3 Discussion and Suggestions 71
Chapter 5 Conclusion 77
5.1 Results and Contributions 77
5.2 Research Limitations and Directions for Future Research 79
References 82
Appendix 1 87
Appendix 2 91
Appendix 3 95
Appendix 4 96
Advisor	Chih-fong Tsai (蔡志豐)    Date of Approval	2015-07-22
