A Study of Cloud-Based Classification Techniques for Data Stream Mining

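Name

Ming-Chao Hsu (許銘釗)

Department

Executive Master Program, Department of Information Management

Thesis Title

A Study of Cloud-Based Classification Techniques for Data Stream Mining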

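This electronic thesis is authorized for immediate open access.
Once the full text is open, users are licensed only to search, read, and print it for personal, non-profit academic research purposes.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.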

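Abstract (translated from the Chinese)

The total volume of streaming data may grow without bound over time, and its attributes also change over time. To keep the model from becoming outdated, stream processing starts from the current model each time and quickly updates it with the newest data produced within a short interval. Classification research that is built on a cloud architecture and targets streaming data is still relatively rare, so this study uses Apache Spark as the experimental processing platform and runs experiments on the streaming URL Reputation Data Set. The results can serve as a basis of comparison for new applications developed on Spark in the future.
Apache Spark is fast and makes repeated reads and iterative computation easy, which suits it to data mining on big data. Because Spark supports both batch processing and stream processing, it also meets this study's requirements for an experimental platform.
The experiments use the streaming URL Reputation Data Set, which derives many attributes from URLs and uses them to distinguish benign from malicious websites. Because the data set is large and high-dimensional, it cannot be processed on a single machine. We therefore run Apache Spark in the Amazon EC2 cloud environment and experiment with two binary classification algorithms, logistic regression and linear SVM.
The experiments fall into two categories, batch processing and stream processing, and each uses different amounts of data to train the model. Batch processing uses the Hadoop Distributed File System (HDFS) for file storage, while stream processing uses Apache Kafka as the provider of streaming data. In both cases Spark reads the data and calls the Machine Learning Library (MLlib), whose logistic regression and linear SVM algorithms are then used for modeling and testing.
The experimental results confirm that, on streaming data, stream processing beats batch processing in both speed and accuracy, and that linear SVM is generally more accurate than logistic regression under both processing modes. By implementing a streaming version of linear SVM, we also confirm that linear SVM is suitable for stream processing. We further find that when the streaming version of linear SVM uses the batch-processing default values for the optimization parameters, it is slow but very accurate as the number of samples accessed grows. With the stream-processing default values, its speed is close to that of logistic regression while its accuracy remains relatively higher.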

Abstract (English)

The total volume of streaming data can keep growing, and its attributes change over time. To avoid building an outdated model, stream processing repeatedly and quickly updates the model from the current model and the most recently produced data. Research on classification techniques for streaming data mining on cloud-based architectures is still relatively rare. This study therefore uses Apache Spark as the experimental processing platform for streaming data mining. The findings can serve as a basis of comparison for the future development of new applications on Apache Spark.
Apache Spark offers high processing speed and makes repeated reads and iterative computation easy, so it is well suited to big data mining. Spark also supports both batch and stream processing, which meets the needs of this study's experimental platform.
We use the URL Reputation Data Set, which derives a variety of attributes from URLs and uses them to distinguish benign from malicious websites. Because the data set is large and high-dimensional, processing it on a single machine is time-consuming. We therefore run Apache Spark on Amazon EC2 to experiment with two binary classification algorithms: logistic regression and linear SVM.
The experiments are divided into two types, batch processing and stream processing, and different amounts of data are used to train the classification models. Batch processing uses the Hadoop Distributed File System (HDFS) for file storage, and stream processing uses Apache Kafka as the provider of streaming data. In both approaches, Spark reads the data and then calls the logistic regression and linear SVM APIs of its machine learning library (MLlib) for modeling and testing.
The experimental results show that stream processing performs better than batch processing in both speed and classification accuracy. The accuracy of linear SVM under both processing modes is generally better than that of logistic regression. In addition, we demonstrate that linear SVM is suitable for streaming data mining. When the streaming version of linear SVM uses the batch-processing default values of the optimization parameters, it is slow as the number of samples grows, but its accuracy is good. With the stream-processing default values, the streaming version of linear SVM runs at a speed similar to logistic regression, yet its accuracy is still relatively higher.
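The streaming update described in the abstract (start from the current model, then adjust it with the newest mini-batch) can be illustrated with a small sketch. This is not the thesis's Spark MLlib code and does not reproduce its results. It is a plain-Python illustration under stated assumptions: labels are +1 and -1, features are sparse dictionaries, the step size decays as `step_size / sqrt(batch_index)`, and the stream is synthetic. All function names (`gradient`, `update`, `predict`, `synthetic_stream`, `run`) are hypothetical. The only difference between the two classifiers here is the loss whose gradient drives the update: hinge loss for linear SVM and logistic loss for logistic regression.

```python
import math
import random


def gradient(loss, weights, features, label):
    """Sparse (sub)gradient of one example; label is assumed to be +1 or -1."""
    margin = label * sum(weights.get(i, 0.0) * v for i, v in features.items())
    if loss == "hinge":  # linear SVM: only examples inside the margin contribute
        scale = -label if margin < 1.0 else 0.0
    elif loss == "logistic":  # logistic regression
        scale = -label / (1.0 + math.exp(min(margin, 700.0)))  # clamp avoids overflow
    else:
        raise ValueError(f"unknown loss: {loss}")
    return {i: scale * v for i, v in features.items()} if scale else {}


def update(weights, batch, loss, step_size=1.0, batch_index=1):
    """Update the current model in place from one mini-batch and return it."""
    if not batch:
        return weights
    rate = step_size / math.sqrt(batch_index)  # assumed decay schedule
    total = {}
    for features, label in batch:
        for i, g in gradient(loss, weights, features, label).items():
            total[i] = total.get(i, 0.0) + g
    for i, g in total.items():
        weights[i] = weights.get(i, 0.0) - rate * g / len(batch)
    return weights


def predict(weights, features):
    """Return +1 or -1 from the sign of the linear score."""
    score = sum(weights.get(i, 0.0) * v for i, v in features.items())
    return 1 if score >= 0.0 else -1


def synthetic_stream(n_batches, batch_size, seed=0):
    """Yield mini-batches of (features, label) pairs from a toy distribution."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            label = rng.choice((-1, 1))
            # feature 0 carries the signal, feature 1 is noise, feature 2 is a bias term
            features = {0: label + rng.gauss(0.0, 0.5), 1: rng.gauss(0.0, 1.0), 2: 1.0}
            batch.append((features, label))
        yield batch


def run(loss, n_batches=50, batch_size=20):
    """Test on each new batch with the current model, then train on that batch."""
    weights, correct, seen = {}, 0, 0
    for t, batch in enumerate(synthetic_stream(n_batches, batch_size), start=1):
        correct += sum(predict(weights, x) == y for x, y in batch)
        seen += len(batch)
        update(weights, batch, loss, batch_index=t)
    return weights, correct / seen
```

Calling `run("hinge")` and `run("logistic")` returns the final weights and the running test-then-train accuracy for each loss. The toy stream says nothing about which classifier is more accurate on the URL Reputation Data Set.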

Keywords (translated from the Chinese)

★ Cloud
★ Data stream
★ Classification techniques
★ Data mining
★ Apache Spark

Keywords (English)

★ Apache Spark
★ Cloud
★ Data Stream
★ Classification Technology
★ Data Mining

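Table of Contents

Abstract
Acknowledgements
Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
Chapter 2 Literature Review
2.1 Definition of Data Mining
2.2 Big Data Mining
2.3 Logistic Regression
2.4 Linear SVM
2.5 Optimization Options for Linear Methods in Spark
2.6 Applications of Linear SVM to Streaming Data
Chapter 3 Research Method
3.1 Tools and Execution Environment
3.2 Data Set
3.3 Stream Processing Experiments
Chapter 4 Results and Analysis
4.1 Batch Processing
4.2 Stream Processing
4.3 Combined Comparison of Batch and Stream Processing
Chapter 5 Conclusions and Suggestions
5.1 Conclusions
5.2 Suggestions for Future Research
References
Appendix
1.1 Basic Environment Setup
1.2 Batch Processing Experiments
1.3 Stream Processing Experiments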

Advisor

Chih-Fong Tsai (蔡志豐)

Approval Date

2016-6-4
