Abstract: | The total volume of streaming data can grow without bound, and its attributes change over time. To keep the model from becoming outdated, stream processing starts each update from the current model and quickly refines it with the most recent data produced over a short interval. Research on classification mining of streaming data on cloud-based architectures is still relatively rare. This study therefore uses Apache Spark as the experimental processing platform and runs experiments on a streaming data set, the URL Reputation Data Set; the results can serve as a baseline for comparison when developing new applications on Spark. Apache Spark offers high processing speed and makes repeated reads and iterative computation easy, which makes it well suited to big data mining. Spark also supports both batch and stream processing, which meets the platform requirements of this study. 
In this study, we use the URL Reputation Data Set, which derives a variety of attributes from URLs and uses them to distinguish benign websites from malicious ones. Because the data set is large and high-dimensional, it is impractical to process on a single machine. We therefore run Apache Spark on Amazon EC2 and experiment with two binary classification algorithms, logistic regression and linear SVM. The experiments fall into two categories, batch processing and stream processing, and each trains models on different amounts of data. Batch processing uses the Hadoop Distributed File System (HDFS) for file storage, while stream processing uses Apache Kafka as the provider of streaming data. In both cases, Spark reads the data and then calls the logistic regression and linear SVM algorithms in its machine learning library (MLlib) for model building and testing. The experimental results show that, on streaming data, stream processing outperforms batch processing in both speed and classification accuracy, and that linear SVM is generally more accurate than logistic regression under both processing modes. By implementing a streaming version of linear SVM, we also demonstrate that linear SVM is suitable for stream processing. Moreover, when the streaming version of linear SVM uses the default optimization parameter values from batch processing, it becomes slower as the number of samples accessed grows but achieves good accuracy; when it uses the stream processing defaults instead, its speed is close to that of logistic regression while its accuracy remains higher. |
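The incremental update described in the abstract, in which each new batch of data refines the current model instead of retraining from scratch, can be illustrated with a minimal sketch. This is not the study's Spark implementation: it is plain Python, the data is synthetic, and every name in it (`sgd_update`, `make_batch`, `run_stream`, the step size and regularization values) is a hypothetical choice made for illustration. It applies one mini-batch gradient step per arriving batch, using either the logistic loss or the hinge loss of a linear SVM.

```python
import math
import random


def sgd_update(weights, batch, loss, step_size=0.1, reg=0.01):
    """Apply one mini-batch gradient step, starting from the current weights.

    batch: list of (features, label) pairs with label in {0, 1}.
    loss:  "logistic" for logistic regression, "hinge" for linear SVM.
    """
    grad = [0.0] * len(weights)
    for x, y in batch:
        margin = sum(w * xi for w, xi in zip(weights, x))
        if loss == "logistic":
            # Gradient of the log loss: (sigmoid(margin) - y) * x
            coeff = 1.0 / (1.0 + math.exp(-margin)) - y
        else:
            # Hinge loss on labels mapped to {-1, +1}
            s = 2 * y - 1
            coeff = -s if s * margin < 1.0 else 0.0
        for i, xi in enumerate(x):
            grad[i] += coeff * xi
    n = len(batch)
    # Averaged gradient plus an L2 penalty term (assumed, not from the source).
    return [w - step_size * (g / n + reg * w) for w, g in zip(weights, grad)]


def predict(weights, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def make_batch(rng, size):
    """Synthetic, linearly separable stand-in for one micro-batch of a stream."""
    batch = []
    for _ in range(size):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        label = 1 if x1 + x2 > 0 else 0
        batch.append(([x1, x2, 1.0], label))  # last feature acts as a bias term
    return batch


def run_stream(loss, num_batches=50, batch_size=40, seed=0):
    """Feed batches one at a time; the model is carried over between batches."""
    rng = random.Random(seed)
    weights = [0.0, 0.0, 0.0]
    for _ in range(num_batches):
        weights = sgd_update(weights, make_batch(rng, batch_size), loss)
    test = make_batch(random.Random(seed + 1), 500)
    accuracy = sum(predict(weights, x) == y for x, y in test) / len(test)
    return weights, accuracy


if __name__ == "__main__":
    for name in ("logistic", "hinge"):
        _, acc = run_stream(name)
        print(name, "accuracy on held-out synthetic data:", acc)
```

The only difference between the two classifiers in this sketch is the per-example gradient coefficient, which is why a streaming variant of linear SVM can reuse the same update loop as streaming logistic regression. The accuracies it prints say nothing about the URL Reputation Data Set results reported in the study.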