dc.description.abstract | In streaming data, the total volume of data grows and its attributes may change over time. To avoid building an outdated model, stream processing quickly updates the model each time from the current model and the most recently produced data. Research on classification techniques for streaming data mining in cloud-based architectures is currently relatively scarce. This study therefore uses Apache Spark as an experimental processing platform for streaming data mining. The findings can serve as a baseline for comparison in the future development of new applications on Apache Spark.
Apache Spark offers high processing speed and makes it easy to read data repeatedly and compute iteratively, so it is well suited to big data mining. Spark also supports both batch and stream processing, which meets the needs of this study's experimental platform.
In this study, we use the URL Reputation Data Set, which derives a variety of attributes from URLs and uses them to distinguish benign from malicious websites. Because the data set is large and high-dimensional, processing it on a single machine is time-consuming. We therefore run Apache Spark on Amazon EC2 to experiment with two binary classification algorithms: logistic regression and linear SVM.
The experiments were divided into two types, Batch Processing and Streaming Processing, and different amounts of data were used to train the classification models. Batch Processing uses the Hadoop Distributed File System (HDFS) for file storage, and Streaming Processing uses Apache Kafka as the source of streaming data. After reading the data, both approaches call the logistic regression and linear SVM APIs of the Spark machine learning library (MLlib) for model training and testing.
The experimental results show that Streaming Processing performs better than Batch Processing in terms of speed and classification accuracy. In particular, the accuracy of SVM under both processing modes is generally better than that of logistic regression. In addition, we demonstrate that SVM is suitable for streaming data mining. Moreover, when the streaming version of linear SVM uses the Batch Processing default values for the optimization parameters, it becomes slower as the number of samples increases but still achieves good accuracy. With the Streaming Processing default values, on the other hand, the streaming version of SVM has a processing speed similar to that of logistic regression, yet its accuracy remains higher. | en_US |