一種針對LSTM長序列問題之新型前處理降維方法研究－以Android惡意程式分析為例;A Novel Preprocessing Method for Solving Long Sequence Problem in Android Malware Detection

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/81342

Title:	一種針對LSTM長序列問題之新型前處理降維方法研究－以Android惡意程式分析為例;A Novel Preprocessing Method for Solving Long Sequence Problem in Android Malware Detection
Authors:	徐振皓;Hsu, Cheng-Hao
Contributors:	Department of Information Management
Keywords:	Android; static analysis; opcode; preprocessing; malware classification; LSTM
Date:	2019-07-29
Issue Date:	2019-09-03 15:45:40 (UTC+8)
Publisher:	National Central University
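Abstract:	Android currently holds the largest share of the mobile phone market, while the amount of malware keeps multiplying. Traditional malware detection methods apply machine learning to a variety of features, such as API calls, system calls, control flow, and permissions. However, these features are easily modified and obfuscated by attackers. In addition, traditional machine learning mostly uses N-grams followed by feature selection, which is not only computationally heavy but also requires features to be re-extracted whenever new samples arrive. Sequence-based deep learning models such as LSTM also run into the long sequence problem when raw data is fed into them: the longer the input, the harder it is for the model to remember early features, a phenomenon known as vanishing gradients. Some studies therefore reduce dimensionality by training an Embedding layer or an Autoencoder, that is, by projecting the features into another dimension, but the trained result changes whenever the data set changes. This thesis proposes an Android malware detection framework based on deep learning and a novel preprocessing compression technique. It uses low-level opcodes as features, which are rich in meaning and hard to tamper with, and proposes a novel preprocessing dimension-reduction method that reduces the amount of model input data during preprocessing. This addresses the long sequence problem encountered in deep learning and enables fast detection and flexible model training. When new features and new samples appear in the future, the existing model can also be extended easily. The preprocessed opcode feature vectors are fed into an LSTM model, and experimental results show that a family classification model with accuracy as high as 95.58% can be trained in under 3 minutes.
Traditional machine learning mostly uses N-gram methods for sequential data prediction, which is not only time-consuming in pre-processing but also computationally expensive for the model. Current common malware detection methods apply machine learning to a variety of features such as API calls, system calls, control flow, and permissions. However, these features depend on expert analysis, and extracting multiple features is also time-consuming. This study uses Dalvik opcodes as features, which are information-rich and easy to extract. However, because opcode features form long sequences of variable length, the LSTM model and other sequence models need an effective dimension-reduction approach; without one, the long sequence problem results in poor training performance and long training time. Some studies train an Embedding layer or an Autoencoder to reduce the feature dimension, which requires additional network training time. Another method is feature selection, which yields different results whenever the data set changes and can lose the sequence semantics after selection. To solve these problems, this thesis proposes a new pre-processing method that addresses the long sequence problem encountered by the LSTM model, so as to achieve fast training and high accuracy.
This study combines the new pre-processing approach with an LSTM model to detect malware, achieving 95.58% accuracy on a 10-family Drebin data set while taking only 45 seconds to train a model. In addition, the experiments show that the approach remains effective under the small-training-sample problem common in deep learning.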
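The N-gram baseline that the abstract contrasts against can be illustrated with a minimal sketch. This is not the thesis's preprocessing method, which the abstract does not specify; the function name `opcode_ngrams` and the toy opcode sequence are assumptions made for illustration only. The sketch counts contiguous n-grams in an opcode sequence. Because a vocabulary of V opcodes admits up to V**n distinct n-grams, the feature space grows quickly with n, which is one reason the N-gram route is described as computationally expensive.

```python
from collections import Counter
from typing import List


def opcode_ngrams(opcodes: List[str], n: int) -> Counter:
    """Count contiguous n-grams in an opcode sequence (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n over the sequence; sequences shorter than n
    # produce an empty Counter.
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


# Toy sequence of Dalvik opcode mnemonics (illustrative, not thesis data).
sample = ["const/4", "invoke-virtual", "move-result", "if-eqz",
          "const/4", "invoke-virtual", "move-result", "return"]

bigrams = opcode_ngrams(sample, 2)
trigrams = opcode_ngrams(sample, 3)

for gram, count in bigrams.most_common(3):
    print(gram, count)
```

In practice the counts would be computed per application and then passed through feature selection, which is the step the abstract notes must be repeated when new samples arrive.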
Appears in Collections:	[Graduate Institute of Information Management] Electronic Thesis & Dissertation
