WebShell Detection Based on CodeBERT/GraphCodeBERT and Deep Learning Model


Author

王冠渝 (Guan-Yu Wang)

Department

Computer Science and Information Engineering

Thesis Title

WebShell Detection Based on CodeBERT/GraphCodeBERT and Deep Learning Model
(基於 CodeBERT/GraphCodeBERT 和深度學習模型之網頁木馬偵測研究)


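This electronic thesis is authorized for immediate open access.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.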

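Abstract (Chinese)

WebShell attacks have long troubled network administrators. Because the scalability and distributed nature of cloud services may amplify the potential risk and impact of WebShell attacks, such attacks have also become one of the main security problems in cloud environments. In recent years, many strategies have therefore been proposed to defend against them. This thesis proposes two effective deep-learning-based methods for detecting WebShells. Both methods use Byte Pair Encoding (BPE) to encode the WebShell source code as strings, splitting the input into tokens. To generate word embedding vectors, Method 1 uses CodeBERT and Method 2 uses GraphCodeBERT. The pre-trained CodeBERT and GraphCodeBERT models share the same base architecture and both understand code effectively, but GraphCodeBERT further improves code understanding by taking into account the relationships within the code and its internal structure. In addition, both methods use a gated recurrent unit (GRU) or a bidirectional GRU to detect whether the code contains a WebShell. In the experimental phase, both methods were trained with different hyperparameters, and K-Fold cross-validation was used to confirm the best results and the corresponding models. The models of both methods were then evaluated on a test dataset, and the results were compared with related work. Method 1 achieved an accuracy of 99.54%, a precision of 98.42%, a recall of 99.29%, and an F1 score of 98.85%. Method 2 performed better, with an accuracy of 99.65%, a precision of 99.29%, a recall of 99.29%, and an F1 score of 99.29%. These results show that the proposed methods improve significantly on previous methods. Moreover, compared with other open-source or commercial tools, the proposed methods perform excellently on every metric. Notably, they achieve outstanding accuracy and precision on unseen data and obfuscated code, demonstrating superior detection capability and practical value.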
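To make the BPE tokenization step more concrete, here is a minimal Python sketch of how merge rules can be learned and applied. It is a character-level toy on a made-up corpus, not the tokenizer used in the thesis; the function names `learn_bpe`, `merge_pair`, and `tokenize`, and the sample strings, are assumptions chosen for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair fused.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(text, merges):
    """Split `text` into characters, then replay the learned merges in order."""
    symbols = tuple(text)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: "eval" is frequent, so it becomes a single token.
merges = learn_bpe(["eval", "eval", "eval", "echo"], num_merges=3)
print(tokenize("eval($x)", merges))  # ['eval', '(', '$', 'x', ')']
```

Frequent substrings such as function names collapse into single tokens, while rare or obfuscated fragments fall back to smaller pieces, so no input is ever out of vocabulary.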

Abstract (English)

WebShell attacks have long been a significant challenge for website administrators. The scalability and distributed nature of cloud services can exacerbate the potential risks and impact of WebShell attacks, making them one of the main security threats in cloud environments. Consequently, various strategies have been proposed in recent years to guard against them. This thesis presents two effective deep-learning-based methods for detecting WebShells. Both methods use Byte Pair Encoding (BPE) to encode the WebShell source code, splitting the input into tokens. To generate word embedding vectors, Method 1 uses CodeBERT, while Method 2 uses GraphCodeBERT. The pre-trained CodeBERT and GraphCodeBERT models share the same base architecture and both represent code effectively, but GraphCodeBERT further improves code comprehension by considering the relationships within the code and its internal structure. Both methods then use a GRU or a Bidirectional GRU to detect whether the code contains a WebShell. During the experimental phase, the two methods were trained with various hyperparameters, and the best results and corresponding models were confirmed through K-Fold cross-validation. The resulting models were then evaluated on a test dataset, and the results were compared with related work. Method 1 achieved an accuracy of 99.54%, a precision of 98.42%, a recall of 99.29%, and an F1 score of 98.85%. Method 2 performed even better, with an accuracy of 99.65%, a precision of 99.29%, a recall of 99.29%, and an F1 score of 99.29%. These results demonstrate a significant improvement over previous methods. Moreover, compared with other open-source and commercial tools, the proposed methods perform strongly on all metrics. Notably, they show outstanding accuracy and precision on unseen data and obfuscated code, demonstrating their detection capability and practical value.
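The reported F1 scores can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The Python sketch below does this for both methods and also shows how the four metrics follow from confusion-matrix counts. The helper names `f1_score` and `metrics_from_counts` and the example counts are hypothetical; the thesis's actual confusion matrices are not given in this record.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (any consistent unit)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fn: WebShell samples detected / missed.
    tn/fp: benign samples passed / wrongly flagged.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_score(precision, recall)


# Percentages as reported in the abstract.
print(round(f1_score(98.42, 99.29), 2))  # Method 1 -> 98.85
print(round(f1_score(99.29, 99.29), 2))  # Method 2 -> 99.29

# Hypothetical counts, for illustration only.
print(metrics_from_counts(tp=90, fp=10, tn=890, fn=10))
```

Both recomputed values agree with the abstract. Accuracy alone can look high when benign files dominate a dataset, which is why precision and recall on the WebShell class are reported separately.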

Keywords (Chinese)

★ WebShell
★ CodeBERT
★ GraphCodeBERT
★ Gated Recurrent Unit
★ Bidirectional Gated Recurrent Unit
★ Byte Pair Encoding

Keywords (English)

★ WebShell
★ CodeBERT
★ GraphCodeBERT
★ GRU
★ Bidirectional GRU
★ BPE

Table of Contents

Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter I Introduction
1-1 Research Background
1-2 Motivation
1-3 Contribution
Chapter II Background Knowledge
2-1 Malware Analysis
2-1-1 Static and Dynamic Malware Analysis
2-2 PHP WebShell
2-2-1 Obfuscated Code
2-3 Tokenization Method
2-3-1 Subword Encoding
2-3-2 Byte Pair Encoding (BPE)
2-4 Word Embedding Model
2-4-1 Transformer
2-4-2 BERT
2-4-3 RoBERTa
2-4-4 CodeBERT
2-4-5 GraphCodeBERT
2-4-6 Difference Between CodeBERT and GraphCodeBERT
2-5 Classification Model
2-5-1 LSTM
2-5-2 GRU and Bidirectional GRU
Chapter III Related Work
3-1 Based on Machine Learning
3-2 Based on Deep Learning
Chapter IV Proposed Methods
4-1 Data Pre-processing
4-2 Tokenizer
4-2-1 Method 1 and Method 2
4-3 Word Embedding Model
4-3-1 Method 1
4-3-2 Method 2
4-4 Classification Model
4-4-1 Method 1 and Method 2
4-4-2 Training Procedures
Chapter V Experiment and Evaluation
5-1 Dataset and Configuration
5-2 Evaluation Matrix
5-3 Performance of Proposed Methods
5-3-1 Stage 1 of Training Procedure: Identify Optimal Hyper-parameters
5-3-2 Stage 2 of Training Procedure: K-Fold Cross-Validation
5-3-3 Experiment 1: Test Data for Generalizable Testing
5-3-4 Experiment 2: Obfuscated Code
5-3-5 Experiment 3: Comparison with Related Works
5-3-6 Experiment 4: Comparison with Other Related Tools
Chapter VI Conclusion and Future Work
References


Advisor

王尉任 (Wei-Jen Wang)

Approval Date

2024-7-3
