

    Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98320


    Title: Enhancing LLM Security: Adaptive Defense Method Against Jailbreaking in Diverse Tasks
    Authors: 賴冠宇;Lai, Guan-Yu
    Contributors: Department of Information Management
    Keywords: Large Language Models;jailbreaking attacks;defense mechanisms;confidentiality protection;compliance;LoRA fine-tuning
    Date: 2025-07-22
    Issue Date: 2025-10-17 12:37:36 (UTC+8)
    Publisher: National Central University
    Abstract: The rapid adoption of Large Language Models (LLMs) in corporate knowledge bases, public-sector decision-making, and youth content services exposes them to jailbreaking attacks that extract confidential or compliance-sensitive content without producing overtly harmful output. Existing defenses, tuned mainly to violence, hate, or illicit use, are largely ineffective against attacks that violate only confidentiality or compliance rules. We propose an Adaptive Front-end Defense LLM: a lightweight gatekeeper built on LLaMA-3 3B and 8B backbones and fine-tuned with LoRA over a small fraction of parameters. Deployed in front of the main answering model, it screens prompts at ingestion, blocking adversarial queries while passing legitimate ones; a minimal sketch of this pipeline follows.
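    The abstract describes the gatekeeper pattern but publishes no code here, so the sketch below is illustrative only: it assumes a binary-classification head on a LLaMA-3 backbone with LoRA adapters via Hugging Face transformers and peft. The model name, label convention, LoRA hyperparameters, and refusal message are all assumptions, not artifacts of the thesis.

        # Hedged sketch of the front-end gatekeeper: a small LoRA-tuned
        # classifier that screens prompts before the main answering model.
        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer
        from peft import LoraConfig, get_peft_model

        GUARD_MODEL = "meta-llama/Llama-3.2-3B"  # assumed 3B backbone; the thesis uses LLaMA-3 3B/8B

        tokenizer = AutoTokenizer.from_pretrained(GUARD_MODEL)
        guard = AutoModelForSequenceClassification.from_pretrained(GUARD_MODEL, num_labels=2)

        # LoRA adapters so only a small fraction of parameters is trained;
        # rank and target modules here are illustrative defaults.
        lora_cfg = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16,
                              target_modules=["q_proj", "v_proj"])
        guard = get_peft_model(guard, lora_cfg)

        def screen_prompt(prompt: str) -> bool:
            """Return True if the (fine-tuned) guard flags the prompt as adversarial."""
            inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
            with torch.no_grad():
                logits = guard(**inputs).logits
            return logits.argmax(dim=-1).item() == 1  # label 1 = adversarial (assumed)

        def answer(prompt: str, main_model) -> str:
            """Front-end defense: block flagged prompts, forward the rest."""
            if screen_prompt(prompt):
                return "Request blocked: it conflicts with access-control or compliance rules."
            return main_model(prompt)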
    We construct three compliance-focused datasets, Access Control, Content Relevance, and Age Verification, totaling 10k examples, and augment them with the AdvBench attack corpus covering six jailbreak techniques, including PAP, Prompt Packer, Evil Twins, and DeepInception. On the custom scenarios the defense achieves a 97-100% Defense Success Rate (DSR) with a False Rejection Rate (FRR) below 3%, and it sustains over 95% DSR on AdvBench, offering higher security and fewer false positives than GPT-4o, Claude 3.5 Sonnet, Gemini 2.0-flash, and Grok 3. Mixed-rule tests and out-of-domain evaluations show strong transferability; a 10% experience-replay buffer mitigates catastrophic forgetting when new tasks are added (see the replay sketch after the contributions); and average inference latency stays under 1 s, meeting real-time service requirements. The two headline metrics are defined in the sketch below.
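    The abstract reports DSR and FRR without formal definitions; the helpers below follow the standard reading (attack-blocking rate and benign-rejection rate) and may differ from the thesis's exact formulas.

        # Hedged definitions of the two headline metrics.
        def defense_success_rate(blocked_attacks: int, total_attacks: int) -> float:
            """DSR: fraction of adversarial prompts the gatekeeper blocks."""
            return blocked_attacks / total_attacks

        def false_rejection_rate(rejected_benign: int, total_benign: int) -> float:
            """FRR: fraction of legitimate prompts wrongly blocked."""
            return rejected_benign / total_benign

        # Toy numbers consistent with the reported results: 97-100% DSR, <3% FRR.
        assert defense_success_rate(970, 1000) >= 0.97
        assert false_rejection_rate(29, 1000) < 0.03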
    Contributions: (1) a low-cost front-end defense framework that supports fast incremental learning; (2) the release of three adversarial datasets for compliance-oriented scenarios, filling a gap in existing work; (3) a systematic quantification of the defense-usability trade-off, giving empirical grounding for deploying LLMs in high-sensitivity applications. Future work will extend the defense to multimodal attacks and improve continual-learning efficiency.
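    The 10% experience-replay figure comes from the abstract; everything else in this sketch (data structures, names, uniform sampling) is an illustrative assumption about how such a buffer is typically mixed into continual fine-tuning.

        # Hedged sketch of blending a 10% replay buffer into a new task's
        # fine-tuning set to mitigate catastrophic forgetting.
        import random

        def build_continual_batch(new_task_data: list, old_task_data: list,
                                  replay_ratio: float = 0.10) -> list:
            """Combine new-task examples with a small sample replayed from
            earlier tasks, then shuffle before fine-tuning."""
            n_replay = int(len(new_task_data) * replay_ratio)
            replay = random.sample(old_task_data, min(n_replay, len(old_task_data)))
            batch = new_task_data + replay
            random.shuffle(batch)
            return batch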
    Appears in Collections: [Graduate Institute of Information Management] Electronic Thesis & Dissertation

    Files in This Item:

    File        Size  Format
    index.html  0Kb   HTML


    All items in NCUIR are protected by copyright, with all rights reserved.

