Abstract: The rapid adoption of Large Language Models (LLMs) in high-sensitivity settings such as corporate confidentiality management, government services, and youth content moderation exposes them to jailbreaking attacks that leak confidential or compliance-sensitive content without overtly harmful output. Existing defenses, tuned mainly to violence, hate, or illicit use, are largely ineffective against these covert attacks that violate only confidentiality and compliance rules. We propose an Adaptive Front-end Defense LLM: a lightweight gatekeeper built on LLaMA-3 (3B/8B) backbones with parameter-efficient LoRA fine-tuning, deployed in front of the main answering model to screen prompts at ingestion, blocking adversarial queries while passing legitimate ones. We construct three new compliance-focused datasets, Access Control, Content Relevance, and Age Verification, totaling 10k samples, and additionally integrate the AdvBench attack corpus, covering six attack classes including PAP, Prompt Packer, Evil Twins, and DeepInception. Experiments show a Defense Success Rate (DSR) of 97–100% on the self-built scenarios with a False Rejection Rate (FRR) on legitimate queries below 3%, and over 95% DSR on the general AdvBench benchmark, significantly outperforming commercial models including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0-flash, and Grok 3. Further mixed-rule tests and out-of-domain evaluations confirm good transferability, and a 10% experience-replay buffer effectively mitigates catastrophic forgetting during continual learning. Average end-to-end inference latency is under 1 s, meeting real-time service requirements. Our contributions: (1) a low-cost front-end defense framework that supports rapid incremental learning; (2) the release of three adversarial datasets for compliance scenarios, filling a gap in existing research; (3) a systematic quantification of the defense–usability trade-off, providing empirical grounds for deploying LLMs safely in high-sensitivity applications. Future work will extend the defense to multimodal attacks and optimize the continual-learning strategy.
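To make the deployment pattern concrete, the following is a minimal sketch of the front-end gatekeeper described above, assuming a fine-tuned classifier checkpoint with SAFE/ATTACK labels. The model paths, label scheme, and refusal message are illustrative placeholders, not the paper's released artifacts.

```python
# Sketch of the front-end defense pattern: a small fine-tuned classifier
# screens each prompt before it reaches the main answering model.
from transformers import pipeline

# Hypothetical LoRA-fine-tuned LLaMA-3 checkpoint exported as a
# sequence classifier with labels {"SAFE", "ATTACK"} (assumed names).
gatekeeper = pipeline("text-classification", model="path/to/defense-llama3-8b")
answerer = pipeline("text-generation", model="path/to/main-model")

def answer(prompt: str) -> str:
    verdict = gatekeeper(prompt)[0]
    if verdict["label"] == "ATTACK":
        # Blocked at ingestion; the main model never sees the prompt.
        return "This request violates the deployment's compliance rules."
    return answerer(prompt, max_new_tokens=256)[0]["generated_text"]
```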
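The gatekeeper itself can be produced with parameter-efficient LoRA fine-tuning; a sketch using Hugging Face PEFT follows. The rank, alpha, dropout, and target modules are assumptions on our part, since the abstract states only that LoRA is applied to LLaMA-3 (3B/8B).

```python
# Sketch of LoRA fine-tuning for the gatekeeper with Hugging Face PEFT.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", num_labels=2  # SAFE vs. ATTACK (assumed)
)
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,      # assumed hyperparameters
    target_modules=["q_proj", "v_proj"],        # attention projections only
    task_type="SEQ_CLS",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # a small fraction of the base weights
```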
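The two headline metrics follow their usual definitions; a small reference implementation, assuming per-prompt block/pass verdicts, is shown below.

```python
# DSR = fraction of attack prompts the gatekeeper blocks;
# FRR = fraction of legitimate prompts it wrongly blocks.
def defense_success_rate(attack_blocked: list[bool]) -> float:
    """attack_blocked[i] is True if attack prompt i was blocked."""
    return sum(attack_blocked) / len(attack_blocked)

def false_rejection_rate(legit_blocked: list[bool]) -> float:
    """legit_blocked[i] is True if legitimate prompt i was blocked."""
    return sum(legit_blocked) / len(legit_blocked)

# E.g. 97-100% DSR means blocking 970-1000 of 1000 attack prompts;
# <3% FRR means wrongly rejecting fewer than 30 of 1000 legitimate queries.
```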
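Finally, the 10% experience replay used against catastrophic forgetting can be sketched as mixing a small sample of earlier tasks' data into each new fine-tuning run; uniform sampling over a pooled buffer is an assumption, as the abstract does not specify the sampling strategy.

```python
# Sketch of 10% experience replay for continual learning: when training
# on a new rule set, mix in a random sample of earlier tasks' examples
# so the gatekeeper retains previously learned rules.
import random

def build_training_set(new_task, old_tasks, replay_ratio=0.10, seed=0):
    rng = random.Random(seed)
    replay_size = int(replay_ratio * len(new_task))
    pool = [ex for task in old_tasks for ex in task]
    replay = rng.sample(pool, min(replay_size, len(pool)))
    mixed = list(new_task) + replay
    rng.shuffle(mixed)
    return mixed
```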