

    Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98320


    Title: Enhancing LLM Security: Adaptive Defense Method Against Jailbreaking in Diverse Tasks
    Authors: 賴冠宇;Lai, Guan-Yu
    Contributors: Department of Information Management
    Keywords: Large Language Models;jailbreaking attacks;defense mechanisms;confidentiality protection;compliance;LoRA fine-tuning
    Date: 2025-07-22
    Issue Date: 2025-10-17 12:37:36 (UTC+8)
    Publisher: National Central University
    Abstract: The rapid adoption of Large Language Models (LLMs) in corporate knowledge bases, public-sector decision-making, and youth content services exposes them to jailbreaking attacks that extract confidential or compliance-sensitive content without producing overtly harmful output. Existing defenses, tuned mainly to violence, hate, or illicit use, are largely ineffective against attacks that violate only confidentiality or compliance rules. We propose an Adaptive Front-end Defense LLM: a lightweight gatekeeper built on LLaMA-3 3B and 8B backbones and fine-tuned with LoRA over a small fraction of parameters. Deployed in front of the main answering model, it screens prompts at ingestion, blocking adversarial queries while passing legitimate ones; a minimal sketch of this pipeline follows.
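    The abstract describes the gatekeeper pattern but publishes no code here, so the sketch below is illustrative only: it assumes a binary-classification head on a LLaMA-3 backbone with LoRA adapters via Hugging Face transformers and peft. The model name, label convention, LoRA hyperparameters, and refusal message are all assumptions, not artifacts of the thesis.

        # Hedged sketch of the front-end gatekeeper: a small LoRA-tuned
        # classifier that screens prompts before the main answering model.
        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer
        from peft import LoraConfig, get_peft_model

        GUARD_MODEL = "meta-llama/Llama-3.2-3B"  # assumed 3B backbone; the thesis uses LLaMA-3 3B/8B

        tokenizer = AutoTokenizer.from_pretrained(GUARD_MODEL)
        guard = AutoModelForSequenceClassification.from_pretrained(GUARD_MODEL, num_labels=2)

        # LoRA adapters so only a small fraction of parameters is trained;
        # rank and target modules here are illustrative defaults.
        lora_cfg = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16,
                              target_modules=["q_proj", "v_proj"])
        guard = get_peft_model(guard, lora_cfg)

        def screen_prompt(prompt: str) -> bool:
            """Return True if the (fine-tuned) guard flags the prompt as adversarial."""
            inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
            with torch.no_grad():
                logits = guard(**inputs).logits
            return logits.argmax(dim=-1).item() == 1  # label 1 = adversarial (assumed)

        def answer(prompt: str, main_model) -> str:
            """Front-end defense: block flagged prompts, forward the rest."""
            if screen_prompt(prompt):
                return "Request blocked: it conflicts with access-control or compliance rules."
            return main_model(prompt)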
    We construct three compliance-focused datasets, Access Control, Content Relevance, and Age Verification, totaling 10k examples, and augment them with the AdvBench attack corpus covering six jailbreak techniques, including PAP, Prompt Packer, Evil Twins, and DeepInception. On the custom scenarios the defense achieves a 97-100% Defense Success Rate (DSR) with a False Rejection Rate (FRR) below 3%, and it sustains over 95% DSR on AdvBench, offering higher security and fewer false positives than GPT-4o, Claude 3.5 Sonnet, Gemini 2.0-flash, and Grok 3. Mixed-rule tests and out-of-domain evaluations show strong transferability; a 10% experience-replay buffer mitigates catastrophic forgetting when new tasks are added (see the replay sketch after the contributions); and average inference latency stays under 1 s, meeting real-time service requirements. The two headline metrics are defined in the sketch below.
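    The abstract reports DSR and FRR without formal definitions; the helpers below follow the standard reading (attack-blocking rate and benign-rejection rate) and may differ from the thesis's exact formulas.

        # Hedged definitions of the two headline metrics.
        def defense_success_rate(blocked_attacks: int, total_attacks: int) -> float:
            """DSR: fraction of adversarial prompts the gatekeeper blocks."""
            return blocked_attacks / total_attacks

        def false_rejection_rate(rejected_benign: int, total_benign: int) -> float:
            """FRR: fraction of legitimate prompts wrongly blocked."""
            return rejected_benign / total_benign

        # Toy numbers consistent with the reported results: 97-100% DSR, <3% FRR.
        assert defense_success_rate(970, 1000) >= 0.97
        assert false_rejection_rate(29, 1000) < 0.03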
    Contributions: (1) a low-cost front-end defense framework that supports fast incremental learning; (2) the release of three adversarial datasets for compliance-oriented scenarios, filling a gap in existing work; (3) a systematic quantification of the defense-usability trade-off, giving empirical grounding for deploying LLMs in high-sensitivity applications. Future work will extend the defense to multimodal attacks and improve continual-learning efficiency.
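    The 10% experience-replay figure comes from the abstract; everything else in this sketch (data structures, names, uniform sampling) is an illustrative assumption about how such a buffer is typically mixed into continual fine-tuning.

        # Hedged sketch of blending a 10% replay buffer into a new task's
        # fine-tuning set to mitigate catastrophic forgetting.
        import random

        def build_continual_batch(new_task_data: list, old_task_data: list,
                                  replay_ratio: float = 0.10) -> list:
            """Combine new-task examples with a small sample replayed from
            earlier tasks, then shuffle before fine-tuning."""
            n_replay = int(len(new_task_data) * replay_ratio)
            replay = random.sample(old_task_data, min(n_replay, len(old_task_data)))
            batch = new_task_data + replay
            random.shuffle(batch)
            return batch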
    Appears in Collections: [Graduate Institute of Information Management] Electronic Thesis & Dissertation

    Files in This Item:

    File        Size  Format
    index.html  0Kb   HTML


    All items in NCUIR are protected by copyright, with all rights reserved.

