NCU Institutional Repository: Item 987654321/98250


    Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98250


    Title: Adaptive Optimization for Balanced Exploration and Exploitation in Reinforcement Learning
    Authors: 蔡承峻;Tsai, Cheng-Chun
    Contributors: Department of Computer Science and Information Engineering
    Keywords: Reinforcement Learning;Diverse;Adaptive;Behavior
    Date: 2025-07-17
    Upload time: 2025-10-17 12:32:53 (UTC+8)
    Publisher: National Central University
    Abstract: Deep reinforcement learning (DRL) has demonstrated notable success in solving complex
    decision-making problems, yet it often suffers from a lack of behavioral diversity in environments
    with sparse rewards or multiple optimal strategies. To overcome this limitation, we propose
    Adaptive Reward-Switching Policy Optimization (ARPO), a trajectory-level filtering framework that
    dynamically adjusts its novelty threshold according to the agent's performance trends during
    training. ARPO builds upon the Reward-Switching Policy Optimization (RSPO) paradigm by using the
    mean negative log-likelihood (NLL) as a behavioral similarity measure and adapting the filtering
    threshold based on reward dynamics. This adaptive mechanism enables the agent to promote diverse
    exploration when progress stagnates and to focus on policy refinement when rewards improve. We
    evaluate ARPO in challenging maze environments with dynamic hazards and deceptive rewards,
    comparing its performance with baseline methods including PPO, DvD, SMERL, and RSPO. Experimental
    results demonstrate that ARPO achieves higher reward acquisition, greater behavioral diversity,
    and improved adaptability. This work highlights the importance of adaptive novelty filtering in
    developing robust and strategically diverse reinforcement learning agents.
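
    A minimal sketch of the adaptive filtering idea described in the abstract is given below. This is not the thesis implementation: the function names (nll_mean, is_novel, adapt_threshold), the moving-window reward comparison, and the multiplicative threshold update are assumptions introduced only to illustrate how a trajectory-level novelty filter based on the mean negative log-likelihood could be tightened when rewards stagnate and relaxed when they improve.

```python
import numpy as np

def nll_mean(ref_log_probs):
    """Mean negative log-likelihood of a trajectory's actions under one
    reference (previously discovered) policy; larger values mean the
    trajectory behaves less like that policy, i.e. looks more novel."""
    return -float(np.mean(ref_log_probs))

def is_novel(traj_ref_log_probs, threshold):
    """Trajectory-level filter: the trajectory counts as novel only if its
    NLL mean exceeds the threshold with respect to every reference policy."""
    return all(nll_mean(lp) > threshold for lp in traj_ref_log_probs)

def adapt_threshold(threshold, reward_history, window=10, up=1.05, down=0.95):
    """Hypothetical update rule: compare the mean reward of the last `window`
    iterations with the preceding window. If progress stagnates, raise the
    threshold (demand more novelty, i.e. more diverse exploration); if rewards
    improve, lower it so training can focus on policy refinement."""
    if len(reward_history) < 2 * window:
        return threshold
    recent = float(np.mean(reward_history[-window:]))
    earlier = float(np.mean(reward_history[-2 * window:-window]))
    return threshold * (up if recent <= earlier else down)
```

    In an RSPO-style loop one would presumably append each iteration's average reward to reward_history, call adapt_threshold once per iteration, and use is_novel to decide whether a sampled trajectory is optimized against the extrinsic reward or treated as too similar to previously found policies; the exact coupling to the policy update is specific to the thesis and not reproduced here.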
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Electronic Theses & Dissertations

    Files in This Item:

    File         Description   Size   Format   Views
    index.html                 0Kb    HTML     10


    All items in NCUIR are protected by copyright, with all rights reserved.
