Please use this permanent URL to cite or link to this item:
http://ir.lib.ncu.edu.tw/handle/987654321/95579
Title: Reductionist Reinforcement Reward Design (R3D): A Novel Method for Efficient Reward Function Design in Complex Reinforcement Learning Tasks
Author: Wong, Hsing-Yu (翁星宇)
Contributor: Department of Computer Science and Information Engineering
Keywords: reinforcement learning; hierarchical reinforcement learning; LLM; reward function
Date: 2024-07-23
Upload time: 2024-10-09 17:04:16 (UTC+8)
Publisher: National Central University
Abstract: The emergence of Large Language Models (LLMs) has led to numerous groundbreaking advancements in the field of robotic control. However, current systems that use LLMs to automatically design reward functions for reinforcement learning tasks hit a bottleneck on complex tasks and struggle to produce effective rewards. This is a major hurdle not only for LLMs but also for human experts. To address it, we propose Reductionist Reinforcement Reward Design (R3D), a reward design method aimed at complex reinforcement learning tasks, and combine it with the generative capabilities of LLMs to build the LLM-based Reward Co-design System (LLMRCS). R3D follows a hierarchical reinforcement learning approach: it decomposes a complex task into multiple sub-tasks, designs a sub-task reward function for each, and then combines them into an effective overall reward function. This simplifies the reward design process and safeguards the effectiveness of rewards for complex tasks. LLMRCS builds on the earlier Eureka system, integrating R3D into the reward generation pipeline and incorporating expert feedback during optimization. Experimental results show that R3D significantly improves learning efficiency on complex tasks. In the FrankaCubeStack task, a reward designed with R3D reached an 80% success rate 50.9 times faster than one designed with the traditional approach. Moreover, LLMRCS can generate reward functions comparable to human-designed ones without human intervention, and in some tasks it even surpasses human experts. Our experiments further show that R3D not only eases reward design for complex tasks and improves learning efficiency, but also offers good explainability: by analyzing training results we can continuously refine the reward function, turning R3D into a reward optimization method. Finally, we extend R3D's design principles by modifying the final reward so that the agent exhibits a blend of multiple strategies, demonstrating the broad applicability and future development potential of R3D.
Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Master's and Doctoral Theses
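The abstract describes R3D's core mechanism: decompose a complex task into sub-tasks, give each sub-task its own reward function, and combine them into one overall reward. The following is a minimal sketch of that idea for a hypothetical cube-stacking setup; the sub-task definitions, observation keys, and weights are illustrative assumptions and do not reproduce the thesis's actual reward functions.

```python
# Minimal sketch of the reward-decomposition idea described in the abstract.
# The sub-tasks (reach, lift, align), observation layout, and weights are
# hypothetical assumptions for illustration only.
import numpy as np

def reach_reward(eef_pos, cube_pos):
    """Sub-task 1: move the end effector toward the cube."""
    return -np.linalg.norm(eef_pos - cube_pos)

def lift_reward(cube_height, table_height, target_lift=0.1):
    """Sub-task 2: lift the cube above the table surface."""
    return min(cube_height - table_height, target_lift) / target_lift

def align_reward(cube_pos, target_pos):
    """Sub-task 3: align the held cube over the target position."""
    return -np.linalg.norm(cube_pos[:2] - target_pos[:2])

def combined_reward(obs, weights=(1.0, 2.0, 2.0)):
    """Combine the sub-task rewards into one overall reward."""
    return (weights[0] * reach_reward(obs["eef_pos"], obs["cube_pos"])
            + weights[1] * lift_reward(obs["cube_pos"][2], obs["table_height"])
            + weights[2] * align_reward(obs["cube_pos"], obs["target_pos"]))

# Example call with dummy observations.
obs = {
    "eef_pos": np.array([0.4, 0.0, 0.5]),
    "cube_pos": np.array([0.5, 0.1, 0.42]),
    "target_pos": np.array([0.5, -0.1, 0.46]),
    "table_height": 0.4,
}
print(combined_reward(obs))
```

A plain weighted sum is only the simplest possible combination; the thesis's hierarchical synthesis of sub-task rewards may differ in how the sub-task terms are staged and merged.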
Files in This Item:

File | Description | Size | Format | Views
index.html | | 0Kb | HTML | 30
All items in NCUIR are protected by the original copyright.