Master's/Doctoral Thesis 111522076: Detailed Record




Author: Hsing-Yu Wong (翁星宇)   Graduate Department: Computer Science and Information Engineering
Thesis Title: Reductionist Reinforcement Reward Design (R3D): A Novel Method for Efficient Reward Function Design in Complex Reinforcement Learning Tasks
(Chinese title: 獎勵化約法(R3D):一個新的強化學習複雜任務高效獎勵設計方法)
Related Theses
★ An Intelligent Controller Development Platform Integrating a GRAFCET Virtual Machine
★ Design and Implementation of a Distributed Industrial Electronic Kanban Network System
★ Design and Implementation of a Dual-Touch Screen Based on a Two-Camera Vision System
★ An Embedded Computing Platform for Intelligent Robots
★ An Embedded System for Real-Time Moving-Object Detection and Tracking
★ A Multiprocessor Architecture and Distributed Control Algorithm for Solid-State Drives
★ A Human-Machine Interaction System Based on Stereo-Vision Gesture Recognition
★ Robot System-on-Chip Design Integrating Biomimetic Intelligent Behavior Control
★ Design and Implementation of an Embedded Wireless Image Sensor Network
★ A License Plate Recognition System Based on a Dual-Core Processor
★ Continuous 3D Gesture Recognition Based on Stereo Vision
★ Design and Hardware Implementation of a Miniature, Ultra-Low-Power Wireless Sensor Network Controller
★ Real-Time Face Detection, Tracking, and Recognition on Streaming Video: An Embedded System Design
★ Embedded Hardware Design for a Fast Stereo Vision System
★ Design and Implementation of a Real-Time Continuous Image Stitching System
★ An Embedded Gait Recognition System Based on a Dual-Core Platform
Files: Full text available for viewing in the system after 2029-07-22.
Abstract (Chinese) The rise of large language models (LLMs) has brought many breakthrough developments to the field of robot control. However, current systems that use LLMs to automatically design reward functions for reinforcement learning tasks run into a bottleneck on complex tasks and struggle to complete the reward design effectively. This problem is a major challenge not only for LLMs but also for human experts. We therefore propose a reward design method aimed specifically at complex reinforcement learning tasks, Reductionist Reinforcement Reward Design (R3D), and combine it with the generative capabilities of LLMs to build the LLM-based Reward Co-design System (LLMRCS). Following a hierarchical reinforcement learning approach, R3D decomposes a complex task into multiple sub-tasks, designs a sub-task reward function for each, and finally composes them into an effective overall reward function. R3D not only simplifies the reward design process but also safeguards the effectiveness of rewards for complex tasks. LLMRCS builds on the earlier Eureka system with a set of improvements: it integrates R3D into the reward generation pipeline and can incorporate expert feedback during optimization. Experimental results show that R3D significantly improves learning efficiency on complex tasks. In the FrankaCubeStack task, the reward designed with R3D reached an 80% success rate with 50.9 times the training efficiency of the traditional method. LLMRCS with R3D can generate reward functions close to human-designed ones without human intervention, and in some tasks it even surpasses human experts. Our experiments also show that R3D not only eases the difficulty of designing rewards for complex tasks and improves learning efficiency, but also offers good explainability: by analyzing training results we can keep refining the reward function, turning R3D into a reward optimization method. Finally, we extend the design philosophy of R3D by modifying the final reward so that the agent exhibits a blend of multiple strategies, demonstrating the broad applicability of R3D and its potential for future development.
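To make the decomposition idea above concrete, the following is a minimal sketch, assuming a FrankaCubeStack-like setup, of how sub-task reward terms might be composed into one overall reward in the spirit of R3D. The sub-task names, weights, and stage gating are illustrative assumptions, not the reward functions actually used in the thesis.

import math

def reach_reward(gripper_pos, cube_pos):
    # Sub-task 1: bring the gripper close to the cube (dense shaping term).
    return 1.0 / (1.0 + math.dist(gripper_pos, cube_pos))

def lift_reward(cube_height, table_height):
    # Sub-task 2: lift the cube above the table surface.
    return max(0.0, cube_height - table_height)

def stack_reward(cube_pos, target_pos, tol=0.02):
    # Sub-task 3: sparse bonus once the cube sits on the target location.
    return 1.0 if math.dist(cube_pos, target_pos) < tol else 0.0

def composed_reward(obs, stage):
    # R3D-style composition: each sub-task contributes its own term, and the
    # final reward is a weighted, stage-gated sum of those terms.
    r = 0.5 * reach_reward(obs["gripper_pos"], obs["cube_pos"])
    if stage >= 1:   # lifting matters only after the cube has been reached
        r += 1.0 * lift_reward(obs["cube_pos"][2], obs["table_height"])
    if stage >= 2:   # stacking bonus only after the cube has been lifted
        r += 5.0 * stack_reward(obs["cube_pos"], obs["target_pos"])
    return r

Because each term corresponds to one sub-task, a plateau in any single term points directly at the sub-task whose reward needs adjustment, which is one way to read the explainability claim in the abstract.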
Abstract (English) The emergence of Large Language Models (LLMs) has led to numerous groundbreaking advancements in the field of robotic control. However, current systems that use LLMs to automatically design reward functions for reinforcement learning tasks face significant challenges when dealing with complex tasks. This issue presents a major hurdle not only for LLMs but also for human experts. To address this, we propose Reductionist Reinforcement Reward Design (R3D). We further integrate the generative capabilities of LLMs to create the LLM-based Reward Co-design System (LLMRCS). R3D employs a hierarchical reinforcement learning approach to decompose complex tasks into multiple sub-tasks, each with its own sub-task reward function, which are then combined to form an effective overall reward function. This method simplifies the reward design process and ensures the efficacy of the rewards for complex tasks. Building upon the earlier Eureka system, LLMRCS incorporates improvements by integrating R3D into the reward generation process and leveraging expert input during optimization. Experimental results demonstrate that R3D significantly enhances learning efficiency for complex tasks. In the FrankaCubeStack task, the reward designed using R3D reached an 80% success rate with 50.9 times the training efficiency of the traditional method. Additionally, LLMRCS can autonomously generate reward functions that are comparable to those designed by humans, and in some tasks it even surpasses human experts. Our experiments also reveal that R3D not only reduces the difficulty of designing rewards for complex tasks and improves learning efficiency but also offers excellent explainability. By analyzing training results, we can continuously refine reward functions, positioning R3D as a reward optimization method. Finally, we extend the design principles of R3D, demonstrating its potential to produce agents that exhibit a blend of multiple strategies by modifying the final reward. This showcases the broad applicability and future development potential of R3D.
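The co-design loop described above can be pictured as an Eureka-style generate/train/feedback iteration with an expert-feedback hook added. The sketch below shows only the control flow; the callables it expects (an LLM proposal step, a training/evaluation step, and a feedback step) are hypothetical placeholders, not the actual LLMRCS interfaces.

from typing import Callable, List, Tuple

def co_design_loop(
    propose: Callable[[str], List[str]],       # LLM: feedback text -> candidate reward code
    evaluate: Callable[[str], float],          # train a policy with a candidate, return its score
    make_feedback: Callable[[List[Tuple[str, float]]], str],  # training stats plus optional expert comments
    iterations: int = 5,
) -> Tuple[str, float]:
    # Eureka-style loop as described in the abstract: generate candidates,
    # train and score them, then fold feedback into the next round's prompt.
    feedback = ""
    best_code, best_score = "", float("-inf")
    for _ in range(iterations):
        candidates = propose(feedback)                              # 1. LLM generates reward functions
        results = [(code, evaluate(code)) for code in candidates]   # 2. train and evaluate each candidate
        code, score = max(results, key=lambda r: r[1])
        if score > best_score:
            best_code, best_score = code, score
        feedback = make_feedback(results)                           # 3. automatic metrics + expert feedback
    return best_code, best_score

In LLMRCS, per the abstract, the proposal step additionally incorporates R3D, so candidates are built from sub-task reward terms rather than written as one monolithic function.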
Keywords (Chinese) ★ reinforcement learning (強化學習)
★ hierarchical reinforcement learning (階層式強化學習)
★ large language model (大型語言模型)
★ reward function (獎勵函數)
Keywords (English) ★ reinforcement learning
★ hierarchical reinforcement learning
★ LLM
★ reward function
Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Research Background
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2: Literature Review
2.1 Reinforcement Learning
2.1.1 Markov Decision Processes
2.1.2 Value-Based and Action-Based Reinforcement Learning Methods
2.1.3 Actor-Critic Methods
2.1.4 Hierarchical Reinforcement Learning
2.2 LLMs and Their Applications
2.2.1 LLMs and Task Planning
2.2.2 LLMs as Optimization Tools
2.2.3 LLM-Based Reward Generation Tools for Reinforcement Learning
Chapter 3: Reward Design Method
3.1 Complex Tasks and Reward Design
3.2 Reductionist Reinforcement Reward Design (R3D)
Chapter 4: LLM-Based Reward Co-design System
4.1 System Architecture
4.2 Implementation of R3D
4.3 Optimization Process Incorporating Expert Knowledge
Chapter 5: Experiments
5.1 Experiments on the Direct Use of R3D
5.1.1 Experimental Environment
5.1.2 Experimental Results
5.1.3 Properties of R3D
5.2 Reward Generation Experiments with LLMRCS
5.2.1 Experimental Environment
5.2.2 Experimental Results
5.3 Extended Application Experiments of R3D
Chapter 6: Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
Appendices
Appendix 1: LLMRCS Prompts Designed for R3D
R3D_code_output_tip
R3D_code_feedback
R3D_policy_feedback
References
[1] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, "ProgPrompt: Generating situated robot task plans using large language models," in 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11523-11530, 2023.
[2] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, "Voyager: An open-ended embodied agent with large language models," arXiv preprint arXiv:2305.16291, 2023.
[3] W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. G. Arenas, H.-T. L. Chiang, T. Erez, L. Hasenclever, and J. Humplik, "Language to rewards for robotic skill synthesis," arXiv preprint arXiv:2306.08647, 2023.
[4] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, "Eureka: Human-level reward design via coding large language models," arXiv preprint arXiv:2310.12931, 2023.
[5] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen, "Large language models as optimizers," arXiv preprint arXiv:2309.03409, 2023.
[6] S. Booth, W. B. Knox, J. Shah, S. Niekum, P. Stone, and A. Allievi, "The perils of trial-and-error reward design: misdesign through overfitting and invalid task specifications," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 5, pp. 5920-5929, 2023.
[7] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, and A. Handa, "Isaac Gym: High-performance GPU-based physics simulation for robot learning," arXiv preprint arXiv:2108.10470, 2021.
[8] S. Pateria, B. Subagdja, A.-h. Tan, and C. Quek, "Hierarchical reinforcement learning: A comprehensive survey," ACM Computing Surveys (CSUR), vol. 54, no. 5, pp. 1-35, 2021.
[9] M. L. Puterman, "Markov decision processes," Handbooks in operations research and management science, vol. 2, pp. 331-434, 1990.
[10] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[11] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Advances in Neural Information Processing Systems, vol. 12, 1999.
[12] S. Gronauer and K. Diepold, "Multi-agent deep reinforcement learning: a survey," Artificial Intelligence Review, pp. 1-49, 2022.
[13] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, and P. Abbeel, "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[14] G. Kwon, B. Kim, and N. K. Kwon, "Reinforcement Learning with Task Decomposition and Task-Specific Reward System for Automation of High-Level Tasks," Biomimetics, vol. 9, no. 4, p. 196, 2024.
[15] R. T. Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, "Reward machines: Exploiting reward function structure in reinforcement learning," Journal of Artificial Intelligence Research, vol. 73, pp. 173-208, 2022.
[16] Z. Juozapaitis, A. Koul, A. Fern, M. Erwig, and F. Doshi-Velez, "Explainable reinforcement learning via reward decomposition," in IJCAI/ECAI Workshop on explainable artificial intelligence, 2019.
[17] Y. Septon, T. Huber, E. André, and O. Amir, "Integrating policy summaries with reward decomposition for explaining reinforcement learning agents," in International Conference on Practical Applications of Agents and Multi-Agent Systems, pp. 320-332, 2023.
[18] C.-H. Chen, M.-Y. Lin, and X.-C. Guo, "High-level modeling and synthesis of smart sensor networks for Industrial Internet of Things," Computers & Electrical Engineering, vol. 61, pp. 48-66, 2017.
[19] S. Gronauer and K. Diepold, "Multi-agent deep reinforcement learning: a survey," Artificial Intelligence Review, vol. 55, no. 2, pp. 895-943, 2022.
Advisor: Ching-Han Chen (陳慶瀚)   Date of Approval: 2024-07-23