References
[1] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279-292, 1992.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[4] H. Baier and P. D. Drake, "The power of forgetting: Improving the last-good-reply policy in Monte Carlo Go," IEEE Transactions on Computational Intelligence and AI in Games, vol. 2, no. 4, pp. 303-309, 2010.
[5] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, and A. Bolton, "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354-359, 2017.
[6] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, and T. Graepel, "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv:1712.01815, 2017.
[7] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, and T. Graepel, "Mastering Atari, Go, chess and shogi by planning with a learned model," Nature, vol. 588, no. 7839, pp. 604-609, 2020.
[8] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, and A. Ray, "Learning dexterous in-hand manipulation," The International Journal of Robotics Research, vol. 39, no. 1, pp. 3-20, 2020.
[9] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[10] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, and R. Ribas, "Solving Rubik's Cube with a robot hand," arXiv preprint arXiv:1910.07113, 2019.
[11] H. Nguyen and H. La, "Review of deep reinforcement learning for robot manipulation," in 2019 Third IEEE International Conference on Robotic Computing (IRC), pp. 590-595, 2019.
[12] N. Vithayathil Varghese and Q. H. Mahmoud, "A survey of multi-task deep reinforcement learning," Electronics, vol. 9, no. 9, p. 1363, 2020.
[13] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[14] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in International Conference on Machine Learning, pp. 1587-1596, 2018.
[15] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, "Hindsight experience replay," Advances in Neural Information Processing Systems, vol. 30, 2017.
[16] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation," Advances in Neural Information Processing Systems, vol. 29, 2016.
[17] M. L. Puterman, "Markov decision processes," Handbooks in Operations Research and Management Science, vol. 2, pp. 331-434, 1990.
[18] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Advances in Neural Information Processing Systems, vol. 12, 1999.
[19] I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: Standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1291-1307, 2012.
[20] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, pp. 1889-1897, 2015.
[21] T. Schaul, D. Horgan, K. Gregor, and D. Silver, "Universal value function approximators," in International Conference on Machine Learning, pp. 1312-1320, 2015.
[22] P. Rauber, A. Ummadisingu, F. Mutz, and J. Schmidhuber, "Hindsight policy gradients," arXiv preprint arXiv:1711.06006, 2017.
[23] 郭子聖, "Hindsight proximal policy optimization algorithms for reinforcement learning," Master's thesis, National Chiao Tung University, 2018.
[24] H. Zhang, S. Bai, X. Lan, D. Hsu, and N. Zheng, "Hindsight trust region policy optimization," arXiv preprint arXiv:1907.12439, 2019.
[25] C.-H. Chen, M.-Y. Lin, and X.-C. Guo, "High-level modeling and synthesis of smart sensor networks for Industrial Internet of Things," Computers & Electrical Engineering, vol. 61, pp. 48-66, 2017.