Thesis Record 105521136: Detailed Information




Name: Kai-Huan Shen (沈楷桓)    Department: Electrical Engineering
Thesis title: 搭配優先經驗回放與經驗分類於深度強化式學習之記憶體減量技術
(Memory Reduction through Experience Classification with Prioritized Experience Replay in Deep Reinforcement Learning)
Related theses
★ 具輸出級誤差消除機制之三位階三角積分D類放大器設計
★ 應用於無線感測網路之多模式低複雜度收發機設計
★ 用於數位D類放大器的高效能三角積分調變器設計
★ 交換電容式三角積分D類放大器電路設計
★ 適用於平行處理及排程技術的無衝突定址法演算法之快速傅立葉轉換處理器設計
★ 適用於IEEE 802.11n之4×4多輸入多輸出偵測器設計
★ 應用於無線通訊系統之同質性可組態記憶體式快速傅立葉處理器
★ 3GPP LTE正交分頻多工存取下行傳輸之接收端細胞搜尋與同步的設計與實現
★ 應用於3GPP-LTE下行多天線接收系統高速行駛下之通道追蹤與等化
★ 適用於正交分頻多工系統多輸入多輸出訊號偵測之高吞吐量QR分解設計
★ 應用於室內極高速傳輸無線傳輸系統之 設計與評估
★ 適用於3GPP LTE-A之渦輪解碼器硬體設計與實作
★ 下世代數位家庭之千兆級無線通訊系統
★ 協作式通訊於超寬頻通訊系統之設計
★ 適用於3GPP-LTE系統高行車速率基頻接收機之設計
★ 多使用者多輸入輸出前編碼演算法及關鍵組件設計
Files: The full text is not available for browsing in the system (file missing).
Abstract (Chinese): Prioritized experience replay is widely used in online reinforcement learning algorithms so that past experiences can be exploited more efficiently, but the large replay memory it requires consumes a significant amount of system storage. In this work we therefore propose a segmentation and classification scheme to mitigate this effect. The temporal-difference (TD) error computed for each experience during learning indicates which situations have not yet been learned, since those produce larger errors. We use this information to build several segments that govern how the replay memory is overwritten: the TD errors of the stored experiences are ranked to form a cumulative distribution function, a new hyper-parameter S controls the number of segments, and the cumulative distribution is divided into S segments whose member counts and boundaries are computed. A new experience is then classified by its TD error and exchanges data with the memory address of an experience in the corresponding segment, which changes its lifetime. In addition, a frozen region that moves with the write-data pointer is defined; experiences stored at addresses in this region are never replaced, so that experiences cannot survive too long through swapping. As the networks are updated, the segmentation information is refreshed periodically to avoid managing the memory with stale information. Through this mechanism the lifetime of already-learned experiences is altered and that of high-value experiences is extended, with the segmentation and classification derived entirely from TD-error information. The proposed scheme is combined with the deep deterministic policy gradient algorithm and verified on the Inverted Pendulum and Inverted Double Pendulum tasks. The experiments show that the proposed mechanism effectively removes redundant content from the replay memory and reduces the correlation among the stored data, achieving better learning performance with less memory at the cost of additional TD-error computations.
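As a reading aid for the segmentation step described above, the segment boundaries can be written in terms of the empirical cumulative distribution of the absolute TD errors; the equal-population split below is one plausible illustration consistent with the abstract, not necessarily the thesis's exact rule (N is the number of stored experiences and \delta_i their TD errors):

F(\delta) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\, |\delta_i| \le \delta \,\}, \qquad
b_k = \min\{\, \delta : F(\delta) \ge k/S \,\}, \quad k = 1, \dots, S-1,

so that segment k collects the experiences whose |\delta_i| lies between b_{k-1} and b_k (with b_0 = 0 and b_S = \max_i |\delta_i|); the boundaries and member counts would be recomputed periodically as the networks are updated.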
Abstract (English): Prioritized experience replay is widely used in many online reinforcement learning algorithms because it exploits past experiences efficiently. However, the large replay buffer it requires consumes system storage significantly. Thus, in this thesis, a segmentation and classification scheme is proposed to change the lifetime of well-learned and valuable experiences. The segmentation and classification are based on the temporal-difference errors (TD errors) of the stored experiences. From the ranking of the TD errors, we compute the cumulative distribution function, divide it into S segments, where S is a newly introduced hyper-parameter, and obtain the member count and boundaries of each segment. Incoming experiences are classified into the corresponding segments according to their TD errors and are swapped with data in the same segment. In addition, we define a frozen region, which follows the write-data pointer of the replay buffer, to prevent experiences from surviving too long under the proposed scheme. The proposed scheme is incorporated into the Deep Deterministic Policy Gradient (DDPG) algorithm, and the Inverted Pendulum and Inverted Double Pendulum tasks are used for verification. The experiments show that the proposed mechanism effectively removes buffer redundancy; better learning performance with a reduced memory size is thus achieved at the cost of additional computations for updating TD errors.
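To make the buffer-management idea in the abstracts concrete, below is a minimal Python sketch of a classification-swap replay buffer. It is an illustration under stated assumptions rather than the thesis implementation: the equal-population segmentation, the random choice of swap partner, and the names and defaults `num_segments`, `frozen_len`, and `refresh_segments` are assumptions introduced here.

```python
import random
from collections import namedtuple

# Transition tuple stored in the buffer (fields are illustrative).
Experience = namedtuple("Experience", "state action reward next_state done")


class ClassifiedReplayBuffer:
    """Circular replay buffer with TD-error segmentation, classification-swap,
    and a frozen region that moves with the write pointer (sketch only)."""

    def __init__(self, capacity, num_segments=8, frozen_len=64):
        self.capacity = capacity
        self.S = num_segments          # hyper-parameter S from the abstract
        self.frozen_len = frozen_len   # length of the frozen region (assumed)
        self.data = [None] * capacity
        self.td = [0.0] * capacity     # |TD error| stored per slot
        self.write_ptr = 0
        self.size = 0
        self.boundaries = []           # upper |TD error| bounds of segments 0..S-2

    def refresh_segments(self):
        """Rebuild segment boundaries from the ranked |TD errors|; called
        periodically so the partition does not rely on stale information."""
        if self.size < self.S:
            self.boundaries = []
            return
        ranked = sorted(self.td[:self.size])
        self.boundaries = [ranked[(k * self.size) // self.S]
                           for k in range(1, self.S)]

    def _segment(self, td_error):
        """Index of the segment whose |TD error| range contains td_error."""
        s = 0
        while s < len(self.boundaries) and td_error > self.boundaries[s]:
            s += 1
        return s

    def _frozen(self, idx):
        """Slots the write pointer will reach within frozen_len steps cannot be
        swap targets, so no experience can dodge overwriting indefinitely."""
        return (idx - self.write_ptr) % self.capacity < self.frozen_len

    def add(self, experience, td_error):
        """Write a new experience, then swap it with a same-segment slot."""
        td_error = abs(td_error)
        # 1) Normal circular write at the write pointer (oldest slot).
        self.data[self.write_ptr] = experience
        self.td[self.write_ptr] = td_error
        # 2) Classification-swap once the buffer is full and segmented:
        #    exchange the fresh slot with a random non-frozen slot whose
        #    stored |TD error| falls in the same segment.
        if self.size == self.capacity and self.boundaries:
            seg = self._segment(td_error)
            candidates = [i for i in range(self.capacity)
                          if i != self.write_ptr
                          and not self._frozen(i)
                          and self._segment(self.td[i]) == seg]
            if candidates:
                j = random.choice(candidates)
                self.data[self.write_ptr], self.data[j] = self.data[j], self.data[self.write_ptr]
                self.td[self.write_ptr], self.td[j] = self.td[j], self.td[self.write_ptr]
        # 3) Advance the pointer; the frozen region moves along with it.
        self.write_ptr = (self.write_ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)
```

In a DDPG training loop one would call `refresh_segments()` every so often, mirroring the periodic refresh mentioned in the Chinese abstract, and pass each new transition to `add()` together with its latest TD error.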
Keywords (Chinese) ★ 深度強化式學習
★ 深度確定性決策梯度
★ 優先經驗回放
Keywords (English) ★ Deep reinforcement learning
★ Deep deterministic policy gradient
★ Prioritized experience replay
Table of contents
搭配優先經驗回放與經驗分類於深度強化式學習之記憶體減量技術
CHINESE ABSTRACT
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF EQUATIONS
LIST OF ALGORITHMS
1. INTRODUCTION
1.1. INTRODUCTION
1.2. MOTIVATION
1.3. THESIS STRUCTURE
2. DEEP REINFORCEMENT LEARNING
2.1. MARKOV DECISION PROCESS
2.1.1. MARKOV PROPERTY
2.1.2. MARKOV PROCESS (OR MARKOV CHAIN)
2.1.3. MARKOV REWARD PROCESS (MRP)
2.1.4. MARKOV DECISION PROCESS (MDP)
2.2. Q LEARNING AND SARSA
2.2.1. Q LEARNING
2.2.2. SARSA
2.3. DEEP Q NETWORK
2.3.1. NEURAL NETWORK
2.3.2. DEEP Q NETWORK (DQN)
2.3.3. EXPERIENCE REPLAY IN DQN
2.4. DEEP DETERMINISTIC POLICY GRADIENT
2.4.1. ORNSTEIN-UHLENBECK PROCESS
2.4.2. DEEP DETERMINISTIC POLICY GRADIENT
2.5. OPTIMIZER
2.5.1. GRADIENT DESCENT
2.5.2. STOCHASTIC GRADIENT DESCENT (SGD)
2.5.3. SGD WITH MOMENTUM
2.5.4. SGD WITH NESTEROV ACCELERATION (NAG)
2.5.5. ADAPTIVE GRADIENT (ADAGRAD)
2.5.6. ADADELTA
2.5.7. ADAM
2.5.8. ADAMAX
2.5.9. NADAM
3. PRIORITIZED EXPERIENCE REPLAY AND PROPOSED SCHEME
3.1. CONVENTIONAL PRIORITIZED EXPERIENCE REPLAY
3.2. PROPOSED PRIORITIZED EXPERIENCE REPLAY WITH CLASSIFICATION SCHEME
3.2.1. CLASSIFIER - CURVE APPROXIMATION
3.2.2. CLASSIFICATION-SWAP - COMPLETE SEGMENTATION
3.2.3. CLASSIFICATION-SWAP - PARTIAL SEGMENTATION
3.2.4. FROZEN REGION
3.2.5. OVERHEAD OF CLASSIFICATION SCHEME
4. IMPLEMENTATION RESULTS OF PRIORITIZED EXPERIENCE REPLAY WITH CLASSIFICATION SCHEME
4.1. PROPOSED MECHANISM ON DDPG FOR INVERTED PENDULUM
4.1.1. SIMULATION ENVIRONMENT - INVERTED PENDULUM
4.1.2. SIMULATION RESULTS
4.2. PROPOSED MECHANISM ON DDPG FOR INVERTED DOUBLE PENDULUM
4.2.1. SIMULATION ENVIRONMENT - INVERTED DOUBLE PENDULUM
4.2.2. SIMULATION RESULTS
5. CONCLUSION
6. REFERENCE
APPENDIX A: MORE SIMULATION RESULTS
A.1 INVERTED PENDULUM
A.2 INVERTED DOUBLE PENDULUM
References
[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. Riedmiller, "Playing Atari with Deep Reinforcement Learning," in NIPS Deep Learning Workshop, 2013.
[2] L. J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Machine Learning, vol. 8, pp. 293-321, 01 May 1992.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529-533, 25 Feb 2015.
[4] A. W. Moore and C. G. Atkeson, "Prioritized sweeping: Reinforcement learning with less data and less time," Machine Learning, vol. 13, pp. 103-130, 01 Oct 1993.
[5] T. Schaul, J. Quan, I. Antonoglou and D. Silver, "Prioritized Experience Replay," in International Conference on Learning Representations, 2016.
[6] S. Zhang and R. S. Sutton, "A Deeper Look at Experience Replay," arXiv e-prints, p. arXiv:1712.01275, Dec 2017.
[7] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver and D. Wierstra, "Continuous control with deep reinforcement learning," CoRR, vol. abs/1509.02971, Sep 2015.
[8] R. S. Sutton, D. McAllester, S. Singh and Y. Mansour, "Policy Gradient Methods for Reinforcement Learning with Function Approximation," in Advances in Neural Information Processing Systems 12, MIT Press, 2000, pp. 1057-1063.
[9] V. R. Konda and J. N. Tsitsiklis, "On Actor-Critic Algorithms," SIAM journal on Control and Optimization, vol. 42, pp. 1143-1166, Apr 2003.
[10] Y. Hou, L. Liu, Q. Wei, X. Xu and C. Chen, "A novel DDPG method with prioritized experience replay," in Proceedings of IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2017, pp. 316-321.
[11] C. J. C. H. Watkins, "Learning from Delayed Rewards," Ph.D. dissertation, King's College, Cambridge, 1989.
[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Cambridge, MA: MIT Press, 1998.
[13] G. A. Rummery and M. Niranjan, "On-Line Q-Learning Using Connectionist Systems," Cambridge University Engineering Department, Cambridge, 1994.
[14] K. Fukushima, "Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position," Biological Cybernetics, vol. 36, pp. 193-202, 1980.
[15] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra and M. Riedmiller, "Deterministic Policy Gradient Algorithms," in Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 387-395.
[16] J. Kiefer and J. Wolfowitz, "Stochastic Estimation of the Maximum of A Regression Function," Annals of Mathematical Statistics, vol. 23, pp. 462-466, Sep 1952.
[17] N. Qian, "On the Momentum Term in Gradient Descent Learning Algorithms," Neural Networks, vol. 12, pp. 145-151, Jan 1999.
[18] Y. E. Nesterov, "A method for solving the convex programming problem with convergence rate O(1/k^2)," Dokl. Akad. Nauk SSSR, vol. 269, pp. 543-547, 1983.
[19] J. Duchi, E. Hazan and Y. Singer, "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," J. Mach. Learn. Res., vol. 12, pp. 2121-2159, Jul 2011.
[20] M. D. Zeiler, "ADADELTA: An Adaptive Learning Rate Method," CoRR, vol. abs/1212.5701, 2012.
[21] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," CoRR, vol. abs/1412.6980, 2014.
[22] T. Dozat, "Incorporating Nesterov Momentum into Adam," in ICLR Workshop, 2016.
[23] E. Todorov, T. Erez and Y. Tassa, "MuJoCo: A physics engine for model-based control," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 5026-5033.
[24] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar and W. Zaremba, "Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research," arXiv e-prints, 2018.
Advisor: Pei-Yun Tsai (蔡佩芸)    Date of approval: 2019-06-25