Deep Reinforcement Learning Algorithm and Architecture Design for Solving Maze Problem

Name

Juyn-Da Su (蘇俊達)

Department

Department of Electrical Engineering

Thesis title

Deep Reinforcement Learning Algorithm and Architecture Design for Solving Maze Problem
(應用於地圖探索之深度強化學習演算法與硬體架構設計)

Files

Browse the thesis in the system (available after 2025-1-20)

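Abstract (Chinese)

Deep reinforcement learning has drawn attention now that computing performance is growing explosively; it learns through the interaction of a hypothetical agent with its environment. This thesis uses a deep reinforcement learning algorithm with prioritized experience replay to solve the map exploration problem and examines the learning results on different 12×12 map layouts. When the shortest path is short and the trap is close to the starting point, high precision may perturb learning; when the shortest path is long and the trap is far from the starting point, learning takes longer and the precision requirement rises. For the map exploration problem, we design a hardware accelerator that speeds up both training and inference. The neural network has 144 input nodes, 72 hidden nodes, and 4 output nodes, and the deep Q network is built as a multilayer perceptron. The hardware architecture contains multiple processing elements (PEs). To provide reconfigurability, each PE contains 4 multipliers and 4 adders, which is the smallest unit the network architecture requires; this makes it easy to adjust the computation order during scheduling, supports both forward and backward propagation, and makes parameter updates more efficient. In addition, because of the temporal correlation in deep reinforcement learning, the hardware needs higher precision, since quantization error is fed back into the network with every parameter update. Because the PEs must support high-precision arithmetic, we propose using the block floating-point concept. The custom block floating-point representation has 1 sign bit, 7 exponent bits, 24 fraction bits, and 8 fixed exponent settings, which reduces the overhead of floating-point adders and lowers hardware complexity without losing much precision. Compared with using floating-point adders, learning performance degrades by only 0.4 steps, which is 1.42% of the shortest path. The architecture uses two neural network computation units in total: one unit completes forward propagation in 155 clock cycles, and two units complete backward propagation in 156 clock cycles. Synthesis results show that one floating-point multiplier paired with the custom block floating-point adder saves 32.8% area compared with one paired with a floating-point adder.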
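
The block floating-point idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's hardware format: the functions `to_block`, `from_block`, and `block_sum` are hypothetical names, the shared exponent is simply taken from the largest magnitude in the block, and the 7-bit exponent field and the 8 fixed exponent settings are not modeled. Only the 24-bit fraction width comes from the abstract. The sketch shows why a shared exponent lets additions run as plain integer adds.

```python
import math

FRACTION_BITS = 24  # fraction width quoted in the abstract


def to_block(values, fraction_bits=FRACTION_BITS):
    """Quantize floats to integer mantissas that share one exponent.

    Assumption: the shared exponent is derived from the largest magnitude
    in the block; the thesis instead uses fixed exponent settings.
    """
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return [0] * len(values), 0
    # frexp gives largest = m * 2**shared_exp with 0.5 <= m < 1
    _, shared_exp = math.frexp(largest)
    scale = 2.0 ** (fraction_bits - shared_exp)
    limit = 2 ** fraction_bits - 1  # keep magnitudes within the fraction width
    mantissas = [max(-limit, min(limit, round(v * scale))) for v in values]
    return mantissas, shared_exp


def from_block(mantissas, shared_exp, fraction_bits=FRACTION_BITS):
    """Convert shared-exponent mantissas back to floats."""
    scale = 2.0 ** (shared_exp - fraction_bits)
    return [m * scale for m in mantissas]


def block_sum(mantissas, shared_exp, fraction_bits=FRACTION_BITS):
    """Add with integer arithmetic only; no per-operand exponent alignment."""
    return sum(mantissas) * 2.0 ** (shared_exp - fraction_bits)


mantissas, shared_exp = to_block([0.75, -0.125, 0.5])
total = block_sum(mantissas, shared_exp)
```

Because every value in the block uses the same exponent, the adder never has to compare exponents or shift operands, which is the overhead a full floating-point adder carries. The cost is that small values in a block with a large maximum lose low-order bits.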

Abstract (English)

Deep reinforcement learning has recently gained prominence owing to the explosive growth of computing power. The agent learns from its experience of interacting with the environment. In this thesis, an architecture with a custom block floating-point format is proposed to accelerate both inference and training for a deep reinforcement learning algorithm. We apply deep reinforcement learning to different 12×12 maps. The results show that when the shortest path contains only a few steps and the trap is close to the start, higher quantization accuracy may slow down learning; on the other hand, when the shortest path is longer and the trap is set in the middle of the path, higher quantization accuracy is essential. The deep Q network that we implement has 144 input nodes, 72 hidden-layer nodes, and 4 output nodes. Our hardware architecture consists of multiple processing elements (PEs). Each PE, with 4 multipliers and 4 adders, is the basic building block and can be configured to support different applications. The custom block floating-point format contains 1 sign bit, 7 exponent bits, and 24 mantissa bits. The block floating-point representation uses a shared exponent, which eliminates the need for floating-point adders. The architecture completes one inference in 155 clock cycles and one training pass in 156 clock cycles. The utilization of this work is 1.4 times higher than that of the conventional work. According to the synthesis results, the cell area of one floating-point multiplier plus one custom block floating-point adder is 32.8% smaller than that of one floating-point multiplier plus one floating-point adder.
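
As a rough illustration of the network shape given in the abstract, the following Python sketch builds a 144-72-4 multilayer perceptron and picks an action with an ε-greedy rule. Only the layer sizes and the 12×12 map come from the abstract. The one-hot position encoding, the ReLU activation, the weight initialization range, the bias terms, and all function names are assumptions made for this example, not details confirmed by the record.

```python
import random

N_IN, N_HID, N_OUT = 144, 72, 4  # layer sizes quoted in the abstract
MAP_SIDE = 12                    # 12x12 map gives 144 cells


def one_hot_state(row, col):
    """Assumed encoding: 1.0 at the agent's cell, 0.0 elsewhere."""
    state = [0.0] * N_IN
    state[row * MAP_SIDE + col] = 1.0
    return state


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative choice)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases):
    """Fully connected layer: one dot product per output node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def q_values(state, params):
    """Forward pass returning one Q-value per action."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in dense(state, w1, b1)]  # ReLU is an assumption
    return dense(hidden, w2, b2)


def epsilon_greedy(q, epsilon, rng):
    """Explore with probability epsilon, otherwise take the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])


rng = random.Random(0)
params = (init_layer(N_IN, N_HID, rng), init_layer(N_HID, N_OUT, rng))
q = q_values(one_hot_state(0, 0), params)
action = epsilon_greedy(q, epsilon=0.1, rng=rng)
```

With these layer sizes, one forward pass takes 144×72 + 72×4 = 10,656 multiply-accumulate operations, not counting biases, which gives a sense of the workload the processing elements have to schedule.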

Keywords (Chinese)

★ 深度強化學習 (deep reinforcement learning)
★ 優先經驗回放 (prioritized experience replay)

Keywords (English)

★ Deep reinforcement learning
★ Prioritized experience replay

Table of contents

Chinese Abstract
English Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Overview
1.2 Research Motivation
1.3 Thesis Organization
Chapter 2 Reinforcement Learning
2.1 Markov Decision Process
2.1.1 Markov Property
2.1.2 Markov Chain (Markov Process)
2.1.3 Markov Reward Process (MRP)
2.1.4 Markov Decision Process (MDP)
2.2 Q-Learning and SARSA
2.2.1 ε-Greedy Policy
2.2.2 Q-Learning
2.2.3 SARSA
Chapter 3 Deep Reinforcement Learning
3.1 Deep Learning
3.1.1 Multilayer Perceptron
3.1.2 Neural Networks
3.2 Deep Q Network (DQN)
3.3 Prioritized Experience Replay
3.3.1 Experience Replay
3.3.2 Prioritized Experience Replay
3.4 Software Simulation Results
3.4.1 Simulation Environment Selection and Implementation
3.4.2 Simulation Results
Chapter 4 Hardware Architecture Design and Considerations
4.1 Deep Reinforcement Learning Hardware Architecture
4.2 Hardware Data Types
4.2.1 Numerical Dynamic Range Analysis
4.2.2 Hardware Word-Length Simulation Analysis
4.2.3 Block Floating-Point (BFP) Design
4.3 Hardware Design and Considerations
4.3.1 Multilayer Perceptron (MLP) Architecture Design
4.3.2 Processing Element (PE) Design
4.3.3 ε-Greedy Policy
4.3.4 Prioritized Experience Replay Tree Structure
4.3.5 Dataflow Control and Scheduling
Chapter 5 Comparison of Simulation Results
5.1 Hardware Computation Simulation Results
5.2 Simulation Results after Hardware Improvement
5.3 Overall Comparison
Chapter 6 Conclusion
References

Advisor

Pei-Yun Tsai (蔡佩芸)

Approval date

2020-1-21
