Abstract (English)
In recent years, with the development of artificial intelligence and the growing popularity of mobile computing, accelerators that support neural-network computation on edge devices have become an increasingly common design choice. This thesis presents a hardware design for deep reinforcement learning with prioritized experience replay, targeting a map-exploration problem in a 12×12 grid environment. We build a deep Q-network as a multilayer perceptron with 144 nodes in the input layer, 72 nodes in the hidden layer, and 4 nodes in the output layer.

The hardware architecture consists of an evaluation-network calculation unit, a target-network calculation unit, a Q-value comparator, an action selector, and a temporal-difference (TD) error operation unit. Each network calculation unit is composed of 18 network calculation slices, and each processing element (PE) of a slice contains 4 adders and 4 multipliers that support several calculation modes. The architecture performs forward propagation, backpropagation, and network-parameter updates. To reduce the hardware area, we use a customized block floating-point format consisting of 1 sign bit, a 7-bit exponent, and a 24-bit mantissa; an appropriate exponent is set and adjusted for each operation to preserve the required precision. Forward propagation takes 155 clock cycles, and one backpropagation pass takes 322 clock cycles. The design operates with a clock period of 11 ns and generates the correct Q values and priorities. According to the resource-utilization report, the design uses 150,987 LUTs, 72 BRAMs, and 290 DSPs. Measurement results demonstrate that the reinforcement-learning hardware fully realizes both inference and training.

In addition, we synthesize the multiply-accumulate (MAC) unit with Design Compiler. The synthesis results show that the customized block floating-point MAC saves up to 23.3% in area compared with the customized floating-point MAC at an operating frequency of 400 MHz.
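To make the number format concrete, here is a minimal Python sketch of a block floating-point MAC in the spirit of the customized format described above (1 sign bit, 7-bit exponent, 24-bit mantissa, with the exponent chosen per operation). The exponent selection, truncation rounding, and helper names are illustrative assumptions, not the thesis's actual RTL behavior.

```python
import math

# Bit widths from the abstract's customized block floating-point format.
EXP_BITS = 7    # shared (block) exponent width
MAN_BITS = 24   # signed mantissa width, so magnitudes use 23 bits

def quantize_block(values):
    """Share one exponent across a block of values: pick it from the
    largest magnitude so every truncated mantissa fits in MAN_BITS bits."""
    max_mag = max(abs(v) for v in values)
    if max_mag == 0.0:
        return 0, [0] * len(values)
    # Shift by MAN_BITS - 2 to leave headroom for the sign bit.
    exp = math.floor(math.log2(max_mag)) - (MAN_BITS - 2)
    assert -(1 << (EXP_BITS - 1)) <= exp < (1 << (EXP_BITS - 1))  # fits 7 bits
    mantissas = [int(v / 2.0 ** exp) for v in values]  # truncate toward zero
    return exp, mantissas

def bfp_mac(weights, activations):
    """Integer multiply-accumulate over two blocks; all products share the
    exponent w_exp + a_exp, so the accumulator is a plain integer."""
    w_exp, w_man = quantize_block(weights)
    a_exp, a_man = quantize_block(activations)
    acc = sum(w * a for w, a in zip(w_man, a_man))  # integer MAC, as in a PE
    return acc * 2.0 ** (w_exp + a_exp)             # rescale the result

if __name__ == "__main__":
    w = [0.5, -1.25, 0.031, 2.0]
    a = [1.0, 0.75, -0.5, 0.125]
    exact = sum(x * y for x, y in zip(w, a))
    print(f"BFP MAC: {bfp_mac(w, a):.6f}  exact: {exact:.6f}")
```

Sharing one exponent per block is what lets a PE's multipliers and adders remain integer datapaths; this is also the usual reason a block floating-point MAC is smaller than a full floating-point MAC, consistent with the 23.3% area saving reported above.

Similarly, here is a short sketch of the quantities the TD-error operation unit produces, following DQN [3] and prioritized experience replay [4]; gamma and eps are illustrative hyperparameters, and terminal-state handling is omitted for brevity.

```python
def td_error(q_eval, q_target_next, reward, action, gamma=0.99):
    """TD error: delta = r + gamma * max_a' Q_target(s', a') - Q_eval(s, a)."""
    return reward + gamma * max(q_target_next) - q_eval[action]

def priority(delta, eps=1e-6):
    """Proportional prioritization [4]: p = |delta| + eps."""
    return abs(delta) + eps
```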
References
[1] Y. Kim, D. Shin, J. Lee, Y. Lee and H.-J. Yoo, "A 0.55 V 1.1 mW Artificial Intelligence Processor With On-Chip PVT Compensation for Autonomous Mobile Robots," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 2, pp. 567-580, Feb. 2018.
[2] A. Amravati, S. B. Nasir, S. Thangadurai, I. Yoon and A. Raychowdhury, "A 55nm time-domain mixed-signal neuromorphic accelerator with stochastic synapses and embedded reinforcement learning for autonomous micro-robots," 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 124-126.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. Riedmiller, "Playing Atari with Deep Reinforcement Learning," in NIPS Deep Learning Workshop, 2013.
[4] T. Schaul, J. Quan, I. Antonoglou and D. Silver, “Prioritized Experience Replay,” in International Conference on Learning Representations, 2016.
[5] 蘇俊達, "Deep Reinforcement Learning Algorithm and Hardware Architecture Design for Map Exploration," unpublished master's thesis, Department of Electrical Engineering, National Central University, Taoyuan, Taiwan, 2021.
[6] D. Elam and C. Iovescu, "A Block Floating Point Implementation for an N-Point FFT on the TMS320C55x DSP," Texas Instruments Application Report SPRA948, Sep. 2003.
[7] S. Shao, J. Tsai, M. Mysior, W. Luk, T. Chau, A. Warren and B. Jeppesen, "Towards hardware accelerated reinforcement learning for application-specific robotic control," in International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 1-8, IEEE, 2018.
[8] J. Su, J. Liu, D. B. Thomas, and P. Y. Cheung, “Neural network based reinforcement learning acceleration on FPGA platforms,” ACM SIGARCH Computer Architecture News, vol. 44, no. 4, pp. 68–73, 2017.
[9] S. Shao and W. Luk, “Customised pearlmutter propagation: A hardware architecture for trust region policy optimisation,” in International Conference on Field Programmable Logic and Applications, pp. 1–6, IEEE, 2017.
[10] C. Guo, W. Luk, S. Q. S. Loh, A. Warren and J. Levine, "Customisable control policy learning for robotics," Proc. IEEE 30th Int. Conf. Appl.-Specific Syst. Archit. Processors (ASAP), vol. 2160, pp. 91-98, Jul. 2019.
[11] H. Cho, P. Oh, J. Park, W. Jung and J. Lee, "FA3C: FPGA-accelerated deep reinforcement learning", Proc. 24th Int. Conf. Architectural Support Program. Lang. Operating Syst., pp. 499-513, 2019.
[12] G. Dinelli, G. Meoni, E. Rapuano and L. Fanucci, "Advantages and Limitations of Fully on-Chip CNN FPGA-Based Hardware Accelerator," 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020, pp. 1-5, doi: 10.1109/ISCAS45731.2020.9180867.
[13] C.-B. Wu, C.-S. Wang and Y.-K. Hsiao, "Reconfigurable Hardware Architecture Design and Implementation for AI Deep Learning Accelerator," 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE), 2020, pp. 154-155, doi: 10.1109/GCCE50665.2020.9291854.