Deep Reinforcement Learning Algorithm and Architecture Design for Solving Maze Problem

Please use this permanent URL to cite or link to this document: https://ir.lib.ncu.edu.tw/handle/987654321/82889

Title:	應用於地圖探索之深度強化學習演算法與硬體架構設計;Deep Reinforcement Learning Algorithm and Architecture Design for Solving Maze Problem
Author:	蘇俊達;Su, Juyn-Da
Contributor:	Department of Electrical Engineering
Keywords:	Deep reinforcement learning;prioritized experience replay
Date:	2020-01-21
Upload time:	2020-06-05 17:40:24 (UTC+8)
Publisher:	National Central University
Abstract:	Deep reinforcement learning has drawn attention with the recent explosive growth in computing power. An agent learns through its interactions with an environment. This thesis applies a deep reinforcement learning algorithm with prioritized experience replay to the map exploration (maze) problem and evaluates learning on several types of 12×12 maps. When the shortest path is short and the trap is close to the starting point, higher quantization precision may perturb learning and slow it down; when the shortest path is longer and the trap is farther from the starting point, in the middle of the path, learning takes longer and higher precision becomes essential. For this problem, we design a hardware accelerator that speeds up both training and inference. The deep Q-network is a multilayer perceptron with 144 input nodes, 72 hidden-layer nodes, and 4 output nodes. The hardware architecture consists of multiple processing elements (PEs). For reconfigurability, each PE contains 4 multipliers and 4 adders, the smallest unit the network architecture requires; this makes it easy to adjust the computation order during scheduling, supports both forward and backward propagation, and makes parameter updates more efficient. Because of the temporal dependence in deep reinforcement learning, the hardware needs higher precision: quantization errors are fed back into the neural network with every parameter update. To support high-precision computation in the PEs, we propose a custom block floating-point format with 1 sign bit, 7 exponent bits, 24 fraction bits, and 8 fixed exponent settings. Its shared exponent avoids the overhead of floating-point adders, reducing hardware complexity without losing much precision. Compared with using floating-point adders, learning performance degrades by only 0.4 steps, which is 1.42% of the shortest path.
The architecture uses two neural network computation units: one unit completes forward propagation (one inference) in 155 clock cycles, and the two units together complete backward propagation (one training step) in 156 clock cycles. The utilization of this work is 1.4 times higher than that of the conventional work. Synthesis results show that one floating-point multiplier paired with one custom block floating-point adder occupies 32.8% less cell area than one floating-point multiplier paired with one floating-point adder.
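The shared-exponent idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the 24-bit fraction width comes from the abstract, but the rule for picking the shared exponent, the saturation behaviour, the unbounded accumulator, and all function names are assumptions made for illustration. The 7-bit exponent field and the 8 fixed exponent settings are not modelled.

```python
import math

FRACTION_BITS = 24  # fraction width stated in the abstract


def shared_exponent(values):
    """Pick one exponent e for the block so that every |v| < 2**e.

    Assumed selection rule; the thesis instead uses 8 fixed settings.
    """
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return 0
    # frexp gives largest = m * 2**e with 0.5 <= m < 1
    return math.frexp(largest)[1]


def to_block(values, exponent):
    """Quantize values to signed integer mantissas that share one exponent."""
    scale = 2.0 ** (FRACTION_BITS - exponent)
    limit = (1 << FRACTION_BITS) - 1
    mantissas = []
    for v in values:
        m = round(v * scale)
        mantissas.append(max(-limit, min(limit, m)))  # saturate on overflow
    return mantissas


def from_block(mantissa, exponent):
    """Convert an integer mantissa back to a Python float."""
    return mantissa * 2.0 ** (exponent - FRACTION_BITS)


def block_sum(values):
    """Sum a block using integer adds only.

    Because all mantissas share one exponent, no per-operand alignment
    or normalization is needed. The accumulator is a Python int, so it
    is assumed to be wide enough to hold the total.
    """
    exponent = shared_exponent(values)
    total = sum(to_block(values, exponent))
    return from_block(total, exponent)


if __name__ == "__main__":
    data = [0.1, 0.2, 0.3]
    print("block floating-point sum:", block_sum(data))
    print("floating-point sum:      ", sum(data))
```

Running the script prints both sums for a small block, which makes the quantization error introduced by the shared exponent visible. Small values in a block with one large value lose low-order bits, which is the precision trade-off the abstract describes.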
Appears in collections:	[Graduate Institute of Electrical Engineering] Master's and Doctoral Theses

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	243

All items in NCUIR are protected by copyright.