Abstract: In recent years, owing to the development of artificial intelligence technology and the growing popularity of mobile computing, accelerators that support neural network computation on edge devices have become an increasingly attractive option. This thesis applies deep reinforcement learning with prioritized experience replay to a 12×12 map environment to solve the map-exploration problem and verify its effectiveness, and presents a hardware accelerator for this environment that speeds up the training of deep reinforcement learning. The deep Q-network is built with a multilayer perceptron consisting of 144 input-layer nodes, 72 hidden-layer nodes, and 4 output-layer nodes. The hardware architecture consists of an evaluation-network computation unit, a target-network computation unit, a Q-value comparator, an action selector, and a temporal-difference (TD) error computation unit. Each network computation unit is composed of 18 network computation slices, whose smallest building block is the processing element (PE); each PE contains 4 adders and 4 multipliers that support various calculation modes, making it easy to adjust the computation order during scheduling. The architecture supports both forward propagation and backpropagation, and updates the network parameters accordingly. To realize these steps in hardware and reduce the hardware area, we use a customized block floating-point format consisting of 1 sign bit, a 7-bit exponent, and a 24-bit mantissa; an appropriate exponent is set and adjusted for each operation to preserve the required precision while reducing the area of the floating-point adders and the overall hardware complexity. The network computation unit takes 155 clock cycles to complete one forward propagation and 322 clock cycles to train on one data sample. The design was synthesized and measured on a Xilinx VCU128 with a clock period of 11 ns to verify that the Q values and priorities are computed correctly. According to the resource-utilization report, the design uses 150,987 LUTs, 72 BRAMs, and 290 DSPs, and the measurement results show no error: the hardware fully realizes both the inference and training modes of deep reinforcement learning. In addition, we synthesized the multiply-accumulate (MAC) unit with Design Compiler; the results show that the customized block floating-point MAC saves up to 23.3% in area compared with the floating-point MAC at an operating frequency of 400 MHz.
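To make the network described in the abstract concrete, the following is a minimal NumPy sketch of the 144-72-4 deep Q-network. Only the layer sizes come from the abstract; the ReLU activation, the random weight initialization, and the epsilon-greedy action selector are illustrative assumptions, not the thesis's stated design.

```python
# A minimal sketch of the 144-72-4 deep Q-network, assuming ReLU hidden
# activation and epsilon-greedy action selection (both assumptions).
import numpy as np

rng = np.random.default_rng(0)

# 12x12 map flattened to 144 inputs, 72 hidden nodes, 4 output Q-values.
W1 = rng.normal(0.0, 0.1, size=(144, 72))
b1 = np.zeros(72)
W2 = rng.normal(0.0, 0.1, size=(72, 4))
b2 = np.zeros(4)

def forward(state):
    """Forward propagation: state (144,) -> Q-values (4,)."""
    h = np.maximum(0.0, state @ W1 + b1)    # hidden layer with ReLU
    return h @ W2 + b2                      # one Q-value per action

def select_action(state, epsilon=0.1):
    """Epsilon-greedy action selector over the 4 Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(4))         # explore
    return int(np.argmax(forward(state)))   # exploit: Q-value comparator

state = rng.random(144)                     # a dummy flattened 12x12 observation
print(forward(state), select_action(state))
```

The argmax over the four outputs mirrors the Q-value comparator and action selector units in the hardware architecture.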
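The TD error computed by the hardware's TD error unit is also the quantity that drives prioritized experience replay. Below is a short sketch of how the TD error and the resulting transition priority are typically computed; the discount factor, the alpha exponent, and the priority formula are standard choices from the DQN/PER literature assumed here for illustration, as the abstract only states that Q values and priorities are produced.

```python
# A hedged sketch of the TD error and the prioritized-experience-replay
# priority. GAMMA and ALPHA are assumed values, not from the thesis.
import numpy as np

GAMMA = 0.99     # assumed discount factor
ALPHA = 0.6      # assumed priority exponent

def td_error(q_eval, q_target_next, action, reward, done):
    """TD error for one transition, using evaluation and target networks."""
    target = reward if done else reward + GAMMA * np.max(q_target_next)
    return target - q_eval[action]

def priority(delta, eps=1e-6):
    """Transition priority: |TD error|^alpha; eps avoids zero priority."""
    return (abs(delta) + eps) ** ALPHA

q_eval = np.array([0.2, 0.5, 0.1, 0.0])          # evaluation network output
q_target_next = np.array([0.3, 0.4, 0.6, 0.1])   # target network, next state
delta = td_error(q_eval, q_target_next, action=1, reward=1.0, done=False)
print(delta, priority(delta))
```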
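The customized block floating-point format can be illustrated with a simple quantize/dequantize pair. The bit widths (1 sign bit, 7-bit exponent, 24-bit mantissa) come from the abstract; the exponent-selection rule (largest magnitude in the block) and the signed-integer mantissa encoding are assumptions, since the thesis only states that an appropriate exponent is set per operation.

```python
# A sketch of the customized block floating-point (BFP) format under the
# assumption that one exponent is shared per block and chosen so the
# largest magnitude fits in the 24-bit mantissa.
import numpy as np

MANT_BITS = 24                      # mantissa width per value (from the abstract)
EXP_BITS = 7                        # shared exponent width (from the abstract)
EXP_BIAS = 2 ** (EXP_BITS - 1)      # assumed exponent bias

def bfp_quantize(block):
    """Quantize a block of floats to one shared exponent + integer mantissas."""
    block = np.asarray(block, dtype=np.float64)
    max_mag = np.max(np.abs(block))
    exp = int(np.ceil(np.log2(max_mag))) if max_mag > 0 else -EXP_BIAS
    scale = 2.0 ** (exp - (MANT_BITS - 1))
    mant = np.round(block / scale).astype(np.int64)   # sign folded into mantissa
    return exp, mant

def bfp_dequantize(exp, mant):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    return mant * 2.0 ** (exp - (MANT_BITS - 1))

vals = [0.75, -1.5, 0.001, 3.25]
exp, mant = bfp_quantize(vals)
print(exp, bfp_dequantize(exp, mant))   # close to the original values
```

Sharing one exponent across a block is what lets the hardware replace full floating-point adders with cheaper fixed-point-style logic, which is the source of the area saving reported for the BFP MAC.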