This thesis proposes a hardware architecture for deep reinforcement learning applied to a hybrid beamforming system, in which a neural network learns the precoder. Given known channel state information, the deep deterministic policy gradient (DDPG) algorithm is used to obtain the phases of the analog precoder: the current analog precoder phases serve as the state, the agent takes an action, and the environment returns feedback and a reward defined as the channel capacity. The convergence trend of the reward, together with comparisons against conventional algorithms, indicates whether training succeeds. Based on this algorithm, we analyze how each network parameter affects the training results, using channel capacity as the performance metric.

With the network parameters chosen, a hardware architecture is designed to implement DDPG for the hybrid precoding system (DDPG on Hybrid Precoding Algorithm). The architecture consists of a critic evaluation network unit, a critic target network unit, an actor evaluation network unit, an actor target network unit, local memory, a square-root unit and a divider, and control logic. Each critic evaluation and target network unit contains 16 critic network computation slices, and each actor evaluation and target network unit contains 48 computation slices; every slice is built from basic processing elements, each containing four multipliers and four adders. The same processing elements carry out both the forward pass, which accumulates the quantities required for training, and the backward pass, which updates the network parameters; one forward pass takes 153 clock cycles and one backward pass takes 322. The design is implemented on a Xilinx VCU128, reaching an operating frequency of 95.2 MHz and utilizing 536,514 LUTs, 256 BRAMs, and 1,024 DSPs. It produces 18,332 results per second (IPS) at a batch size of 1 and 211,556 IPS at a batch size of 32. To verify the correctness of the network outputs and parameter updates, the hardware results are compared with a bit-true software model; the comparison shows no discrepancy, demonstrating that the hardware can train and update the deep reinforcement learning networks.
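As a concrete illustration of the reward signal described above, the sketch below computes the channel capacity obtained with a phase-only analog precoder, i.e., the quantity the DDPG agent receives from the environment. This is a minimal NumPy sketch assuming a narrowband channel and an analog-only precoder; the antenna counts, stream count, SNR, and function names are illustrative assumptions, not the configuration used in this thesis.

```python
import numpy as np

def analog_precoder(phases):
    """Phase-only analog precoder: every entry has constant modulus 1/sqrt(Nt)."""
    nt = phases.shape[0]
    return np.exp(1j * phases) / np.sqrt(nt)

def capacity_reward(H, phases, snr_db=0.0):
    """Channel capacity used as the DDPG reward (illustrative formulation).

    R = log2 det(I + (snr / Ns) * H F F^H H^H), where F is the analog
    precoder built from the agent's phase matrix.
    """
    F = analog_precoder(phases)          # Nt x Ns analog precoder
    ns = F.shape[1]
    snr = 10.0 ** (snr_db / 10.0)
    HF = H @ F                           # Nr x Ns effective channel
    M = np.eye(H.shape[0]) + (snr / ns) * (HF @ HF.conj().T)
    return np.real(np.log2(np.linalg.det(M)))

# Example (hypothetical sizes): Nr = 4 receive antennas, Nt = 16 transmit
# antennas, Ns = 4 streams; the 16x4 phase matrix plays the role of the state
# that the agent's action adjusts.
rng = np.random.default_rng(0)
H = (rng.standard_normal((4, 16)) + 1j * rng.standard_normal((4, 16))) / np.sqrt(2)
phases = rng.uniform(0, 2 * np.pi, size=(16, 4))
print(capacity_reward(H, phases, snr_db=0.0))
```

In a training loop, the agent would perturb the phase matrix, re-evaluate this reward, and store the transition for the critic and actor updates; the hardware architecture described above realizes those forward and backward passes.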