Deep neural networks (DNNs) are widely used in artificial intelligence (AI) applications such as object detection and image classification. Modern DNN models usually require a large amount of computation, so accelerators are commonly used for DNN inference to meet the performance requirements of different applications. In this thesis, we propose a DNN accelerator architecture based on the chiplet design method. The architecture consists of one base die and a scalable number of compute dies. The base die is composed of static random access memory (SRAM) buffers and controllers, and handles data transfer between the external dynamic random access memory (DRAM) and the compute dies. Each compute die consists of SRAM and processing elements (PEs). Based on this architecture, we also propose a design space exploration (DSE) method that explores the possible design choices for the base die and the compute dies under constraints on data bandwidth and end-to-end latency. The exploration results show that the defect density (in 1/mm²) and the bonding yield are the dominant factors determining the granularity of the compute dies.
To realize ResNet-50 inference under the constraints of 25.6 GB/s DRAM bandwidth, 4096 I/Os on the base die, and 12 ms end-to-end latency, a system of one base die and two compute dies achieves the minimum fabrication cost. When the defect density increases, partitioning into more compute dies pays off in lower cost; when the bonding yield decreases, partitioning into fewer compute dies lowers the cost effectively. To verify the proposed chiplet-based DNN accelerator architecture, we implemented an accelerator for MobileNet inference on a Xilinx ZCU-102 evaluation board. The accelerator consists of one base die and one compute die. The implementation results show that an end-to-end latency of 25 ms is achieved at an operating frequency of 100 MHz.
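The trade-off between defect density and bonding yield described above can be illustrated with a small cost sketch. This is not the cost model used in the thesis, which the abstract does not specify. It assumes a Poisson die-yield model, a fixed per-die interface area overhead, known-good-die testing before assembly, and an independent bonding yield per compute die. All function names, parameter names, and numeric values are hypothetical and chosen only for illustration.

```python
import math


def die_yield(area_mm2, defect_density_per_mm2):
    """Poisson yield model (an assumption): Y = exp(-D * A)."""
    return math.exp(-defect_density_per_mm2 * area_mm2)


def system_cost(n_compute_dies, compute_area_mm2, base_area_mm2,
                defect_density_per_mm2, bonding_yield,
                overhead_mm2_per_die=5.0, cost_per_mm2=1.0):
    """Relative cost of one good system built from n compute dies.

    Assumptions (hypothetical, not taken from the thesis):
      * the compute logic is split evenly across the compute dies,
      * every compute die pays a fixed die-to-die interface area overhead,
      * dies are tested before assembly, so each die's cost is divided
        by its own yield,
      * each bonding step succeeds independently with `bonding_yield`.
    """
    die_area = compute_area_mm2 / n_compute_dies + overhead_mm2_per_die
    compute_cost = (n_compute_dies * cost_per_mm2 * die_area
                    / die_yield(die_area, defect_density_per_mm2))
    base_cost = (cost_per_mm2 * base_area_mm2
                 / die_yield(base_area_mm2, defect_density_per_mm2))
    # One failed bond scraps the whole assembly.
    return (base_cost + compute_cost) / bonding_yield ** n_compute_dies


def best_partition(max_dies, compute_area_mm2, base_area_mm2,
                   defect_density_per_mm2, bonding_yield):
    """Number of compute dies (1..max_dies) with the lowest relative cost."""
    return min(
        range(1, max_dies + 1),
        key=lambda n: system_cost(n, compute_area_mm2, base_area_mm2,
                                  defect_density_per_mm2, bonding_yield),
    )


if __name__ == "__main__":
    # Hypothetical areas (mm^2), defect densities (1/mm^2), bonding yields.
    for d, y in [(0.001, 0.99), (0.005, 0.99), (0.001, 0.90)]:
        n = best_partition(8, 200.0, 100.0, d, y)
        print(f"defect density={d}, bonding yield={y}: best n = {n}")
```

With these illustrative values, raising the defect density moves the cheapest configuration toward more compute dies, and lowering the bonding yield moves it toward fewer. This matches the qualitative trend reported above. The specific optimum depends entirely on the assumed areas and overhead.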