With mobile data traffic growing geometrically, massive multiple-input multiple-output (MIMO) is regarded as a key technology for next-generation wireless networks. Compared with conventional MIMO, it improves spectral efficiency, link reliability, throughput, and beamforming gain. However, a larger number of antennas is accompanied by exponential growth in computational complexity. Minimum mean square error (MMSE) detection can be implemented linearly and iteratively and approaches maximum likelihood (ML) performance, but inverting the Gram matrix costs O(N_t^3), where N_t is the number of transmit antennas, i.e. the number of uplink users. As the number of users grows, hardware implementation becomes increasingly difficult. Hardware architectures for the 128×8 Rayleigh fading channel (128 antennas on the downlink side and 8 on the uplink side) are by now mature in the literature, but their algorithms generally cannot handle more uplink users. This thesis therefore proposes a new algorithm and architecture that targets the 128×32 channel.
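For reference, the MMSE estimate referred to above can be written in its standard textbook form. The symbols are assumptions rather than notation taken from the thesis: H is the N_r × N_t channel matrix (N_r = 128 here), y is the received vector, σ² is the noise variance, and the transmitted symbols are assumed to have unit energy.

```latex
\hat{\mathbf{x}}_{\mathrm{MMSE}} = \mathbf{A}^{-1}\,\mathbf{H}^{H}\mathbf{y},
\qquad
\mathbf{A} = \mathbf{G} + \sigma^{2}\mathbf{I}_{N_t},
\qquad
\mathbf{G} = \mathbf{H}^{H}\mathbf{H}
```

G is the N_t × N_t Gram matrix, and explicitly inverting A is the O(N_t^3) step that iterative methods try to avoid.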
First, an accelerated weighted Neumann series expansion is used as a pre-iteration step to obtain a better initial value. Second, the Refinement of Jacobi algorithm, adjusted by a relaxation factor, achieves near-MMSE performance with only two iterations. Third, a dual systolic array is used to achieve a high convergence rate and high hardware efficiency. Because matrices are reused within the algorithm and the Gram matrix is symmetric, only the lower triangular part of the Gram matrix needs to be computed, which greatly reduces hardware resources. Without pipelining, the architecture needs 396 clock cycles to produce one complete output. To increase throughput, a three-stage pipeline is adopted so that each stage takes only 132 clock cycles and a new input can be accepted every 132 cycles. Finally, the log-likelihood ratio (LLR) computation exploits the Gray-coded constellation to simplify the calculation of the soft outputs. The chip is implemented in TSMC 40 nm CMOS technology. The core area is 3.04 mm^2, the maximum operating frequency is 510 MHz, the dynamic power consumption is 752 mW, and the throughput reaches 742 Mbps.
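The two algorithmic steps above can be illustrated with a minimal sketch, under several loud assumptions. It uses a real-valued channel with BPSK symbols, a two-term weighted Neumann series for the initial value, and a plain relaxed (damped) Jacobi iteration standing in for the thesis's Refinement of Jacobi, whose exact update is not given in the abstract. All function names, the relaxation factor `omega = 0.7`, and the noise variance are hypothetical choices made for illustration only.

```python
import random


def gram_and_matched_filter(H, y, noise_var):
    """Return A = H^T H + noise_var * I and b = H^T y for a real-valued H."""
    n_r, n_t = len(H), len(H[0])
    A = [[0.0] * n_t for _ in range(n_t)]
    for i in range(n_t):
        for j in range(i + 1):  # compute the lower triangle, mirror by symmetry
            s = sum(H[r][i] * H[r][j] for r in range(n_r))
            A[i][j] = s
            A[j][i] = s
        A[i][i] += noise_var
    b = [sum(H[r][i] * y[r] for r in range(n_r)) for i in range(n_t)]
    return A, b


def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def residual_norm(A, x, b):
    """Euclidean norm of b - A x."""
    return sum((bi - ai) ** 2 for bi, ai in zip(b, matvec(A, x))) ** 0.5


def neumann_initial(A, b, omega):
    """Two-term weighted Neumann estimate of A^-1 b (assumed form):
    x0 = t0 + (I - omega * D^-1 A) t0, with t0 = omega * D^-1 b, D = diag(A)."""
    d_inv = [1.0 / A[i][i] for i in range(len(A))]
    t0 = [omega * d * bi for d, bi in zip(d_inv, b)]
    At0 = matvec(A, t0)
    t1 = [t - omega * d * a for t, d, a in zip(t0, d_inv, At0)]
    return [u + v for u, v in zip(t0, t1)]


def relaxed_jacobi(A, b, x, omega, iterations):
    """Plain relaxed Jacobi: x <- x + omega * D^-1 (b - A x)."""
    d_inv = [1.0 / A[i][i] for i in range(len(A))]
    for _ in range(iterations):
        Ax = matvec(A, x)
        x = [xi + omega * d * (bi - ai)
             for xi, d, bi, ai in zip(x, d_inv, b, Ax)]
    return x


def demo(n_r=128, n_t=32, noise_var=0.1, omega=0.7, iterations=30, seed=0):
    """Build a random 128x32 real channel and run the two steps on it."""
    rng = random.Random(seed)
    H = [[rng.gauss(0.0, 1.0) for _ in range(n_t)] for _ in range(n_r)]
    x_true = [rng.choice((-1.0, 1.0)) for _ in range(n_t)]  # BPSK symbols
    y = [sum(h * x for h, x in zip(row, x_true)) + rng.gauss(0.0, noise_var ** 0.5)
         for row in H]
    A, b = gram_and_matched_filter(H, y, noise_var)
    x0 = neumann_initial(A, b, omega)
    x_hat = relaxed_jacobi(A, b, x0, omega, iterations)
    return residual_norm(A, x0, b), residual_norm(A, x_hat, b), x_true, x_hat
```

Calling `demo()` returns the residual norm after the Neumann initial value, the residual norm after the Jacobi iterations, the transmitted symbols, and the estimate. The sketch runs 30 iterations by default because this simplified stand-in converges more slowly than the two-iteration method reported in the thesis. A relaxation factor below 1 matters here: at a 128×32 antenna ratio, undamped Jacobi is not guaranteed to converge.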