Virtualization technology is widely used in today's cloud computing datacenters. It allows each physical machine in a datacenter to be logically divided into several virtual machines, each of which can host a different type of software service. However, many kinds of failures can interrupt service and reduce the availability of the whole system. For example, a failed physical machine takes down every virtual machine running on it, and consequently every software service hosted on those virtual machines. Failures are difficult to detect efficiently on a general-purpose computer architecture, because the hardware does not provide enough information for fast failure detection. In contrast, ATCA (Advanced Telecommunications Computing Architecture) physical machines provide high hardware availability and support IPMI (Intelligent Platform Management Interface), which can quickly report the hardware status. To provide a solution for high system availability, in this study we develop a novel failure model and design a symmetric fault-tolerant mechanism that integrates ATCA physical machines with KVM virtualization. The mechanism uses the ATCA hardware to accelerate failure detection, classifies each detected failure, and selects the corresponding recovery procedure. It divides the ATCA physical machines into pairs, such that the two machines of a pair provide fault tolerance for each other. 
Once a failure is detected in the physical machine layer or the virtualization layer, the virtual machines on the failed machine are recovered on the other physical machine of the pair, which reduces the impact of a single point of failure. We compared the proposed fault-tolerant mechanism with a prior VM-based fault-tolerance tool on the same hardware. The results show that the proposed mechanism significantly reduces service downtime; that is, it provides better availability for the software services running on the virtual machines.
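
The pairing and recovery idea described above can be illustrated with a minimal sketch. This is not the implementation from the study: the class and function names (`PhysicalMachine`, `pair_machines`, `recover`), the machine and VM names, and the decision to take the failed host out of service for either failure layer are all assumptions made for illustration. No real IPMI or KVM calls are made; failure detection is assumed to have already happened.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class FailureLayer(Enum):
    """The two layers the abstract names as places where failures are detected."""
    PHYSICAL_MACHINE = "physical machine"
    VIRTUALIZATION = "virtualization"


@dataclass
class PhysicalMachine:
    """Hypothetical model of one ATCA blade and the VMs it hosts."""
    name: str
    vms: List[str] = field(default_factory=list)
    healthy: bool = True
    # The partner is excluded from repr/compare to avoid following the cycle.
    partner: Optional["PhysicalMachine"] = field(
        default=None, repr=False, compare=False
    )


def pair_machines(machines: List[PhysicalMachine]) -> None:
    """Group machines two by two so that each one backs up the other."""
    if len(machines) % 2 != 0:
        raise ValueError("an even number of machines is needed to form pairs")
    for a, b in zip(machines[0::2], machines[1::2]):
        a.partner, b.partner = b, a


def recover(failed: PhysicalMachine, layer: FailureLayer) -> Dict[str, object]:
    """Move every VM of the failed machine to its partner and report the move."""
    partner = failed.partner
    if partner is None or not partner.healthy:
        raise RuntimeError(f"no healthy partner available for {failed.name}")
    # Assumption: a failure in either layer takes the host out of service.
    failed.healthy = False
    moved = list(failed.vms)
    partner.vms.extend(moved)
    failed.vms.clear()
    return {"layer": layer, "target": partner.name, "vms": moved}


if __name__ == "__main__":
    blade_1 = PhysicalMachine("blade-1", ["vm-a", "vm-b"])
    blade_2 = PhysicalMachine("blade-2", ["vm-c"])
    pair_machines([blade_1, blade_2])
    print(recover(blade_1, FailureLayer.PHYSICAL_MACHINE))
```

Because the pairing is symmetric, the same `recover` call works in either direction: whichever machine of a pair fails, its partner becomes the recovery target.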