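In recent years cloud computing has matured, and most enterprises now choose to deploy their services in cloud environments. Because of the scalability and convenience that cloud technology brings, a cloud environment can adjust resources dynamically and at low cost compared with a physical environment, and can make full use of the available machine resources; OpenStack has therefore become a popular choice for building enterprise clouds. Enterprises nevertheless care about uninterrupted service, that is, the high availability (HA) of the cloud, yet OpenStack offers no complete HA mechanism for users' virtual machines. This research first proposes the Software-Defined High Availability Cluster (SDHAC) mechanism, which logically partitions the computing resources into multiple SDHACs and sets an HA policy for each cluster according to its requirements, so that administrators can manage and allocate cloud resources more easily. Building on SDHAC, this research develops an automated failure detection and recovery mechanism for the compute nodes and virtual machines inside a cluster. Besides monitoring the state of the software services on each compute node, the mechanism is combined with IPMI (Intelligent Platform Management Interface) to provide hardware-level monitoring, covering for example the operating system, the power state, and the hardware's internal temperature and voltage sensors. When a failure is detected, a recovery procedure is carried out according to the failure model proposed in this research. Because the proposed HA system incorporates the IPMI interface, it substantially shortens failure detection time and provides a more complete recovery mechanism, improving the high availability of OpenStack for virtual machines.
In recent years, virtualized cloud computing has become increasingly mature. Most enterprises decide to deploy their services on a virtualized cloud platform because of its elasticity and manageability. Compared to traditional computing platforms, a virtualized cloud platform can automatically adjust its computing resources in response to changes in users’ requirements. OpenStack is a popular virtualized cloud computing project that facilitates building such a cloud platform, where computations are carried out on virtual machines. In the past, we proposed and implemented a cloud platform that supports the concept of the Software-Defined High Availability Cluster (SDHAC) to address the problem of cloud platform availability and manageability. This mechanism logically divides the computing pool into multiple HA clusters, and administrators can apply different HA policies to different software-defined HA clusters according to their demands. This research focuses on fast failure detection and recovery on a platform with Software-Defined High Availability Clusters. The proposed system supports machines equipped with IPMI, an interface that allows fast detection of hardware state, and can therefore efficiently identify the root cause of a failure.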
In addition, when IPMI is used, the proposed system provides a complete set of recovery features, such as VM recovery and machine recovery. Our experimental results show that the proposed system with IPMI machines achieves higher availability than a traditional system that relies on heartbeat-based failure detection.
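
The idea of identifying the root cause of a failure by checking hardware state as well as software state can be illustrated with a minimal sketch. It is not the system described in this work: the `NodeStatus` fields, the `Failure` categories, the order of the checks, and the recovery actions in `RECOVERY` are all hypothetical names and choices made for illustration, and no real IPMI or OpenStack call is made.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    """Hypothetical failure categories, loosely following the layers named in the abstract."""
    NONE = "none"
    SERVICE = "service"  # compute service down, operating system still running
    OS = "os"            # operating system hung or crashed, power still on
    POWER = "power"      # machine is powered off
    SENSOR = "sensor"    # temperature or voltage reading out of range


@dataclass
class NodeStatus:
    """Snapshot of one compute node (assumed to be gathered elsewhere, e.g. via IPMI)."""
    power_on: bool
    sensors_ok: bool
    os_running: bool
    service_running: bool


def classify(status: NodeStatus) -> Failure:
    """Check from the lowest layer upward so the deepest failing layer is reported."""
    if not status.power_on:
        return Failure.POWER
    if not status.sensors_ok:
        return Failure.SENSOR
    if not status.os_running:
        return Failure.OS
    if not status.service_running:
        return Failure.SERVICE
    return Failure.NONE


# Illustrative mapping only; the actual recovery procedures are defined by the failure model.
RECOVERY = {
    Failure.NONE: "no action",
    Failure.SERVICE: "restart the service",
    Failure.OS: "reboot the node, then recover its VMs",
    Failure.POWER: "power on the node, then recover its VMs",
    Failure.SENSOR: "move VMs to another node, then shut the node down",
}


def plan_recovery(status: NodeStatus) -> str:
    """Return the (hypothetical) recovery action for a node snapshot."""
    return RECOVERY[classify(status)]
```

A heartbeat alone can only report that a node stopped answering. In this sketch the extra fields let the caller tell a powered-off machine from a crashed service, which is why the checks run from hardware upward.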