Cloud computing has matured in recent years, and most enterprises now deploy their services in cloud environments. Because of its scalability and convenience, a cloud environment can adjust and manage computing resources more dynamically and efficiently than a physical data center. As the open-source cloud platform OpenStack continues to release more complete versions, it has gradually become one of the options enterprises choose for building a private cloud platform.

Enterprises that deploy their business on a cloud platform serve their clients through virtual machines, the platform's computing units, so high availability (HA) of the cloud platform is important for keeping those services uninterrupted. However, the HA mechanisms of OpenStack protect only the services on the controller node, and protection for virtual machines is incomplete. This study therefore proposes a Software-Defined High Availability Cluster (SDHAC) mechanism that automatically detects and recovers from failures of the HA-protected virtual machines within a cluster. The detection mechanism uses the libvirt API to monitor virtual machine events in real time, and the recovery mechanism uses the OpenStack API, through OpenStack's virtual machine management service, to recover failed virtual machines.
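The event-driven detection and recovery described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the event names, the recovery actions, and the policy table are all hypothetical stand-ins, and the sketch makes no calls to libvirt or OpenStack. It only shows how a lifecycle event on a protected virtual machine might be mapped to a recovery action.

```python
from enum import Enum


class VmEvent(Enum):
    """Simplified stand-ins for hypervisor lifecycle events (hypothetical names)."""
    STARTED = "started"
    STOPPED = "stopped"
    CRASHED = "crashed"


class RecoveryAction(Enum):
    """Hypothetical actions a controller might request from the cloud API."""
    NONE = "none"
    START = "start"
    HARD_REBOOT = "hard_reboot"


# Assumed policy: the abstract does not state which event triggers which action.
POLICY = {
    VmEvent.STARTED: RecoveryAction.NONE,
    VmEvent.STOPPED: RecoveryAction.START,
    VmEvent.CRASHED: RecoveryAction.HARD_REBOOT,
}


def choose_recovery(event, vm_id, protected_vms):
    """Return the recovery action for an event; VMs outside the HA set are ignored."""
    if vm_id not in protected_vms:
        return RecoveryAction.NONE
    return POLICY.get(event, RecoveryAction.NONE)


if __name__ == "__main__":
    protected = {"vm-1", "vm-2"}
    for vm, ev in [("vm-1", VmEvent.CRASHED), ("vm-3", VmEvent.STOPPED)]:
        print(vm, ev.value, "->", choose_recovery(ev, vm, protected).value)
```

A real system would also need to tell an intentional shutdown by the user apart from a failure, which this sketch does not attempt.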
With these mechanisms in place, virtual machines keep running on their compute nodes, and users do not need to repair virtual machine failures manually.

To prevent virtual machine service outages caused by hardware or software faults on the compute node that hosts them, this study also integrates the Intelligent Platform Management Interface (IPMI) into the detection and recovery mechanism. Sensor information read through IPMI allows the state of each compute node to be monitored in real time. If a node's sensor readings become critical, the system immediately live-migrates its virtual machines so that a node error does not interrupt their services. If a compute node fails unexpectedly, its HA virtual machines are failed over to another healthy compute node in the HA cluster, and the abnormal node is then diagnosed and recovered. This improves the high availability that OpenStack provides for virtual machines.
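The node-level decision described above (live migration when sensors turn critical, failover when a node has already failed) can be sketched as follows. This is a minimal illustration under stated assumptions, not the system built in this study: the `NodeReport` fields, the single temperature sensor, and the 85 °C threshold are all hypothetical, and no IPMI or OpenStack calls are made.

```python
from dataclasses import dataclass


@dataclass
class NodeReport:
    """Snapshot of one compute node (all fields are hypothetical)."""
    name: str
    reachable: bool        # does the node still respond?
    temperature_c: float   # example sensor reading obtained out of band


# Assumed threshold; real limits would come from the hardware's sensor data.
CRITICAL_TEMP_C = 85.0


def plan_for_node(report):
    """Choose a node-level action from the report.

    unreachable node -> "failover" (restart its VMs elsewhere, then recover the node)
    critical sensor  -> "live_migrate" (move VMs away before the node fails)
    otherwise        -> "none"
    """
    if not report.reachable:
        return "failover"
    if report.temperature_c >= CRITICAL_TEMP_C:
        return "live_migrate"
    return "none"


def pick_target(reports, source):
    """Pick the first healthy node other than the source, or None if there is none."""
    for r in reports:
        if r.name != source and plan_for_node(r) == "none":
            return r.name
    return None


if __name__ == "__main__":
    cluster = [
        NodeReport("node-a", True, 90.0),
        NodeReport("node-b", True, 40.0),
        NodeReport("node-c", False, 0.0),
    ]
    for node in cluster:
        action = plan_for_node(node)
        target = pick_target(cluster, node.name) if action != "none" else None
        print(node.name, action, target)
```

Reachability is checked before the sensor reading because a node that no longer responds cannot take part in a live migration, so failover is the only remaining option.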