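As cloud computing technology has become widespread, the scale and cost of service interruptions caused by outages and similar events have grown with it. In recent years, companies have actively adopted fault-tolerance techniques to ensure service availability and thereby reduce losses. In this dissertation, we propose a reliable and efficient high availability (HA) mechanism to improve the availability of cloud computing systems; the mechanism comprises online fault detection and recovery methods. Our research has three main parts: (1) an efficient online fault detection mechanism, (2) an adaptive mechanism that improves the reliability of the proposed fault-tolerance mechanism, and (3) a virtual machine (VM) boot time prediction model. The proposed fault detection mechanism detects the liveness of the system in real time to improve availability. Because a cloud computing system can be viewed as a multi-layer system with linear dependencies between layers, we achieve fast fault detection by integrating multiple existing detectors with a specially designed fault model. However, because the mechanism relies on the results of multiple detectors, it cannot be applied when a detector fails or is not supported. To improve its reliability and broaden the scenarios in which it applies, we reconstruct the fault model and have the fault-tolerance mechanism automatically adjust its fault detection and recovery mechanisms based on the new model. Finally, for recovery we adopt the most common method: restarting the VM. Because a VM restart can itself fail, service downtime is prolonged and availability is reduced if the fault-tolerance mechanism cannot quickly detect the failure and act immediately. We therefore propose a VM boot time prediction model whose predictions let us judge whether a VM has been restarted correctly. In this work we propose one rule-based and one machine-learning-based prediction model. According to the experimental results, the proposed fault detection mechanism reduces detection time by 70.3% compared with the existing system-layer heartbeat method. In addition, the proposed adaptive mechanism keeps the fault-tolerance mechanism operating normally when detectors fail. Finally, both proposed prediction models achieve an accuracy above 90%. Given the amount of data the machine-learning-based model requires, we recommend using it in small cloud systems or data centers, whereas the rule-based model can be applied in large cloud systems or data centers. Notably, our experiments show that in some cases using the prediction model to decide VM placement can reduce recovery time, which is a direction for future research.

With the widespread use of cloud computing technology, service interruptions and the losses they cause, due to faults such as power outages, have increased. Many companies have introduced high availability (HA) mechanisms to ensure the continuous operation of their services. This dissertation proposes a reliable and efficient HA mechanism to improve the availability of cloud computing systems, which includes an online liveness fault detection and recovery method. It addresses three research topics: (i) an efficient online liveness fault detection mechanism, (ii) an adaptive mechanism to improve the reliability of the proposed HA mechanism, and (iii) a VM boot time prediction model. Since a cloud system can be abstracted as a multi-layer system with linear layer dependency, the proposed fault detection mechanism integrates multiple existing detectors and uses a purpose-built fault model to improve its efficiency. However, since the proposed HA mechanism relies on multiple detectors, it cannot be used when a detector fails or is not supported.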
To improve the reliability of the mechanism, we propose an adaptive mechanism that reconstructs the fault model and then automatically adjusts the fault detection and recovery mechanisms based on the new model. Since a common recovery method, virtual machine (VM) restart, may itself fail, we propose a rule-based model and a machine-learning-based model to predict VM boot time, which can be used to judge whether a VM restart has succeeded. According to the experimental results, the detection time of the proposed fault detection mechanism is 70.3% shorter than that of the system-layer heartbeat method. The proposed adaptive mechanism enables the HA mechanism to keep working even when some detectors are disabled. Furthermore, both proposed prediction models achieve over 90% accuracy. Our experimental results also indicate that, in some cases, recovery time can be reduced by using the prediction model to generate VM placement strategies, which may be a topic for future research.
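
To make the layered-detection and restart-check ideas concrete, the following is a minimal sketch, not the dissertation's implementation. It assumes a hypothetical five-layer stack (`LAYERS`), detectors that each report alive, dead, or unavailable, and a hypothetical tolerance factor `margin` around the predicted boot time; none of these names or values come from the abstract.

```python
from typing import Dict, List, Optional

# Hypothetical layers, lowest first; each layer depends on the one below it.
LAYERS = ["host_power", "host_os", "hypervisor", "vm", "guest_os"]


def locate_fault(results: Dict[str, Optional[bool]]) -> List[str]:
    """Return the layers that may hold the root fault (empty if none found).

    `results` maps a layer to True (alive), False (dead), or None
    (detector failed or unsupported). Under a linear dependency, a live
    layer implies that every layer below it is alive as well.
    """
    highest_alive = -1
    for i, layer in enumerate(LAYERS):
        if results.get(layer) is True:
            highest_alive = i

    candidates: List[str] = []
    for layer in LAYERS[highest_alive + 1:]:
        candidates.append(layer)
        if results.get(layer) is False:
            # The fault lies between the last live layer and this dead one.
            return candidates
    # No detector reported a failure, so nothing can be concluded.
    return []


def restart_failed(elapsed_s: float, predicted_boot_s: float,
                   margin: float = 1.5) -> bool:
    """Flag a restart as failed once it exceeds the predicted boot time
    by the (assumed) tolerance factor `margin`."""
    return elapsed_s > predicted_boot_s * margin
```

When the hypervisor detector is unavailable (`None`) but the host OS is alive and the VM is reported dead, `locate_fault` returns both `hypervisor` and `vm` as candidates. This is one simple way a mechanism could keep operating with a missing detector, at the cost of a coarser diagnosis.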