Subproject 3: High Availability Protection for Cloud Services and Applications on Heterogeneous Intelligent Computing Platform

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/78647

Title:	Subproject 3: High Availability Protection for Cloud Services and Applications on Heterogeneous Intelligent Computing Platform
Authors:	王尉任;陳奕明;梁德容
Contributor:	Department of Computer Science and Information Engineering, National Central University
Keywords:	Cloud Platform;Intelligent Platform Management Interface;High Availability;Failover;Virtual Machine;Heterogeneity;Multi-Layered System;Failure Root Cause Detection
Date:	2018-12-19
Upload time:	2018-12-20 13:41:50 (UTC+8)
Publisher:	Ministry of Science and Technology
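Abstract:	Commercial intelligent computing cloud platforms typically use the Intelligent Platform Management Interface (IPMI) to monitor operation at the physical machine level, and install multiple layers of middleware (such as virtual machines) on the physical machines to support computing needs while making the platform easier to manage. These platforms are not necessarily built from compute nodes of a homogeneous architecture; they may consist of heterogeneous compute nodes (such as general-purpose servers and GPU servers) to meet different computing needs. Such platforms place great emphasis on High Availability (HA) for user applications: software automatically detects runtime failures and recovers from them quickly to protect the execution of user applications, which greatly reduces service downtime and the need for human intervention. However, current HA solutions rely solely on a heartbeating mechanism, treat the physical machine and operating system as a black box, and perform failure detection only on the target software. This approach is overly sensitive to the heartbeat waiting time, which can lead to misjudgment or excessively long detection time. This project studies the theory and develops the technology of HA for heterogeneous intelligent computing cloud platforms, and provides an HA solution that addresses their multi-layered nature and heterogeneity. It is Subproject 3, "High Availability Protection for Cloud Services and Applications on Heterogeneous Intelligent Computing Platform", of the three-year integrated project 「行動應用之惡意行為分析技術研發及平台建置研究」 (research and development of malicious behavior analysis techniques for mobile applications, and platform construction). The other two subprojects aim to provide malware detection and user authentication services, both of which need a highly available heterogeneous intelligent computing cloud platform (using GPU machines or virtual machines) for their computation. In the first year, the project will exploit the failure dependencies among subsystems that arise from the multi-layered nature of cloud platforms to design a new failure root cause detection procedure and theory, which is expected to greatly reduce both the reliance on heartbeating and the long diagnosis time caused by having too many subsystem layers. In the second year, we will extend the technique to the virtual machine layer, developing VM-level detection and recovery methods and a self-diagnosis function for system failures. In the third year, we will develop customizable failure detection procedures and GPU failure detection and recovery methods to address heterogeneity. The approach rests on the observation that heterogeneous compute nodes still share most of their layers, so swapping out a subset of failure types is enough to switch between two failure sets without affecting the layered structure.
A heterogeneous intelligent cloud platform refers to a multi-layered, virtualized cloud system running on top of a set of heterogeneous compute nodes, where each node supports the Intelligent Platform Management Interface (IPMI) for remote monitoring and control. When the platform is applied to industrial or commercial use, it must support High Availability (HA) to protect users' services and applications from unexpected failures. Unfortunately, existing HA solutions consider neither the multi-layered structure nor the heterogeneity of such platforms. Instead, they usually treat each compute node as a single-layered system and rely on a heartbeating mechanism to detect the liveness of critical services and applications. The major problem with this approach is that the heartbeating mechanism cannot identify the root cause of a failure. In addition, it is very sensitive to a predefined heartbeat waiting time, which may result in long detection times or wrong decisions. In this proposal, we aim to develop a novel HA theory for critical cloud services and applications running on a heterogeneous intelligent cloud platform.
We plan to carry out this project over a three-year period. In the first year, we will develop a new HA theory for multi-layered platforms and design an efficient method to detect the root cause of a failure on a multi-layered system. We will also implement the design on OpenStack. In the second year, we will apply our new HA design to the virtualization layer of the platforms. In the third year, we will develop a configurable detection and recovery mechanism for heterogeneous multi-layered platforms, in particular for compute nodes with or without GPUs.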
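The abstract does not spell out the detection procedure, so the following is only a minimal sketch of the general idea it describes: probing layers bottom-up and reporting the lowest failed layer as the root cause, with a node-specific layer swapped in for GPU nodes. The layer names, the position of the GPU layer, the `status` dictionary standing in for real probes, and the functions `build_layers` and `find_root_cause` are all assumptions made for illustration, not the project's design.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A layer is a (name, probe) pair; the probe returns True when the layer is healthy.
Layer = Tuple[str, Callable[[], bool]]

# Hypothetical layers shared by every node type, ordered bottom-up.
COMMON_LAYERS = ["hardware", "os", "hypervisor", "vm", "service"]


def build_layers(status: Dict[str, bool], has_gpu: bool) -> List[Layer]:
    """Build the bottom-up layer list for one node.

    `status` stands in for real probes (e.g. an IPMI query for "hardware");
    a layer missing from `status` is treated as failed.
    """
    names = list(COMMON_LAYERS)
    if has_gpu:
        # Assumed position: the GPU driver sits between the OS and the hypervisor.
        names.insert(names.index("hypervisor"), "gpu_driver")
    return [(name, lambda name=name: status.get(name, False)) for name in names]


def find_root_cause(layers: List[Layer]) -> Optional[str]:
    """Return the lowest failed layer, or None if every layer is healthy."""
    for name, probe in layers:
        if not probe():
            return name  # layers above depend on this one, so stop here
    return None


# An OS crash takes down everything above it; the diagnosis stops at "os".
status = {"hardware": True, "os": False, "hypervisor": False,
          "vm": False, "service": False}
print(find_root_cause(build_layers(status, has_gpu=False)))
```

Because the GPU and non-GPU nodes share most of the layer list, switching between them only changes one entry, which mirrors the abstract's point that swapping part of the failure set leaves the layered structure intact.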
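Relation:	Science & Technology Policy Research and Information Center, National Applied Research Laboratories
Appears in collections:	[Department of Computer Science and Information Engineering] Research Projects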

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	222

All items in NCUIR are protected by original copyright.
