Abstract: With the continuous advancement of edge computing and artificial intelligence, AI inference services are increasingly deployed in real-time interactive scenarios such as augmented reality glasses, intelligent surveillance, and wearable devices. In edge environments, however, resources are limited and unexpected node failures or network issues occur frequently, and the fault-tolerance and autoscaling mechanisms of existing platforms such as Kubernetes often fail to respond to such incidents in a timely manner. The result is service interruptions or frame drops that degrade the user experience on terminal devices. To address these challenges, this study proposes an AI inference service system with high availability and real-time guarantees. Built on top of Kubernetes, the system incorporates NVIDIA GPU Time-slicing and adopts a subscription-based service access mechanism. Through the coordinated operation of its system state monitoring, fault handling, frequency adjustment, and service deployment modules, the system dynamically adjusts agent transmission rates and AI service deployment strategies so that terminal devices maintain an FPS above the minimum threshold even under abnormal conditions.

This study designs four experimental scenarios commonly encountered in edge environments, namely node failure, network isolation, inference service crash, and a sudden surge in terminal device connections, and compares the proposed method with the native Kubernetes mechanisms. Additionally, to validate the system's performance under stricter quality-of-service (QoS) requirements, further experiments with raised FPS thresholds evaluate its stability and adaptability. Experimental results show that, within a 10-minute observation window, the proposed system increases the available time of the pose estimation service in the node failure scenario from 41.17% under native Kubernetes to 91.67%; in the service crash scenario, from 30% to 92.33%; and under high-concurrency load, from 6% to 79.83%. After the FPS thresholds were raised, the system's overall availability did not decline significantly; the main difference lies in the number of agents that can be served simultaneously. Specifically, in the node failure and network isolation scenarios, the pose estimation service, constrained by the higher FPS threshold and limited inference resources, went from supporting three agents to two. In the agent surge scenario, the gesture recognition service, owing to deployment strategy considerations and load limits, could serve at most four agents concurrently, down from the original seven.
Despite serving fewer agents, the system maintained high-quality, stable service through its frequency adjustment strategy. Overall, the proposed method significantly enhances the fault tolerance and service quality of AI inference services in edge environments.
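The abstract does not spell out the frequency-adjustment algorithm, but its observed behavior (raising the FPS floor trades away the number of concurrently served agents) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function `adjust_rates`, its parameters, and the agent-shedding policy are hypothetical and not taken from the thesis.

```python
"""Minimal sketch of a frequency-adjustment policy (hypothetical, not the
thesis implementation): redistribute the inference service's measured
throughput so that every served agent stays at or above a minimum FPS."""


def adjust_rates(capacity_fps: float, agents: list[str],
                 fps_min: float, fps_max: float) -> dict[str, float]:
    """Return a per-agent target transmission rate in frames per second.

    capacity_fps is the throughput the inference service currently
    sustains, e.g. as measured by a monitoring module over a sliding
    window. Agents that cannot be kept above fps_min are shed so that
    the remaining subscribers keep an acceptable quality of service.
    """
    served = list(agents)
    # Shed agents until the fair share clears the FPS floor.
    while served and capacity_fps / len(served) < fps_min:
        served.pop()  # illustrative policy: drop the newest subscriber
    fair_share = min(capacity_fps / len(served), fps_max) if served else 0.0
    return {a: (fair_share if a in served else 0.0) for a in agents}


if __name__ == "__main__":
    # A service sustaining 50 FPS with a 15 FPS floor can keep only three
    # of four agents (50 / 4 = 12.5 < 15), each at about 16.7 FPS, which
    # mirrors the abstract's trade-off between FPS threshold and agent count.
    print(adjust_rates(50.0, ["a1", "a2", "a3", "a4"],
                       fps_min=15.0, fps_max=30.0))
```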
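For context on the GPU-sharing mechanism named above: NVIDIA GPU time-slicing on Kubernetes is typically enabled through a device-plugin configuration that oversubscribes each physical GPU. The ConfigMap below is a minimal sketch in the format used by the NVIDIA GPU Operator; the name, namespace, and replica count are illustrative, not values from the thesis.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # illustrative name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # one physical GPU advertised as 4 schedulable GPUs
```

With `replicas: 4`, a node with a single GPU advertises four `nvidia.com/gpu` resources, so several inference Pods can be co-scheduled and share the GPU in time slices. This is what allows a system like the one described to pack multiple AI services onto limited edge hardware.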