Master's/Doctoral Thesis 111522119: Full Metadata Record

DC Field / Value / Language
dc.contributor 資訊工程學系 zh_TW
dc.creator 洪仁傑 zh_TW
dc.creator Ren-Jie Hong en_US
dc.date.accessioned 2024-07-24T07:39:07Z
dc.date.available 2024-07-24T07:39:07Z
dc.date.issued 2024
dc.identifier.uri http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=111522119
dc.contributor.department 資訊工程學系 zh_TW
dc.description 國立中央大學 zh_TW
dc.description National Central University en_US
dc.description.abstract 隨著雲端與人工智慧技術快速發展,許多企業採用 MLOps 的方法來優化及管理機器學習專案,以減少手動操作所帶來的風險。目前主流的方法中,能使用 Apache Airflow 組建容器任務,來達成自動化的 ML Pipeline。Airflow 的容錯機制中,恢復中斷任務的方法是將任務重新啟動。換言之,任務的「狀態」並不會被保存。然而雲端平台中,難以確保中斷事件不會發生,導致長時間的訓練任務無法受到良好的保護。一旦長時間訓練的模型遭遇任務中斷,會需要耗費大量的時間重新執行,此狀況會造成 Pipeline 後續任務的延宕,以及壓縮叢集的資源。為了解決上述問題,本研究將目標放在探討 ML Pipeline 在 Airflow 系統中,執行 PyTorch 訓練任務的狀態保護,在發生中斷後恢復任務狀態。本研究提出結合 Airflow 的容錯機制以及 PyTorch 的檢查點功能,實作出在 GPU 叢集中以檢查點恢復的方法,保護訓練任務的狀態不會因重新啟動而消失。此方法能讓「舊任務的狀態」對應到「重啟的新計算資源」,並通過 Checkpoint Hook Function 來恢復任務狀態。實驗中,訓練任務使用 ResNet18 與 ImageNet-1k 資料集,任務訓練 31 個回合。這項任務在 Airflow 的七次平均的總執行時間約為 1234.33 分鐘,而使用 Checkpoint Hook Function 則增加 15.51 分鐘,平均增加約 1.25% 的時間成本。在不同時間點發生中斷事件,七次平均的總執行時間為 1862.52 分鐘,而使用 Checkpoint Hook Function 則減少 592.02 分鐘,平均縮短約 31.79% 的執行時間。 zh_TW
dc.description.abstract With the rapid development of cloud and artificial intelligence technologies, many enterprises are adopting MLOps approaches to optimize and manage machine learning projects, thereby reducing the risks associated with manual operations. In current mainstream practice, Apache Airflow is used to assemble containerized tasks into an automated ML pipeline. However, Airflow's fault-tolerance mechanism recovers an interrupted task by simply restarting it, meaning the task "state" is not preserved. Since interruption events can hardly be ruled out on cloud platforms, long-duration model training is left inadequately protected. If a long-running training task is interrupted, a significant amount of time is required to re-execute it, delaying subsequent pipeline tasks and squeezing cluster resources. To address these issues, this study focuses on protecting the state of PyTorch training tasks in ML pipelines running on Airflow, and on recovering that state after an interruption. This research proposes a solution that combines Airflow's fault-tolerance mechanism with PyTorch's checkpoint facility. By implementing a checkpoint-based recovery method in a GPU cluster, the training task's state is preserved even after a restart. The method maps the old task's state onto the newly allocated computing resources of the restarted task, recovering the task state via a Checkpoint Hook Function. In the experiments, the training task uses ResNet18 with the ImageNet-1K dataset for 31 epochs. This task had an average total execution time on Airflow of approximately 1234.33 minutes over seven runs; using the Checkpoint Hook Function added 15.51 minutes, an average time overhead of about 1.25%. With interruption events injected at different points in time, the average total execution time over seven runs was 1862.52 minutes, while using the Checkpoint Hook Function reduced this by 592.02 minutes, shortening execution time by an average of approximately 31.79%. en_US
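The checkpoint-recovery idea described in the abstract can be sketched in plain Python. This is a minimal illustration only, not the thesis's implementation: it uses the standard library instead of Airflow and PyTorch, and all names here (`train_with_checkpoint`, `fail_at`, the JSON checkpoint format) are hypothetical. The hook restores the last saved epoch on start-up, so a restarted task resumes where the interrupted one left off rather than from epoch 0.

```python
import json
import os
import tempfile

def train_with_checkpoint(total_epochs, ckpt_path, fail_at=None):
    """Toy training loop with a checkpoint hook.

    On start-up, the hook restores the last completed epoch from the
    checkpoint file (if one exists); after each epoch it persists the
    state, so a restart re-executes only the unfinished epochs.
    Returns the epoch the (re)started task actually resumed from.
    """
    start = 0
    if os.path.exists(ckpt_path):            # hook: restore saved state
        with open(ckpt_path) as f:
            start = json.load(f)["epoch"] + 1
    for epoch in range(start, total_epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError(f"interrupted at epoch {epoch}")
        # ... one epoch of real training would run here ...
        with open(ckpt_path, "w") as f:      # hook: save state per epoch
            json.dump({"epoch": epoch}, f)
    return start

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train_with_checkpoint(31, ckpt, fail_at=20)  # simulated interruption
except RuntimeError:
    pass
resumed_from = train_with_checkpoint(31, ckpt)   # Airflow-style restart
print(resumed_from)  # → 20: epochs 0-19 are not re-executed
```

In a real PyTorch task the saved state would also include the model and optimizer state dicts (via `torch.save`/`torch.load`), and the restart would be triggered by Airflow's retry mechanism rather than a second manual call.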
dc.subject MLOps zh_TW
dc.subject 機器學習管道 zh_TW
dc.subject 錯誤恢復 zh_TW
dc.subject PyTorch zh_TW
dc.subject Apache Airflow zh_TW
dc.subject 工作流程管理系統 zh_TW
dc.subject MLOps en_US
dc.subject ML Pipeline en_US
dc.subject Fault Recovery en_US
dc.subject PyTorch en_US
dc.subject Apache Airflow en_US
dc.subject Workflow System en_US
dc.title 基於 Airflow 工作流程管理系統之加速 ML 訓練工作錯誤復原機制 zh_TW
dc.title Efficient Fault Recovery Mechanism for ML Training Task Based on Airflow en_US
dc.language.iso zh-TW zh-TW
dc.type 博碩士論文 zh_TW
dc.type thesis en_US
dc.publisher National Central University en_US
