NCU Institutional Repository (中大機構典藏): Item 987654321/95594


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/95594


    Title: Efficient Fault Recovery Mechanism for ML Training Task Based on Airflow (基於 Airflow 工作流程管理系統之加速 ML 訓練工作錯誤復原機制)
    Author: Hong, Ren-Jie (洪仁傑)
    Contributor: Department of Computer Science and Information Engineering
    Keywords: MLOps; ML Pipeline; Fault Recovery; PyTorch; Apache Airflow; Workflow Management System
    Date: 2024-07-24
    Date Uploaded: 2024-10-09 17:05:14 (UTC+8)
    Publisher: National Central University
    Abstract: With the rapid development of cloud and artificial intelligence technologies, many enterprises adopt MLOps practices to optimize and manage machine learning projects, reducing the risks associated with manual operation. A common approach is to use Apache Airflow to assemble containerized tasks into an automated ML pipeline. In Airflow's fault-tolerance mechanism, however, an interrupted task is simply restarted, meaning the task's "state" is not preserved. Since interruptions can hardly be ruled out on cloud platforms, long-running training tasks are poorly protected: once a long training job is interrupted, a large amount of time is needed to re-execute it, delaying subsequent pipeline tasks and squeezing cluster resources. To address this, this study focuses on protecting the state of PyTorch training tasks in an Airflow-managed ML pipeline and recovering that state after an interruption. It proposes combining Airflow's fault-tolerance mechanism with PyTorch's checkpoint facility, implementing a checkpoint-based recovery method on a GPU cluster so that a training task's state is not lost when the task is restarted. The method maps the state of the old task onto the newly allocated computational resources of the restarted task and restores that state through a Checkpoint Hook Function. In the experiments, the training task trains ResNet18 on the ImageNet-1k dataset for 31 epochs. Averaged over seven runs, the total execution time in Airflow is about 1234.33 minutes; the Checkpoint Hook Function adds 15.51 minutes, an average overhead of about 1.25%. With interruptions injected at different points in time, the average total execution time over seven runs is 1862.52 minutes, and the Checkpoint Hook Function reduces it by 592.02 minutes, shortening execution time by about 31.79% on average.
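
    Below is a minimal sketch of the checkpoint-and-resume idea described in the abstract: a PyTorch training loop that saves its state once per epoch and, when Airflow restarts the task, resumes from the last completed epoch instead of epoch 0. This is an illustrative sketch under assumed details, not the thesis's implementation; the checkpoint path, the shared-volume layout, the name checkpoint_hook, and the optimizer settings are assumptions made here for illustration.

    import os

    import torch
    from torchvision.models import resnet18

    # Assumed location on shared storage that survives a task restart
    # (e.g. a volume mounted into every attempt of the Airflow task).
    CKPT_PATH = "/mnt/shared/resnet18_ckpt.pt"

    def checkpoint_hook(model, optimizer):
        """Restore the latest checkpoint, if any, and return the epoch to resume from."""
        if os.path.exists(CKPT_PATH):
            state = torch.load(CKPT_PATH, map_location="cpu")
            model.load_state_dict(state["model"])
            optimizer.load_state_dict(state["optimizer"])
            return state["epoch"] + 1
        return 0  # no checkpoint yet: start from scratch

    def save_checkpoint(model, optimizer, epoch):
        """Persist model, optimizer, and progress so a restarted task can continue."""
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            CKPT_PATH,
        )

    def train(num_epochs=31):
        model = resnet18(num_classes=1000)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
        start_epoch = checkpoint_hook(model, optimizer)  # resume point after a restart
        for epoch in range(start_epoch, num_epochs):
            ...  # one training epoch over ImageNet-1k (data loading omitted)
            save_checkpoint(model, optimizer, epoch)

    if __name__ == "__main__":
        train()

    On the Airflow side, such a task would typically run with retries configured on its operator, so that a failed attempt is re-launched on newly allocated resources and the hook above picks up the last saved epoch from shared storage.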
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Master's and Doctoral Theses

    Files in This Item:

    File          Description    Size    Format    Views
    index.html                   0Kb     HTML      45


    All items in NCUIR are protected by copyright, with all rights reserved.

