Electronic Theses and Dissertations: Detailed Record 111522119




Name: Ren-Jie Hong (洪仁傑)    Department: Computer Science and Information Engineering
Thesis Title: 基於 Airflow 工作流程管理系統之加速 ML 訓練工作錯誤復原機制
(Efficient Fault Recovery Mechanism for ML Training Task Based on Airflow)
Related Theses
★ 以伸展樹為基礎的Android Binder Driver
★ 應用增量式學習於多種農作物判釋之研究
★ 應用分類重建學習偵測航照圖幅中的新穎坵塊
★ 用於輔助工業零件辨識之尺寸估算系統
★ 使用無紋理之3D CAD工業零件模型結合長度檢測實現細粒度真實工業零件影像分類
★ 一個建立在平行工作系統上的動態全球計算平台
★ 用權重參照計數演算法執行主動物件垃圾收集
★ 一個動態負載平衡之最大可能性估算計算架構
★ 利用多項系統負載資訊進行動態P2P系統重組的策略研究
★ 基於Hadoop系統的雲端應用程式特徵擷取與計算監測架構
★ 適用於大型動態分散式系統的調適性計算模型
★ 一個提供彈性虛擬資料中心的雲端服務平台
★ 雲端彈性虛擬機房服務平台之資源控管中心
★ 一個適用於自動供應雲端系統的動態調適計算架構
★ 線性相關工作與非相關工作的探索式排程策略
★ 適用於大資料集高效率的分散式階層分群演算法
Files
  1. The author has consented to immediate open access for this electronic thesis.
  2. The open-access electronic full text is licensed to users solely for personal, non-commercial retrieval, reading, and printing for the purpose of academic research.
  3. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese) With the rapid development of cloud and artificial intelligence technologies, many enterprises adopt MLOps practices to optimize and manage machine learning projects and to reduce the risks of manual operation. Among current mainstream approaches, Apache Airflow can be used to compose containerized tasks into an automated ML pipeline. In Airflow's fault-tolerance mechanism, an interrupted task is recovered by simply restarting it; in other words, the task's "state" is not preserved. On cloud platforms, however, interruption events can never be ruled out, so long-running training tasks are not well protected. Once a long-running training job is interrupted, a large amount of time is needed to re-execute it, which delays downstream pipeline tasks and squeezes cluster resources. To address this problem, this study focuses on protecting the state of PyTorch training tasks in ML pipelines running on the Airflow system and on restoring that state after an interruption. We propose combining Airflow's fault-tolerance mechanism with PyTorch's checkpoint functionality, and we implement a checkpoint-based recovery method on a GPU cluster so that the state of a training task is not lost when the task is restarted. The method maps the state of the old task onto the new computational resources of the restarted task and restores the task state through a Checkpoint Hook Function. In the experiments, the training task uses ResNet18 with the ImageNet-1k dataset and trains for 31 epochs. The average total execution time of this task in Airflow over seven runs is about 1234.33 minutes; using the Checkpoint Hook Function adds 15.51 minutes, an average time overhead of about 1.25%. When interruptions occur at different points in time, the average total execution time over seven runs is 1862.52 minutes, and the Checkpoint Hook Function reduces it by 592.02 minutes, shortening execution time by about 31.79% on average.
Abstract (English) With the rapid development of cloud and artificial intelligence technologies, many enterprises are adopting MLOps approaches to optimize and manage machine learning projects, thereby reducing the risks associated with manual operations. Apache Airflow can be used to assemble containerized tasks into an automated ML pipeline. However, Apache Airflow's fault-tolerance mechanism recovers interrupted tasks by simply restarting them, meaning the task "state" is not preserved. This leaves long-duration model training on cloud platforms, where interruption events are difficult to rule out, inadequately protected. If a long-running training task is interrupted, a significant amount of time is required to re-execute it, causing delays in subsequent pipeline tasks and straining cluster resources. To address these issues, this study focuses on protecting the state of PyTorch training tasks in an ML pipeline within the Airflow system and on recovering that state after an interruption. We propose a solution that combines Airflow's fault-tolerance mechanism with PyTorch's checkpoint functionality. By implementing a checkpoint-based recovery method on a GPU cluster, the training task's state is preserved even after a restart. The method maps the state of the old task onto the new computational resources of the restarted task and recovers the task state via a Checkpoint Hook Function. In the experiments, the training task uses ResNet18 with the ImageNet-1K dataset for 31 epochs. Without interruption, the task's average total execution time over seven runs was approximately 1234.33 minutes; using the Checkpoint Hook Function added 15.51 minutes, an average overhead of about 1.25%. With interruptions injected at different points in time, the average total execution time over seven runs was 1862.52 minutes, and the Checkpoint Hook Function reduced it by 592.02 minutes, shortening the execution time by an average of approximately 31.79%.
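The following is a minimal sketch, not the thesis's actual implementation, of the checkpoint-resume idea described in the abstract: a pre-training hook restores the latest saved state keyed by a stable task identifier, and a post-epoch hook persists it, so that a task restarted by Airflow's retry mechanism resumes instead of recomputing from scratch. The function names (restore_hook, save_hook), the TRAINING_TASK_ID environment variable, the local ./checkpoints directory, and the tiny stand-in model are illustrative assumptions; in the described system the checkpoint would live in shared storage such as MinIO and the model would be ResNet18.

import os
import torch
import torch.nn as nn


def checkpoint_path(task_id: str, root: str = "./checkpoints") -> str:
    # In the system described above this location would be shared object
    # storage (e.g. a MinIO bucket) reachable from whichever Kubernetes pod
    # Airflow schedules the retried task on; a local directory stands in here.
    os.makedirs(root, exist_ok=True)
    return os.path.join(root, f"{task_id}.pt")


def restore_hook(task_id: str, model, optimizer) -> int:
    # Pre-training hook: if a checkpoint keyed by this task id exists, load it
    # and report the next epoch to run; otherwise start from epoch 0.
    path = checkpoint_path(task_id)
    if os.path.exists(path):
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1
    return 0


def save_hook(task_id: str, model, optimizer, epoch: int) -> None:
    # Post-epoch hook: persist model and optimizer state so a restarted task
    # loses at most one epoch of work.
    torch.save(
        {"epoch": epoch,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        checkpoint_path(task_id),
    )


def train(task_id: str, total_epochs: int = 31) -> None:
    model = nn.Linear(10, 2)                     # stand-in for ResNet18
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    start_epoch = restore_hook(task_id, model, optimizer)
    for epoch in range(start_epoch, total_epochs):
        # ... one epoch of training over the dataset would run here ...
        save_hook(task_id, model, optimizer, epoch)


if __name__ == "__main__":
    # A stable identifier (e.g. derived from the Airflow DAG id and task id)
    # passed through the container environment lets the freshly scheduled pod
    # be recognized as the restart of the interrupted task.
    train(os.environ.get("TRAINING_TASK_ID", "demo-task"))

Keying the checkpoint by an identifier that survives the restart is the essential design point in this sketch: it is what allows the state of the old task to be mapped onto the new computational resources Airflow allocates on retry, as the abstract describes.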
Keywords (Chinese) ★ MLOps
★ Machine Learning Pipeline
★ Fault Recovery
★ PyTorch
★ Apache Airflow
★ Workflow Management System
Keywords (English) ★ MLOps
★ ML Pipeline
★ Fault Recovery
★ PyTorch
★ Apache Airflow
★ Workflow System
Thesis Table of Contents
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iii
List of Tables v
List of Figures v
1. Introduction 1
1-1 Research Background 1
1-2 Research Motivation and Objectives 4
1-3 Contributions and Limitations 6
1-3-1 Limitations 6
1-4 Thesis Organization 7
2. Background 8
2-1 Airflow 8
2-2 Kubernetes 12
2-3 MinIO 15
2-4 MLOps 16
2-5 Hook Function 17
2-6 Existing Training Checkpoints and Related Research Directions 17
2-6-1 Training Reproducibility 18
3. Related Work 19
3-1 Fault Tolerance Mechanism of the Airflow Workflow System 19
3-2 Manual Recovery of Pipeline Tasks 22
3-3 Recovering Workflow and Container Tasks with CRIU 23
3-4 Other Workflow Recovery Methods 23
3-5 Summary and Comparison 25
4. Mechanism and System Architecture Design 27
4-1 System Usage Scenario 27
4-2 System Architecture Overview 28
4-3 Model Saving and Recovery: State Consistency 30
4-4 Identifying Restarted Tasks: Computational Task Consistency 32
4-5 System Logging and Recovery Method: Hook Function 35
4-6 System Workflow 38
5. Experiments 42
5-1 Experimental Setup 42
5-1-1 Task Definition and System Configuration 42
5-1-2 Hardware Configuration 43
5-2 Experimental Design and Result Analysis 44
5-2-1 Cost Comparison of Total Execution Time Without Interruption 45
5-2-2 Comparison of Execution Time with Interruption Recovery 46
5-2-3 Impact on Model Performance Across Different Machines 49
5-3 Analysis and Summary of Results 52
6. Conclusion 53
6-1 Contributions 53
6-2 Future Research Directions 54
References 55
Advisor: Wei-Jen Wang (王尉任)    Date of Approval: 2024-07-24
