References |
[1] A. Munteanu. "What is MLOps?" Canonical Ubuntu. [online], Available: https://ubuntu.com/blog/what-is-mlops (accessed Mar. 18, 2024).
[2] G. Symeonidis, E. Nerantzis, A. Kazakis, and G. A. Papakostas, "MLOps-definitions, tools and challenges," in 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, Jan 26-29, 2022: IEEE, pp. 0453-0460.
[3] M. Steidl, M. Felderer, and R. Ramler, "The pipeline for the continuous development of artificial intelligence models—Current state of research and practice," Journal of Systems and Software, vol. 199, Art. no. 111615, May 2023.
[4] Azure. "What are Azure Machine Learning pipelines?" Microsoft Azure. [online], Available: https://learn.microsoft.com/zh-tw/azure/machine-learning/concept-ml-pipelines?view=azureml-api-2 (accessed Jan. 10, 2024).
[5] AWS. "What is MLOps?" Amazon Web Services. [online], Available: https://aws.amazon.com/what-is/mlops/?nc1=h_ls (accessed Jan. 10, 2024).
[6] L. Faubel, K. Schmid, and H. Eichelberger, "MLOps Challenges in Industry 4.0," SN Computer Science, vol. 4, no. 6, 2023, doi: 10.1007/s42979-023-02282-2.
[7] G. Cloud. "MLOps: Continuous delivery and automation pipelines in machine learning." Google Cloud. [online], Available: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning (accessed Jan. 9, 2024).
[8] Y. Zhou, Y. Yu, and B. Ding, "Towards mlops: A case study of ml pipeline platform," in 2020 International conference on artificial intelligence and computer engineering (ICAICE), Beijing, China, Oct. 23-25, 2020: IEEE, pp. 494-500.
[9] D. Kreuzberger, N. Kühl, and S. Hirschl, "Machine learning operations (mlops): Overview, definition, and architecture," IEEE access, vol. 11, pp. 31866-31879, 2023.
[10] X. Wang et al., "Couler: Unified Machine Learning Workflow Optimization in Cloud," arXiv preprint arXiv:2403.07608, 2024, [online], Available: https://arxiv.org/abs/2403.07608 (accessed Jan. 15, 2024).
[11] P. K. Mandal. "Recent Cloud Platform Outages 2023." LinkedIn. [online], Available: https://www.linkedin.com/pulse/recent-cloud-platform-outages-2023-pankaj-kumar-mandal/ (accessed Jan. 15, 2024).
[12] W. T. Millward. "The 15 Biggest Cloud Outages Of 2023." CRN Magazine, [online], Available: https://www.crn.com/news/cloud/the-15-biggest-cloud-outages-of-2023 (accessed Jan. 15, 2024).
[13] A. Wongpanich et al., "Training EfficientNets at supercomputer scale: 83% ImageNet top-1 accuracy in one hour," in 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Portland, OR, USA, June 17-21, 2021, pp. 947-950.
[14] M. Tan and Q. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks," in International conference on machine learning, Long Beach, USA, June 09-15, 2019: PMLR, pp. 6105-6114.
[15] NVIDIA. "DeepLearningExamples: ResNeXt101-32x4d For PyTorch." github.com. [online], Available: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Classification/ConvNets/resnext101-32x4d/README.md (accessed May 24, 2024).
[16] S. Narasimhan. "NVIDIA Clocks World’s Fastest BERT Training Time and Largest Transformer Based Model, Paving Path For Advanced Conversational AI." NVIDIA. [online], Available: https://developer.nvidia.com/blog/training-bert-with-gpus/ (accessed May 24, 2024).
[17] D. Narayanan et al., "Efficient large-scale language model training on gpu clusters using megatron-lm," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, USA, Nov. 14-19, 2021, pp. 1-15.
[18] M.-N. Tran, X. T. Vu, and Y. Kim, "Proactive stateful fault-tolerant system for kubernetes containerized services," IEEE Access, vol. 10, pp. 102181-102194, 2022.
[19] Z. Di, E. Shao, and G. Tan, "High-performance migration tool for live container in a workflow," International Journal of Parallel Programming, vol. 49, pp. 658-670, 2021.
[20] PyTorch. "PyTorch Recipes: Saving and loading a general checkpoint in PyTorch." pytorch.org. [online], Available: https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html (accessed Mar. 3, 2024).
[21] Tensorflow. "TensorFlow Core: Training checkpoints." tensorflow.org, [online], Available: https://www.tensorflow.org/guide/checkpoint (accessed Mar. 3, 2024).
[22] S. S. Alahmari, D. B. Goldgof, P. R. Mouton, and L. O. Hall, "Challenges for the repeatability of deep learning models," IEEE Access, vol. 8, pp. 211860-211868, 2020.
[23] E. Rojas, A. N. Kahira, E. Meneses, L. B. Gomez, and R. M. Badia, "A study of checkpointing in large scale training of deep neural networks," arXiv preprint arXiv:2012.00825, 2020, [online], Available: https://arxiv.org/abs/2012.00825.
[24] The Apache Software Foundation. "Airflow Document: Task - Tasks Instances." Airflow. [online], Available: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#task-instances (accessed Feb. 8, 2024).
[25] Ray. "Ray Document: Task Fault Tolerance." [online], Available: https://docs.ray.io/en/latest/ray-core/fault_tolerance/tasks.html (accessed Mar. 4, 2024).
[26] S. Zhuang, S. Wang, E. Liang, Y. Cheng, and I. Stoica, "ExoFlow: A universal workflow system for Exactly-Once DAGs," in 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), Boston, MA, USA, July 10-12, 2023, pp. 269-286.
[27] G. Agha, Actors: a model of concurrent computation in distributed systems. MIT press, 1986.
[28] H. Foidl, V. Golendukhina, R. Ramler, and M. Felderer, "Data pipeline quality: Influencing factors, root causes of data-related issues, and processing problem areas for developers," Journal of Systems and Software, vol. 207, Art. no. 111855, 2024.
[29] The Apache Software Foundation. "Best practices for orchestrating MLOps pipelines with Airflow." Airflow [online], Available: https://docs.astronomer.io/learn/airflow-mlops (accessed Mar. 5, 2024).
[30] The Apache Software Foundation. "Airflow History." Airflow. [online], Available: https://airflow.apache.org/docs/apache-airflow/stable/project.html#history (accessed Apr. 11, 2024).
[31] M. Beauchemin. "Use Apache Airflow (incubating) to author workflows as directed acyclic graphs (DAGs) of tasks." airbnb.io. [online], Available: https://airbnb.io/projects/airflow/ (accessed Apr. 10, 2024).
[32] 陳鴻嘉. "[LINE TECHPULSE 2023] Highlights and Agenda Information." [online], Available: https://engineering.linecorp.com/zh-hant/blog/line-techpulse-2023-report (accessed May 1, 2024).
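[33] 翁芊儒. "Line Taiwan reveals for the first time its ML collaboration platform built on MLOps concepts, integrating development tools to reduce communication and collaboration costs and accelerate the deployment of ML applications." [online], Available: https://www.ithome.com.tw/news/141774 (accessed May 1, 2024).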
[34] PingCAP. "Workflow Scheduler - Ranking." ossinsight.io. [online], Available: https://ossinsight.io/collections/workflow-scheduler/?monthly-rankings=issues (accessed May 6, 2024).
[35] The Apache Software Foundation. "Airflow Documentation: Kubernetes Executor." Airflow. [online], Available: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/kubernetes.html (accessed Apr. 2, 2024).
[36] M. Barry et al., "StreamMLOps: Operationalizing Online Learning for Big Data Streaming & Real-Time Applications," in 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, California, USA, Apr. 3-7, 2023: IEEE, pp. 3508-3521.
[37] D. Panchal, I. Baran, D. Musgrove, and D. Lu, "MLOps: Automatic, Zero-touch and Reusable Machine Learning Training and Serving Pipelines," in 2023 IEEE International Conference on Internet of Things and Intelligence Systems (IoTaIS), Bali, Indonesia, Nov. 28-30, 2023: IEEE, pp. 175-181.
[38] kubernetes. "Kubernetes Documentation." The Linux Foundation. [online], Available: https://kubernetes.io/docs/home/ (accessed June 9, 2024).
[39] Tony. "K8s — How Does a Pod Acquire an IP Address?" medium.com. [online], Available: https://tonylixu.medium.com/k8s-how-does-a-pod-acquire-an-ip-address-06bef6f50288 (accessed May 22, 2024).
[40] 邱宏瑋. "CNI - Flannel - IP Management." HWCHIU Learning Notes. [online], Available: https://www.hwchiu.com/docs/2019/iThome_Challenge/cni-flannel-ii (accessed May 22, 2024).
[41] MinIO. "MinIO Documentation." Minio.com. [online], Available: https://min.io/docs/minio/linux/operations/concepts.html (accessed May 23, 2024).
[42] J. Bampton. "Apache OpenDAL." The Apache Software Foundation. [online], Available: https://opendal.apache.org/docs/overview (accessed May 23, 2024).
[43] G. Cloud. "What is MLOps?" Google Cloud. [online], Available: https://cloud.google.com/discover/what-is-mlops#section-6 (accessed June 5, 2024).
[44] C. Kim, G.-Y. Kim, and S. Kim, "A Microservice-based MLOps Platform for Efficient Development of AI Services in an Edge-Cloud Environment," in 2023 14th International Conference on Information and Communication Technology Convergence (ICTC), Korea, Oct. 11-13, 2023: IEEE, pp. 1507-1509.
[45] J. Lopez, L. Babun, H. Aksu, and A. S. Uluagac, "A survey on function and system call hooking approaches," Journal of Hardware and Systems Security, vol. 1, pp. 114-136, 2017.
[46] A. Mohanta, A. Saldanha, A. Mohanta, and A. Saldanha, "Code injection, process hollowing, and API hooking," in Malware Analysis and Detection Engineering: A Comprehensive Approach to Detect and Analyze Modern Malware, pp. 267-329, 2020.
[47] M. Kahlhofer, P. Kern, S. Henning, and S. Rass, "Benchmarking Function Hook Latency in Cloud-Native Environments," arXiv preprint arXiv:2310.12702, 2023.
[48] P. J. Guo and D. Engler, "CDE: Using System Call Interposition to Automatically Create Portable Software Packages," in 2011 USENIX Annual Technical Conference (USENIX ATC 11), USA, June 14-17, 2011.
[49] kubernetes. "Container Lifecycle Hooks." kubernetes. [online], Available: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/ (accessed June 5, 2024).
[50] angular. "Component Lifecycle." angular. [online], Available: https://v17.angular.io/guide/lifecycle-hooks (accessed June 5, 2024).
[51] keras. "Model Checkpoint." keras.io. [online], Available: https://keras.io/api/callbacks/model_checkpoint/ (accessed June 10, 2024).
[52] J. Mohan, A. Phanishayee, and V. Chidambaram, "CheckFreq: Frequent, Fine-Grained DNN Checkpointing," in 19th USENIX Conference on File and Storage Technologies (FAST 21), Santa Clara, CA, USA, Feb. 23-25, 2021, pp. 203-216.
[53] A. Wood et al., "Towards Fast Crash-Consistent Cluster Checkpointing," in 2022 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, Sep. 19-23, 2022: IEEE, pp. 1-8.
[54] B. Nicolae, J. Li, J. M. Wozniak, G. Bosilca, M. Dorier, and F. Cappello, "Deepfreeze: Towards scalable asynchronous checkpointing of deep learning models," in 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, Australia, May 11-14, 2020: IEEE, pp. 172-181.
[55] A. Agrawal et al., "DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization," arXiv preprint arXiv:2306.11800, 2023, [online], Available: https://doi.org/10.48550/arXiv.2306.11800.
[56] H. E. Plesser, "Reproducibility vs. replicability: a brief history of a confused terminology," Frontiers in neuroinformatics, vol. 11, Art. no. 76, 2018.
[57] N. Ferro and D. Kelly, "SIGIR initiative to implement ACM artifact review and badging," in ACM SIGIR Forum, 2018, vol. 52, no. 1: ACM New York, NY, USA, pp. 4-10.
[58] S. S. Khan, B. Palmer, C. Edelmaier, and H. M. Aktulga, "OpenRAND: A performance portable, reproducible random number generation library for parallel computations," SoftwareX, vol. 27, p. 101773, 2024.
[59] N. Akhila, C. U. Kumari, K. Swathi, T. Padma, and N. M. Rao, "Performance analysis of pseudo random bit generator using modified dual-coupled linear congruential generator," in 2021 International Conference on Intelligent Technologies (CONIT), Hubbali, Karnataka, India, Jun. 25-27, 2021: IEEE, pp. 1-5.
[60] M. Kadam, S. V. Siddamal, and S. Annigeri, "Design and Implementation of chaotic nondeterministic random seed-based Hybrid True Random Number Generator," in 2020 24th International Symposium on VLSI Design and Test (VDAT), India, Jul. 23-25, 2020: IEEE, pp. 1-5.
[61] Pytorch. "Reproducibility." The Linux Foundation. [online], Available: https://pytorch.org/docs/stable/notes/randomness.html (accessed June 12, 2024).
[62] A. R. Munappy, J. Bosch, and H. H. Olsson, "Data pipeline management in practice: Challenges and opportunities," in Product-Focused Software Process Improvement: 21st International Conference, PROFES 2020, Turin, Italy, November 25–27, 2020, Proceedings 21: Springer, pp. 168-184.
[63] W. Xiao et al., "Gandiva: Introspective cluster scheduling for deep learning," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), CA, USA, Oct. 8-10, 2018, pp. 595-610.
[64] A. Arjona, P. G. López, J. Sampé, A. Slominski, and L. Villard, "Triggerflow: Trigger-based orchestration of serverless workflows," Future Generation Computer Systems, vol. 124, pp. 215-229, 2021. |