Abstract: | According to international research reports, companies can become more competitive by actively pursuing digital transformation internally. In the telecommunications industry, where the author works, digital transformation helps increase the number of customers and reduce unit costs. Over the past two years, the company has vigorously promoted the adoption of cloud (Azure), AI, and big data (Databricks platform) technologies. Our unit's specific approach is to migrate all data warehouse data to the cloud. The on-premises data warehouse architecture (Teradata) carries high machine rental and license fees and its capacity cannot be expanded flexibly, whereas the cloud architecture is easy to deploy and can scale with demand. During the migration, new ETL Jobs must be developed for the selected PaaS platform (Databricks), and the job schedule must be redesigned around the characteristics of the cloud.
An estimated several thousand ETL Jobs will run each day, and they must finish within the time set by the company's KPIs while keeping the cost-benefit ratio optimal, which makes cloud job scheduling a highly challenging problem. Scheduling must account for the platform's characteristics and weigh multiple factors, such as resource allocation (VM) and job priority (Job), to arrive at the best solution. To address this problem, this study evaluated different combinations of methods and then ran experiments to compare them, with the aim of collecting the performance of each combination under different conditions and identifying the most suitable method for optimizing job scheduling. The experiments were divided into four parts: Experiment 1: a custom CNN was validated on different datasets with different matrix sizes. Experiment 2: three different DNNs (CNN, ResNet-18, MobileNet) were evaluated on the same dataset. Experiment 3: ResNet-18 was validated on different datasets with different matrix sizes. Experiment 4: Dueling DQN was added to the Experiment 3 setup and validated on different datasets with different matrix sizes. Using resource-related feature values of the Job and the Cluster reduces the complexity of the state space, which makes the subsequent deep reinforcement learning algorithm more efficient and stable. The DDDQN (Dueling Double Deep Q Network) algorithm used in this study is a form of reinforcement learning: the agent uses end-to-end reinforcement learning to learn successful policies directly from high-dimensional sensory inputs, in an architecture that combines deep convolutional networks with reinforcement learning. Traditional job scheduling, by contrast, is usually optimized with heuristic algorithms or rule-based methods, which require manually designed decision rules and cannot adapt to a constantly changing production environment. |
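The two ideas that give DDDQN its name can be illustrated with a minimal sketch. It is not the study's implementation: the function names, the reading of actions as candidate VMs for a job, and the numeric values are all assumptions made for illustration. The dueling part combines a state value V(s) with per-action advantages A(s, a) using the common mean-subtracted aggregation. The double part lets the online network choose the next action while the target network evaluates it.

```python
from typing import List


def dueling_q_values(state_value: float, advantages: List[float]) -> List[float]:
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


def double_dqn_target(
    reward: float,
    gamma: float,
    done: bool,
    next_q_online: List[float],
    next_q_target: List[float],
) -> float:
    """Double DQN target: the online net selects the action, the target net scores it."""
    if done:
        # Terminal transition: no bootstrapped future value.
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Hypothetical example: three candidate VMs (actions) for one job.
q_values = dueling_q_values(state_value=1.0, advantages=[0.5, -0.5, 0.0])
target = double_dqn_target(
    reward=1.0,
    gamma=0.5,
    done=False,
    next_q_online=[0.2, 0.9, 0.1],
    next_q_target=[2.0, 4.0, 8.0],
)
print(q_values, target)
```

In this example the online network prefers action 1, so the target uses the target network's estimate for action 1 (4.0) rather than its own maximum (8.0). Decoupling selection from evaluation in this way is what reduces the overestimation of Q-values.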