NCU Institutional Repository (中大機構典藏) — theses, past exam papers, journal articles, and research projects: Item 987654321/89739


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/89739


    Title: 基於 Kubernetes 與 OpenFaaS 之分散式多源數據工作流處理系統;Distributed workflow processing for multisource data streams based on Kubernetes and OpenFaaS
    Authors: 鄭博謙;Cheng, Po-Chien
    Contributors: Department of Computer Science and Information Engineering (資訊工程學系)
    Keywords: Microservice;Kafka;ETL;Kubernetes;Functions as a Service
    Date: 2022-07-07
    Issue Date: 2022-10-04 11:58:04 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: In today's big-data era, collecting diverse data has become an important part of enterprise resources. ETL is a common practice for data acquisition, processing, and analysis: data gathered from different sources passes through a series of pre-processing, aggregation, and filtering steps before being stored for subsequent analysis. Because each data source requires its own pipeline, often built in a specific programming language, the maintainability and manageability of the data streams decline as the system grows in scope and scale. In recent years, containerization technology has developed rapidly, and many software services are now deployed and run as microservices; container-orchestration tools provide automatic deployment and scaling across host clusters, ensuring that services run on available nodes. However, idle microservices still occupy system resources, which has motivated the newer serverless model. This research therefore designs a distributed workflow-processing mechanism for multi-source data, using Kafka as a staging platform for the various input and output data sources. A workflow manager defines the conditions and steps required to process each data stream; the steps are encapsulated as serverless functions (FaaS) deployed on Kubernetes and invoked by the workflow manager. By monitoring traffic at the FaaS Gateway, the system automatically scales services out to handle traffic peaks and scales idle services back in to reduce resource usage. Finally, the processed data is stored in an external data warehouse.
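The abstract's core idea — a workflow manager chaining small stateless processing steps, each of which would be an OpenFaaS function invoked through the gateway in the real system — can be sketched as follows. This is a minimal illustration only: the `Workflow`, `add_step`, and `run` names are hypothetical, not the thesis's actual API, and the steps run in-process here rather than as HTTP calls to a FaaS gateway.

```python
from typing import Any, Callable


class Workflow:
    """Chains stateless processing steps, as a FaaS-backed manager might.

    Hypothetical sketch: in the system described in the abstract, each step
    would be an HTTP invocation of an OpenFaaS function behind the gateway,
    with Kafka staging the records between sources and sinks.
    """

    def __init__(self) -> None:
        self.steps: list[Callable[[Any], Any]] = []

    def add_step(self, fn: Callable[[Any], Any]) -> "Workflow":
        self.steps.append(fn)
        return self

    def run(self, record: Any) -> Any:
        # Pass the record through every step in order (here: local calls;
        # in the described system: gateway calls that Kubernetes can scale).
        for fn in self.steps:
            record = fn(record)
        return record


# Example ETL chain: parse -> transform -> format for the warehouse sink.
etl = (
    Workflow()
    .add_step(lambda r: {**r, "value": float(r["value"])})  # pre-process
    .add_step(lambda r: {**r, "value": r["value"] * 2})     # aggregate/transform
    .add_step(lambda r: f'{r["source"]}:{r["value"]}')      # load format
)

result = etl.run({"source": "sensor-a", "value": "21"})
# result == "sensor-a:42.0"
```

In the full design, each lambda above would instead be a deployed function, and the gateway's traffic metrics would drive scale-out under peak load and scale-in when a step sits idle.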
    Appears in Collections:[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

    Files in This Item:

    File: index.html (0 KB, HTML)


    All items in NCUIR are protected by copyright, with all rights reserved.

