    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86683


    Title: A Hybrid Preprocessing Approach for the Class Imbalance Problem - Using Machine Learning and Generative Adversarial Network (混合式前處理於類別不平衡問題之研究 - 結合機器學習與生成對抗網路)
    Author: Lin, Cian (林倩)
    Contributors: Department of Information Management
    Keywords: class imbalance; generative adversarial networks; classification; deep learning; instance selection
    Date: 2021-09-08
    Upload time: 2021-12-07 13:06:46 (UTC+8)
    Publisher: National Central University
    Abstract: Class imbalance refers to the skewed distribution that arises when the number of samples in one class of a dataset far exceeds that of another. In pursuit of high overall accuracy, traditional classifiers build prediction models biased toward the majority class while neglecting the high-value minority class, so the classifier learns poor classification rules during training. Class imbalance is therefore a challenging problem in machine learning and is increasingly common in the real world, for example in credit-card fraud detection, medical diagnosis, information retrieval, and text classification.
    In addition, because high-value minority-class data are hard to collect, such resources are often held by large companies or by industries in related fields, such as healthcare and finance. On the other hand, properly removing noise can effectively improve accuracy, so methods are needed to identify which samples should be deleted and which should be retained as representative samples.
    To address these problems, this study uses 44 class-imbalanced datasets from the KEEL repository. In the data-preprocessing step, a data-level approach resamples the training set to redistribute the class distribution. Three instance-selection methods (IB3, DROP3, GA) are used for data cleaning and three oversampling methods (SMOTE, Vanilla GAN, CTGAN) for minority-class sample generation; this study also modifies the Vanilla GAN architecture to generate structured data. These algorithms are combined, and the combinations are compared with methods from the previous literature to find the best preprocessing combination and to analyze performance under different classification models.
    Beyond examining different data-generation methods, and to better understand how different factors affect imbalanced data, we study the relationship between the above combinations, the imbalance ratio, and the training-set size.
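The imbalance ratio mentioned above is conventionally the majority-class count divided by the minority-class count. A minimal sketch (the function name and example labels are illustrative, not from the thesis):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Imbalance Ratio (IR): majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Example: 90 majority-class labels vs. 10 minority-class labels
labels = [0] * 90 + [1] * 10
print(imbalance_ratio(labels))  # 9.0
```

Datasets in the KEEL imbalanced collection are commonly grouped by this ratio (e.g. IR below or above 9), which is why the study treats it as an experimental factor.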
    The experimental results show that instance selection with IB3 combined with Vanilla GAN oversampling is the most effective combination for the class imbalance problem. With the hybrid preprocessing approach, once the data have been cleaned of noise by instance selection, a GAN based on deep neural networks generates better structured data than SMOTE, which is based on traditional linear interpolation.
    The class-skewed distribution occurs when the number of examples representing one class is much lower than that of the other classes. To maximize accuracy, traditional classifiers tend to misclassify most minority-class samples as the majority class. This phenomenon limits the construction of effective classifiers for the valuable minority class. Hence, the class imbalance problem is an important issue in machine learning. It occurs in many real-world applications, such as fault diagnosis, medical diagnosis, and face recognition.
    Additionally, since minority-class data are not easy to collect, such resources are often held by large companies or related industries, such as medical and financial institutions. On the other hand, properly removing noise can effectively improve accuracy; therefore, we use several methods to identify which data should be deleted and which should be retained as representative samples.
    To solve the above problems, our experiments use 44 class-imbalanced datasets from KEEL to build classification models. In the data-preprocessing step, a data-level method resamples the training set to redistribute the data distribution. We use three instance-selection methods (IB3, DROP3, GA) for data cleaning and three oversampling methods (SMOTE, Vanilla GAN, CTGAN) for minority-sample generation. Moreover, we modify the Vanilla GAN architecture to generate structured data. By comparing against methods from the previous literature, we not only find the best preprocessing combination but also analyze its performance under different classification models.
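The abstract characterizes SMOTE as traditional linear interpolation between minority samples. A minimal sketch of that core idea (the function name, parameters, and toy data are illustrative, not from the thesis; real SMOTE implementations such as imbalanced-learn's add further refinements):

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, rng=None):
    """Generate synthetic minority samples by linearly interpolating between
    a randomly chosen minority sample and one of its k nearest minority-class
    neighbours -- the core idea behind SMOTE."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Oversample a tiny 2-D minority class by 10 synthetic samples
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_like_oversample(X_min, n_new=10, rng=0)
print(X_new.shape)  # (10, 2)
```

Because every synthetic point lies on a segment between two existing minority samples, SMOTE cannot create samples outside the minority class's convex hull; GAN-based generators, in contrast, learn the minority distribution itself, which is the contrast the study evaluates.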
    According to the experimental results, the most effective combination is instance selection (IB3) with oversampling (Vanilla GAN). For the hybrid preprocessing method, after the data are cleaned by instance selection, a GAN (based on deep neural networks) generates structured data with better results than SMOTE (based on traditional linear interpolation).
    Appears in Collections: [Graduate Institute of Information Management] Master's and Doctoral Theses

    Files in This Item:

    File        Description    Size    Format    Views
    index.html                 0Kb     HTML      71 (View/Open)


    All items in NCUIR are protected by copyright, with all rights reserved.

