Master's/Doctoral Thesis 108423027: Detailed Record




Name: Cian Lin (林倩)    Department: Information Management
Thesis Title: A Hybrid Preprocessing Approach for the Class Imbalance Problem - Using Machine Learning and Generative Adversarial Network
(混合式前處理於類別不平衡問題之研究 - 結合機器學習與生成對抗網路)
Related Theses
★ One-Class Classification for Imbalanced Datasets: Combined with Missing Value Imputation and Instance Selection
★ One-Class Classification for Class-Imbalanced Datasets: Combining Feature Selection and Ensemble Learning
★ Applying Text Mining to Stock Price Prediction: On the Relationship between Traditional Machine Learning, Deep Learning, and Different Financial News Sources
★ Single and Parallel Ensemble Feature Selection for Multi-Class Imbalance Problems
Files: Full text viewable in the repository after 2026-09-02.
Abstract (Chinese) Class imbalance refers to the skewed distribution that arises when one class in a dataset has far more samples than another. Because traditional classifiers pursue high overall accuracy, the resulting models are biased toward the majority class and neglect the high-value minority class, so poor classification rules are learned during training. Class imbalance is therefore a challenging problem in machine learning and is increasingly common in the real world, for example in credit card fraud detection, medical diagnosis, information retrieval, and text classification.
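To see concretely why overall accuracy misleads here, consider a minimal sketch (the 95:5 split and all numbers below are hypothetical, not from the thesis): a model that always predicts the majority class looks accurate while missing every minority sample.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical 95:5 imbalanced dataset (illustrative only).
X = np.random.randn(1000, 4)
y = np.array([0] * 950 + [1] * 50)  # 0 = majority, 1 = minority

# A "classifier" that always predicts the majority class.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))             # 0.95 -- looks good on paper
print(recall_score(y, pred, pos_label=1))  # 0.0  -- every minority sample missed
```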
Moreover, high-value minority-class data are difficult to collect, and such resources are often held by large companies or domain industries such as healthcare and finance. On the other hand, appropriately removing noise can effectively improve accuracy, so we need methods to identify which instances should be deleted and which should be retained as representative samples.
To address these problems, this study uses 44 class-imbalanced datasets from the KEEL repository and adopts data-level methods in the data preprocessing step, resampling the training set to redistribute the data. We apply three instance selection methods (IB3, DROP3, GA) for data cleaning and three oversampling methods (SMOTE, Vanilla GAN, CTGAN) to generate minority-class samples; the Vanilla GAN architecture is also modified to generate structured (tabular) data. We combine these algorithms, compare them against methods from the literature to find the best preprocessing combination, and analyze their performance under different classification models.
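As a rough sketch of this hybrid data-level idea (clean first, then oversample), assuming scikit-learn and imbalanced-learn are available: IB3, DROP3, and GA have no standard library implementations, so EditedNearestNeighbours stands in for the instance selection step, and SMOTE stands in for the generative step; in the thesis's pipeline the second step would instead sample minority rows from a Vanilla GAN or CTGAN. All names and parameters below are illustrative.

```python
import numpy as np
from collections import Counter
from imblearn.under_sampling import EditedNearestNeighbours  # stand-in for IB3/DROP3/GA
from imblearn.over_sampling import SMOTE                     # stand-in for GAN/CTGAN sampling

def hybrid_resample(X, y):
    # Step 1: instance selection removes noisy/borderline examples.
    X_clean, y_clean = EditedNearestNeighbours().fit_resample(X, y)
    # Step 2: oversample the minority class back toward balance.
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_clean, y_clean)
    return X_bal, y_bal

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = np.array([0] * 450 + [1] * 50)
X_bal, y_bal = hybrid_resample(X, y)
print(Counter(y), "->", Counter(y_bal))
```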
Beyond comparing different ways of generating data, and to better understand how different factors affect imbalanced data, we examine the relationship between these combinations and both the imbalance ratio and the training data size.
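For reference, the imbalance ratio is conventionally the majority-to-minority class size ratio (the convention used by the KEEL repository); a small helper written under that assumption:

```python
from collections import Counter

def imbalance_ratio(y):
    """IR = |majority class| / |minority class| (KEEL convention)."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

# e.g., 450 majority vs. 50 minority samples -> IR = 9.0
print(imbalance_ratio([0] * 450 + [1] * 50))
```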
The experimental results show that instance selection with IB3 combined with Vanilla GAN oversampling is the most effective combination for the class imbalance problem. With hybrid preprocessing, once instance selection has cleaned the noise from the data, a GAN based on deep neural networks generates better structured data than SMOTE, which is based on traditional linear interpolation.
Abstract (English) A skewed class distribution occurs when the number of examples representing one class is much lower than that of the other classes. To maximize accuracy, traditional classifiers tend to misclassify most samples in the minority class into the majority class. This phenomenon limits the construction of effective classifiers for the valuable minority class. Hence, the class imbalance problem is an important issue in machine learning. It occurs in many real-world applications, such as fault diagnosis, medical diagnosis, and face recognition.
Additionally, since minority data are not easy to collect, such resources are often held by large companies or related industries, such as medical and financial institutions. On the other hand, properly removing noise can effectively improve accuracy. We therefore use several methods to identify which data should be deleted and which should be retained as representative samples.
To solve the above problems, our experiments use 44 class-imbalanced datasets from KEEL to build classification models. In the data preprocessing step, a data-level method resamples the training set to redistribute the data. We use three instance selection methods (IB3, DROP3, GA) for data cleaning and three oversampling methods (SMOTE, Vanilla GAN, CTGAN) for minority-sample generation. Moreover, we modify the Vanilla GAN architecture to generate structured data. Besides comparing against methods in the previous literature, we identify the best preprocessing combination and analyze its performance under different classification models.
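As a hedged sketch of what adapting a Vanilla GAN to tabular data can look like: a generic fully connected generator and discriminator trained with the standard adversarial loss, written in PyTorch. The layer sizes, optimizer settings, and training loop are illustrative assumptions, not the thesis's modified architecture.

```python
import torch
import torch.nn as nn

NOISE_DIM, DATA_DIM = 16, 6  # illustrative sizes, not the thesis's settings

# Fully connected generator: noise vector -> synthetic tabular row.
G = nn.Sequential(nn.Linear(NOISE_DIM, 64), nn.ReLU(), nn.Linear(64, DATA_DIM))
# Fully connected discriminator: row -> probability the row is real.
D = nn.Sequential(nn.Linear(DATA_DIM, 64), nn.LeakyReLU(0.2),
                  nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(256, DATA_DIM)  # placeholder for real minority-class rows
ones, zeros = torch.ones(256, 1), torch.zeros(256, 1)

for step in range(200):
    # Discriminator update: real rows -> 1, generated rows -> 0.
    fake = G(torch.randn(256, NOISE_DIM)).detach()
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to make generated rows classified as real.
    fake = G(torch.randn(256, NOISE_DIM))
    g_loss = bce(D(fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Draw synthetic minority rows to append to the training set.
synthetic = G(torch.randn(100, NOISE_DIM)).detach()
print(synthetic.shape)  # torch.Size([100, 6])
```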
According to the experimental results, the most effective combination is instance selection (IB3) with oversampling (Vanilla GAN). With the hybrid preprocessing method, after the data are cleaned by instance selection, a GAN based on deep neural networks generates structured data with better results than SMOTE, which is based on traditional linear interpolation.
Keywords (Chinese) ★ class imbalance (類別不平衡)
★ generative adversarial network (生成對抗網路)
★ classification (分類)
★ deep learning (深度學習)
★ instance selection (樣本選取)
Keywords (English) ★ class imbalance
★ generative adversarial networks
★ classification
★ deep learning
★ instance selection
Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2: Literature Review
2.1 Approaches to the Class Imbalance Problem
2.1.1 Data Level
2.1.2 Algorithm Level
2.1.3 Cost-Sensitive Methods
2.2 Instance Selection
2.3 Generative Adversarial Networks
2.3.1 Vanilla GAN
2.3.2 cGAN
2.3.3 WGAN
2.3.4 WGAN-GP
2.3.5 DCGAN
Chapter 3: Research Methodology
3.1 Baseline (C4.5 / C4.5+SMOTE)
3.2 Experiment 1 Design
3.2.1 Part 1: Oversampling
3.2.2 Part 2: Instance Selection
3.2.3 Part 3: Hybrid Methods
3.3 Experimental Parameter Settings
3.3.1 Instance-Based Learning (IB3)
3.3.2 Decremental Reduction Optimization Procedure (DROP3)
3.3.3 Genetic Algorithm (GA)
3.3.4 SMOTE-NC
3.3.5 Vanilla GAN
3.4 Optimization Techniques
3.4.1 Batch Normalization
3.4.2 Optimizers
3.5 Classifier Parameter Settings
3.6 Validation and Evaluation Metrics
3.6.1 Validation
3.6.2 Evaluation Metrics
Chapter 4: Experimental Results
4.1 Experimental Environment
4.2 Datasets
4.3 Baseline Results
4.4 Experiment 1 Results
4.4.1 Part 1: Oversampling
4.4.2 Part 2: Instance Selection
4.4.3 Hybrid Methods
4.5 Imbalance Ratio Analysis
Chapter 5: Conclusion
5.1 Research Contributions
5.2 Future Work and Suggestions
References
Advisors: Chih-Fong Tsai, Kuen-Liang Su (蔡志豐, 蘇坤良)    Date of Approval: 2021-09-08