Master's/Doctoral Thesis 109426024: Detailed Record




Author: Yung-Chun Chen (陳詠俊)   Department: Graduate Institute of Industrial Management (工業管理研究所)
Thesis Title: 在資料不平衡下提升分類器性能之策略研究
(A study on strategies of improving performance under class imbalanced problem)
Related Theses
★ 二階段作業研究模式於立體化設施規劃應用之探討–以半導體製造廠X及Y公司為例
★ 推行TPM活動以改善設備總合效率並提昇企業競爭力...以U公司桃園工廠為例
★ 資訊系統整合業者行銷通路策略之研究
★ 以決策樹法歸納關鍵製程暨以群集法識別關鍵路徑
★ 關鍵績效指標(KPI)之建立與推行 - 在造紙業
★ 應用實驗計劃法- 提昇IC載板錫球斷面品質最佳化之研究
★ 如何從歷史鑽孔Cp值導出新設計規則進而達到兼顧品質與降低生產成本目標
★ 產品資料管理系統建立及導入-以半導體IC封裝廠C公司為例
★ 企業由設計代工轉型為自有品牌之營運管理
★ 運用六標準差步驟與FMEA於塑膠射出成型之冷料改善研究(以S公司為例)
★ 台灣地區輪胎產業經營績效之研究
★ 以方法時間衡量法訂定OLED面板蒸鍍有機材料更換作業之時間標準
★ 利用六標準差管理提升生產效率-以A公司塗料充填流程改善為例
★ 依流程相似度對目標群組做群集分析- 以航空發動機維修廠之自修工件為例
★ 設計鏈績效衡量指標建立 —以電動巴士產業A公司為例
★ 應用資料探勘尋找影響太陽能模組製程良率之因子研究
  1. This electronic thesis is authorized for immediate open access.
  2. The open-access full text is authorized only for academic research purposes: personal, non-commercial retrieval, reading, and printing.
  3. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese) Classification is an important research topic in machine learning. With classification models, the labels in massive datasets can be assigned automatically, so decision makers can obtain usable information from sources such as transaction records and machine data while saving a great deal of time. Class imbalance is a particularly important issue here: when the class sizes in the data differ greatly, the model has difficulty classifying correctly. Previous studies have proposed many methods to alleviate this problem, but they focus mainly on raising classification metric scores and pay little attention to the potential variability these remedies introduce. If the instability caused by an improvement method is ignored, the classification results a decision maker relies on may vary considerably with the training data, leading to misjudged decisions. This study tries to find a strategy that can stably improve classifier performance under class imbalance, so that decision makers can make robust decisions without worrying about the possible uncertainty in training.
In this study, we use two real-world datasets to present the class imbalance problem. We design different class-imbalance ratios and dataset sizes and examine their effects on the classifiers. Three common classifiers are used: Logistic Regression, Support Vector Machine, and Random Forest. From the experimental results, we try to identify the main factors that determine whether a performance improvement is stable, and we propose an index for measuring stability. Finally, we propose a strategy that allows a model to improve its performance stably under class imbalance.
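To make the experimental design above concrete, here is a minimal sketch of how training sets with controlled imbalance ratios could be drawn from a real dataset. This is not the thesis's actual code: the file name "transactions.csv", the "label" column, and the ratio grid are illustrative assumptions, and the dataset-size dimension is omitted for brevity.

```python
import pandas as pd

def subsample_to_ratio(df, label_col, minority_label, ratio, seed):
    """Keep every minority-class row and subsample the majority class so the
    majority-to-minority ratio is approximately `ratio` : 1."""
    minority = df[df[label_col] == minority_label]
    majority = df[df[label_col] != minority_label]
    n_major = min(len(majority), int(len(minority) * ratio))
    majority_sub = majority.sample(n=n_major, random_state=seed)
    # Shuffle so the classes are interleaved before any train/test split.
    return pd.concat([minority, majority_sub]).sample(frac=1, random_state=seed)

# Hypothetical real-world data in which the minority class is labeled 1.
data = pd.read_csv("transactions.csv")
for ratio in [5, 10, 20, 50]:              # majority : minority ratios to test
    subset = subsample_to_ratio(data, "label", 1, ratio, seed=42)
    print(ratio, len(subset), subset["label"].value_counts().to_dict())
```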
Abstract (English) Classification is one of the most common topics in machine learning. Classification models can recognize labels automatically, which saves a great deal of time and makes the massive information from digital transactions or machine logs usable. The class imbalance problem is one of the most important and popular issues in this field. Under an imbalanced class ratio, classifiers cannot classify very well. Researchers have proposed several methods to address this problem, but most of them focus only on improving certain measurements. If the variation of the results is ignored, decision makers may over- or underestimate the classifiers because of differences in the training datasets, leading to unsuitable decisions. In this study, we try to find a strategy that stably improves the performance of classifiers under class imbalance. With this strategy, decision makers can make robust decisions without worrying about large variation in the classification results.
We conduct a series of experiments with two real-world datasets to present the class imbalance problem, including settings with different imbalance ratios and dataset sizes. Three classification models are used in the experiments: Logistic Regression, Support Vector Machine, and Random Forest. We examine the effects of the cost-sensitive and under-sampling methods with these three models. From the experimental results, we try to identify the main causes of (in)stability and propose a method to describe the stability of the improvement methods. In the end, we construct a strategy for raising the ability of classifiers in a stable way.
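As a rough illustration of the two remedies named above, the sketch below applies class-weighted (cost-sensitive) versions of the three classifiers together with random under-sampling, and summarizes stability as the spread of the F1 score over repeated train/test splits. The stability index actually proposed in the thesis is not reproduced here; the synthetic data, the choice of F1, and the use of standard deviation as a stability proxy are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real-world data: roughly a 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

def undersample(X, y, seed):
    """Randomly drop majority-class samples until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    idx_min = np.flatnonzero(y == 1)
    idx_maj = rng.choice(np.flatnonzero(y == 0), size=len(idx_min), replace=False)
    idx = np.concatenate([idx_min, idx_maj])
    return X[idx], y[idx]

# Cost-sensitive versions of the three classifiers via class weights.
models = {
    "LR":  lambda: LogisticRegression(max_iter=1000, class_weight="balanced"),
    "SVM": lambda: SVC(class_weight="balanced"),
    "RF":  lambda: RandomForestClassifier(class_weight="balanced", random_state=0),
}

for name, make_model in models.items():
    scores = []
    for seed in range(20):                     # repeated splits expose variability
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed)
        X_tr, y_tr = undersample(X_tr, y_tr, seed)   # second remedy: under-sampling
        clf = make_model().fit(X_tr, y_tr)
        scores.append(f1_score(y_te, clf.predict(X_te)))
    # Mean F1 reflects performance; its standard deviation is a rough stability proxy.
    print(f"{name}: mean F1 = {np.mean(scores):.3f}, std = {np.std(scores):.3f}")
```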
Keywords (Chinese) ★ classification (分類)
★ class imbalance (資料類別不平衡)
★ cost-sensitive methods (成本敏感方法)
★ stability (穩定性)
Keywords (English) ★ classification
★ class imbalance problem
★ cost-sensitive methods
★ stability
Table of Contents
Chinese Abstract
Abstract
Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Research Objectives
1.3 Research Framework
Chapter 2 Literature Review
2.1 Class Imbalanced Problem with Measurement
2.2 Classifiers
2.3 Proposed Improvement Methods
2.3.1 Under-sampling Approach
2.3.2 Cost-sensitive Approach
Chapter 3 Methodology
3.1 Classifiers and Measurement
3.2 Improvement Methods
3.3 Datasets
Chapter 4 Numerical Example
Chapter 5 Conclusion
References
References
[1] Ali, A., Shamsuddin, S. M., & Ralescu, A. L. (2013). “Classification with class imbalance
problem.” Int. J. Advance Soft Compu. Appl, 5(3).
[2] Breiman, L. (2001). “Random forests.” Machine learning, 45(1), 5-32.
[3] Cain, M., & Janssen, C. (1995). “Real estate price prediction under asymmetric loss.”
Annals of the Institute of Statistical Mathematics, 47(3), 401-414.
[4] Christoffersen, P. F., & Diebold, F. X. (1997). “Optimal prediction under asymmetric
loss.” Econometric Theory, 13(6), 808-817.
[5] Cortes, C., & Vapnik, V. (1995). “Support-vector networks.” Machine learning, 20(3),
273-297.
[6] Davis, J., & Goadrich, M. (2006). “The relationship between Precision-Recall and ROC
curves.” Proceedings of the 23rd international conference on Machine learning, 233-240.
[7] Domingos, P. (1999). “Metacost: A general method for making classifiers cost-sensitive.”
Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery
and data mining, 155-164.
[8] Drummond, C., & Holte, R. C. (2003). “C4.5, class imbalance, and cost sensitivity: why
under-sampling beats over-sampling.” Workshop on learning from imbalanced
datasets, 11, 1-8.
[9] He, H., & Garcia, E. A. (2009). “Learning from imbalanced data.” IEEE Transactions on
knowledge and data engineering, 21(9), 1263-1284.
[10] Kotsiantis, S. B., Zaharakis, I. D., & Pintelas, P. E. (2006). “Machine learning: a review
of classification and combining techniques.” Artificial Intelligence Review, 26(3), 159-
190.
[11] Kukar, M., & Kononenko, I. (1998). “Cost-sensitive learning with neural
networks.” ECAI, 15, 88-94.
[12] Liaw, A., & Wiener, M. (2002). “Classification and regression by randomForest.” R
news, 2(3), 18-22.
[13] Maalouf, M., & Siddiqi, M. (2014). “Weighted logistic regression for large-scale
imbalanced and rare events data.” Knowledge-Based Systems, 59, 142-148.
[14] Bach, M., Werner, A., & Palt, M. (2019). “The Proposal of Undersampling Method for
Learning from Imbalanced Datasets.” Procedia Computer Science, 159, 125-134.
[15] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). “SMOTE:
synthetic minority over-sampling technique.” Journal of artificial intelligence research, 16,
321-357.
[16] Owen, A. B. (2007). “Infinitely Imbalanced Logistic Regression.” Journal of Machine
Learning Research, 8(4).
[17] Pregibon, D. (1981). “Logistic regression diagnostics.” The annals of statistics, 9(4),
705-724.
[18] Sadouk, L., Gadi, T., & Essoufi, E. H. (2021). “A novel cost-sensitive algorithm and new
evaluation strategies for regression in imbalanced domains.” Expert Systems, 38(4),
e12680.
[19] Safavian, S. R., & Landgrebe, D. (1991). “A survey of decision tree classifier
methodology.” IEEE transactions on systems, man, and cybernetics, 21(3), 660-674.
[20] Stehman, S. V. (1997). “Selecting and interpreting measures of thematic classification
accuracy.” Remote Sensing of Environment, 62, 77-89.
[21] Sun, Y., Kamel, M. S., Wong, A. K., & Wang, Y. (2007). “Cost-sensitive boosting for
classification of imbalanced data.” Pattern recognition, 40(12), 3358-3378.
[22] Tang, Y., Zhang, Y. Q., Chawla, N. V., & Krasser, S. (2008). “SVMs modeling for highly
imbalanced classification.” IEEE Transactions on Systems, Man, and Cybernetics, 39(1),
281-288.
[23] Wang, T., Qin, Z., Jin, Z., & Zhang, S. (2010). “Handling over-fitting in test cost-sensitive
decision tree learning by feature selection, smoothing and pruning.” Journal of
Systems and Software, 83(7), 1137-1147.
[24] Zadrozny, B., & Elkan, C. (2001). “Learning and making decisions when costs and
probabilities are both unknown.” Proceedings of the 7th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 204-213.
Advisor: Fu-Shiang Tseng (曾富祥)   Approval Date: 2022-7-11
