Thesis 953403018: Detailed Record




Author: Che-Chang Hsu (許哲彰)    Department: Mechanical Engineering
Title: Machine learning development of imbalanced data and application of visual recognition model (不平衡數據的機器學習發展暨可視化辨識模型之應用)
Related Theses
★ Study on Dynamic Load Allocation for Load-Sharing Systems of Unequal Strength
★ A Preliminary Study on the Learning Convergence of Back-Propagation Neural Networks
★ A Study of Material Strength Degradation and Cumulative Damage
★ A Study of the Relationship between Cumulative Failure and Reliability
★ Corrosion Reliability Behavior of Carbon Steel in a Sulfur Dioxide Environment
★ A Study of Dynamic Reliability Models and Their Applications
★ A Parameter-Tuning Algorithm for Multi-Objective Quantum Search
★ Reliability Analysis of Low-Pass Filter Design
★ Static Fatigue Reliability Analysis of Optical Fiber Materials
★ A Study of Competitive Strategies in System Behavior
★ Applying Dynamic Reliability Models to Estimate the Lifetime of Electrolytic Capacitors
★ A Study of the Growth of Multiple Edge Cracks in Finite Plates
★ Reliability Analysis of Narrow Band-Pass Filters under Thickness or Refractive Index Variation
★ A Study of Preventive Maintenance Models Based on Markov Processes
★ A Study of Markov Processes in Technology Growth
★ Applying Markov Preventive Maintenance Models to Maintenance Strategies
  1. The author has agreed to make this electronic thesis openly available immediately.
  2. The open-access full text is authorized only for personal, non-profit retrieval, reading, and printing for the purpose of academic research.
  3. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.

Abstract (Chinese) Imbalanced datasets are a pervasive problem in many application scenarios of machine learning. When some classes in the training set have many samples while others have relatively few, correcting the tendency of traditional classifiers to misclassify the minority classes has become a current challenge for machine learning. Starting at the algorithm level, this study proposes a new model that combines a Bayesian classifier with the support vector machine, called the rebalancing support vector machine (SVM-rebalancing). In this learning process, the rebalancing parameters (classification weight parameters) coordinate the classification weights of the classes toward balance, and solving the rebalancing programming problem gives the minority-class samples effective identifiability. A secondary aim of this study is to determine whether imbalance is the only possible source of the misclassifications or whether other factors also contribute. Because purely predictive pattern-recognition models lack visually interpretable information, black-box methods such as neural networks and support vector machines cannot provide interpretable models, which makes it impossible to trace the root causes of misclassification. This study therefore proposes a preprocessing step that applies multidimensional scaling to the kernel function in order to construct a low-dimensional data representation space. In practice, the visual recognition model shows that overlapping, multimodal, and skewed data distributions are additional causes of poor classifier performance. Finally, this study suggests that adopting such a visual recognition model strategy reveals the problems present in the data structure, so that whenever further gains in classifier performance are sought, subsequent improvements can be directed at those problems.
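To make the algorithm-level idea concrete, the following is a minimal, hypothetical Python sketch of a cost-sensitive (class-weighted) soft-margin SVM on synthetic imbalanced data. It uses scikit-learn's SVC with class_weight="balanced" as a stand-in for the classification weight parameters; it is not the thesis's SVM-rebalancing formulation, which couples a Bayesian classifier with the SVM and tunes the weights by solving a rebalancing programming problem.

```python
# Hypothetical sketch (not the thesis's exact SVM-rebalancing model): a
# class-weighted soft-margin SVM on synthetic imbalanced data, compared with
# an unweighted baseline.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Synthetic two-class data with roughly a 95:5 class ratio.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline SVM: every misclassification costs the same, so the decision
# boundary tends to sacrifice the minority class.
baseline = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)

# Class-weighted SVM: per-class weights (here inversely proportional to class
# frequency) play the role of the rebalancing parameters, raising the error
# penalty for minority samples so they regain identifiability. The thesis
# instead derives these weights from a Bayesian rebalancing programming problem.
rebalanced = SVC(kernel="rbf", gamma="scale", class_weight="balanced").fit(X_tr, y_tr)

for name, model in [("baseline", baseline), ("class-weighted", rebalanced)]:
    print(name)
    print(classification_report(y_te, model.predict(X_te), digits=3))
```

On data this skewed, the baseline typically reports high overall accuracy but low minority-class recall, while the class-weighted model trades a little majority-class precision for a much better-identified minority class.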
Abstract (English) Imbalanced data is a common problem in many application domains of machine learning. When some classes of the training set have many samples and others have relatively few, solving the misclassification of minority-class samples by traditional classifiers has become a challenge in machine learning. This thesis proposes a new model that combines a Bayesian classifier with the support vector machine (SVM) at the algorithm level, namely SVM-rebalancing. In the learning process, the rebalance parameters (classification weight parameters) provide a coordination that balances the classification weights of the classes, and solving the rebalancing programming problem gives the minority-class samples effective identifiability. A secondary aim is to understand whether imbalance is the only possible source of the misclassifications or whether other factors also lead to them. Because the purely predictive models of pattern recognition lack visual interpretability, black-box methods such as neural networks and support vector machines cannot provide interpretable models, which makes it impossible to trace the sources of misclassification. Therefore, this study further proposes a preprocessing step of multidimensional scaling applied to the kernel function to construct a visual low-dimensional data representation space. In practice, the visual recognition model indicates that overlapping, multimodal, and skewed distributions of the data are additional causes of the classifier's poor performance. Finally, this research suggests that such a visual recognition model strategy can reveal the problems that arise in the data structure, so that whenever further improvement of classifier performance is desired, subsequent refinements can be made in that direction.
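The visualization step can likewise be sketched: apply multidimensional scaling (MDS) to the pairwise distances induced by the SVM kernel to obtain a two-dimensional representation in which overlap, multimodality, and skewness of the class distributions become visible. The dataset, kernel, and MDS settings below are illustrative assumptions, not the thesis's exact visual recognition model.

```python
# Hypothetical sketch (not the thesis's exact visual recognition model): metric
# MDS applied to kernel-induced distances, giving a 2-D map for inspecting
# overlapping, multimodal, or skewed class distributions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import rbf_kernel

# An openly available dataset is used purely for illustration.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Kernel matrix K and the feature-space distances it induces:
# d(i, j)^2 = K(i, i) + K(j, j) - 2 K(i, j).
K = rbf_kernel(X, gamma=1.0 / X.shape[1])
diag = np.diag(K)
D = np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2.0 * K, 0.0))

# Metric MDS on the precomputed kernel distances produces the low-dimensional
# data representation space used for visual inspection.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)

for label, marker in [(0, "o"), (1, "^")]:
    plt.scatter(coords[y == label, 0], coords[y == label, 1],
                marker=marker, s=12, alpha=0.6, label=f"class {label}")
plt.legend()
plt.title("Kernel-MDS view: overlap / multimodality / skewness")
plt.show()
```

In such a plot, heavily interleaved point clouds suggest class overlap, several separated clusters within one class suggest multimodality, and a long one-sided spread suggests skewness, any of which can degrade a classifier independently of the imbalance itself.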
Keywords (Chinese) ★ SVM-rebalancing (重新平衡支持向量機)
★ visual recognition model (可視化辨識模型)
★ multidimensional scaling (多元尺度變換)
Keywords (English) ★ SVM-rebalancing
★ visual recognition model
★ multidimensional scaling
Thesis Outline
Chinese Abstract
English Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
Nomenclature
Chapter 1  Introduction
1-1  Research Background
1-2  Research Motivation and Objectives
1-3  Line of Thought and Research Methods
1-4  Thesis Organization
Chapter 2  Background
2-1  Measures of Descriptive Statistics
2-2  Data Preprocessing
2-3  Overview of Statistical Learning Theory and Support Vector Machines
2-4  The Impact of Data Imbalance on Traditional Pattern Classification
2-5  Classifier Evaluation Metrics under Class Imbalance
Chapter 3  Handling the Data Imbalance Problem from a Rebalancing-Programming Perspective
Chapter 4  Case Studies
4-1  Overview of the Datasets
4-2  Simulated Cases and Product-Development Applications of the SVM-rebalancing Strategy
4-2-1  Simulated Cases on Synthetic Data
4-2-2  Product-Development Application Example
4-3  Performance Comparison of Several Algorithm-Level Methods
Chapter 5  Residual Misclassification Problems
5-1  The Residual Misclassification Problem
5-2  Constructing a Visual Low-Dimensional Data Representation Space
5-3  Multidimensional Scaling of Kernel Functions
5-4  Visual Information on the Causes of Misclassification from the Visual Recognition Model
Chapter 6  Conclusions and Outlook
6-1  Conclusions
6-2  Outlook
References
References
[1] Han, J., & Kamber, M., Data Mining: Concepts and Techniques., 2nd Ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, 2000.
[2] Hastie, T., Tibshirani, R., & Friedman, J., The Elements of Statistical Learning: Data Mining, Inference and Prediction., Springer-Verlag, Berlin, Heidelberg, and New York, 2001.
[3] Witten, I. H., & Frank, E., Data Mining: Practical Machine Learning Tools and Techniques., 2nd Ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, 2005.
[4] Webb, A. R., Statistical Pattern Recognition., 2nd Ed., John Wiley & Sons, Chichester, England, 2002.
[5] Chawla, N. V., Japcowicz, N., & Kolcz, A., “Editorial: Special Issue on learning from imbalanced datasets”, ACM SIGKDD Explorations Newsletter, Vol. 6, no. 1, pp. 1-6, 2004.
[6] Batista, G. E. A. P. A., Prati, R. C., & Monard, M. C., “A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data”, ACM SIGKDD Explorations Newsletter, Vol. 6, no. 1, pp. 20-29, 2004.
[7] Visa, S., & Ralescu, A., “Issues in Mining Imbalanced Data Sets - A Review Paper”, In: Proceeding of the Sixteen Midwest Artificial Intelligence and Cognitive Science Conference, Dayton, Ohio, USA, pp. 67-73, 2005.
[8] Kotsiantis, S., Kanellopoulos, D., & Pintelas, P., “Handling imbalanced datasets: A review”, GESTS International Transactions on Computer Science and Engineering, Vol. 30, no. 1, pp. 25-36, 2006.
[9] Merz, C. J., & Murphy, P. M., UCI Repository of machine learning databases. University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[10] Vapnik, V. N., The Nature of Statistical Learning Theory., Springer-Verlag, Berlin Heidelberg, New York, 1995.
[11] Vapnik, V. N., “An Overview of Statistical Learning Theory”, IEEE Transactions on Neural Networks, Vol. 10, pp. 988-999, 1999.
[12] Duda, R. O., Hart, P. E., & Stork, D. G., Pattern classification., 2nd Ed., John Wiley & Sons, Inc., New York, 2001.
[13] Hsu, C. C., Wang, K. S., Chung, H. Y., & Chang, S. H., “Equation of SVM-rebalancing: the point-normal form of a plane for class imbalance problem”, Neural Computing and Applications, DOI https://doi.org/10.1007/s00521-018-3419-z, 2018. (Accepted)
[14] Provost, F., & Fawcett, T., “Robust Classification for Imprecise Environments”, Machine Learning, Vol. 42, no. 3, pp. 203–231, 2001.
[15] Wu, G., & Chang, E. Y., “Class-boundary alignment for imbalanced dataset learning”, In: Proceedings of the ICML’03 Workshop on Learning from Imbalanced Datasets, pp. 49-56, 2003.
[16] Veropoulos, K., Campbell, C., & Cristianini, N., “Controlling the sensitivity of support vector machines”, In: Proceedings of the International Joint Conference on AI, pp. 55-60, 1999.
[17] Chawla, N. V., Data mining and knowledge discovery handbook., Springer, Boston, MA, 2005.
[18] Akbani, R., Kwek, S., & Japkowicz, N., “Applying Support Vector Machines to imbalanced Datasets”, In: Proceedings 15th ECML, pp. 39-50, 2004.
[19] Yan, R., Liu, Y., & Jin, R., “On Predicting Rare Classes with SVM Ensembles in Scene Classification”, In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'03), Hong Kong, pp. 21-24, Apr. 2003.
[20] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P., “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of Artificial Intelligence Research, Vol. 16, pp. 321-357, 2002.
[21] Tang, Y., Zhang, Y.-Q., Chawla, N. V., & Krasser, S., “SVMs Modeling for Highly Imbalanced Classification”, IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol. 39, no. 1, pp. 281-288, 2009.
[22] Domingos, P., “MetaCost: A general method for making classifiers cost-sensitive”, In: proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, San Diego, CA: ACM Press, pp. 155-164, 1999.
[23] Tomek, I., “Two Modifications of CNN”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-6, pp. 769-772, 1976.
[24] Ho, T. K., "Random Decision Forest", In: proceedings of the 3rd Int′l Conf on Document Analysis and Recognition, Montreal, Canada, pp. 278-282, August, 1995.
[25] Wu, G., & Chang, E. Y., “KBA: Kernel Boundary Alignment Considering Imbalanced Data Distribution”, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, no. 6, pp. 786-795, 2005.
[26] Chandola, V., Banerjee, A., & Kumar, V., “Anomaly detection: A survey”, ACM Computing Surveys, Vol. 41, no. 3, pp.1-58, 2009.
[27] Zheng, Z., Wu, X., & Srihari, R., “Feature selection for text categorization on imbalanced Data”, ACM SIGKDD Explorations Newsletter, Vol. 6, no. 1, pp. 80-89, 2004.
[28] 鍾鴻源 & 何誌祥, “Bayesian Information Extraction Applied to the Class Imbalance Classification Problem of Support Vector Machines” (基於貝氏資訊之萃取應用於支持向量機之類不平衡分類問題), Master's thesis, National Central University, 2009. (in Chinese)
[29] Hsu, C. C., Wang, K. S., & Chang, S. H., “Bayesian decision theory for support vector machines: Imbalance measurement and feature optimization”, Expert Systems With Applications, Vol. 38, no. 5, pp. 4698-4704, May 2011.
[30] Chung, H. Y., Ho, C. H., & Hsu, C. C., “Support vector machines using Bayesian-based approach in the issue of unbalanced classifications”, Expert Systems With Applications, Vol. 38, no. 9, pp. 11447-11452, September 2011.
[31] Kubat, M., & Matwin, S., “Addressing the Curse of Imbalanced Training Sets: One-sided Selection”, In: Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, pp.179-186, 1997.
[32] Van Rijsbergen, C. J., Information Retrieval., 2nd Ed., Butterworths, London, U.K., 1979.
[33] Buckland, M., & Gey, F., “The relationship between Recall and Precision”, Journal of American Society for Information Science, Vol. 45, no. 1, pp. 12-19, 1994.
[34] Bradley, A. P., “The use of the area under the ROC curve in the evaluation of machine learning algorithms”, Pattern Recognition, Vol. 30, no. 7, pp. 1145-1159, Jul. 1997.
[35] Cieslak, D. A., & Chawla, N. V., “Learning Decision Trees for Unbalanced Data”, European Conference on Principles and Practice of Knowledge Discovery in Databases, Antwerp, Belgium, pp. 241-256, 2008.
[36] Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J., Least squares support vector machines., World Scientific Publishing Co. Pte. Ltd., Singapore, 2002.
[37] Anderson, D. R., Sweeney, D. J., & Williams, T. A., Statistics for Business and Economics., 8th Ed., Southwestern, Cincinnati, 2002.
[38] Vapnik, V. N., The Nature of Statistical Learning Theory., Springer-Verlag, Berlin Heidelberg, New York, 1995.
[39] Vapnik, V. N., “An Overview of Statistical Learning Theory”, IEEE Transactions on Neural Networks, Vol. 10, pp. 988-999, 1999.
[40] 蘇木春 & 張孝德, Machine Learning: Neural Networks, Fuzzy Systems, and Genetic Algorithms (機器學習:類神經網路、模糊系統以及基因演算法則), 4th Ed., 全華出版社, Taipei, 2016. (in Chinese)
[41] 葉怡成, Applications and Implementation of Neural Network Models (類神經網路模式應用與實作), 9th Ed., 儒林出版社, Taipei, 2009. (in Chinese)
[42] 邊肇祺, 張學工, et al. (Eds.), Pattern Recognition (模式識別), 2nd Ed., Tsinghua University Press, Beijing, 2000. (in Chinese)
[43] 周志華 & 王玨, Machine Learning and Its Applications (機器學習及其應用), Tsinghua University Press, Beijing, 2009. (in Chinese)
[44] Rokach, L., Pattern classification using ensemble methods., World Scientific Publishing Co. Pte. Ltd., Singapore, 2010.
[45] Joshi, M. V., “On evaluating performance of classifiers for rare classes”, the Second IEEE International Conference on Data Mining (ICDM′02), Washington, D. C., USA, pp. 641-644, 2002.
[46] Breiman, L., “Bias, Variance and Arcing Classifiers”, Technical Report 460, Statistics Department, University of California, Berkeley, 1996.
[47] 徐天祿, 陳俊言, 盧欣農, 許智誠, & 許哲彰, “Sensing Method for a Banknote Validator” (驗鈔機的感測方法), R.O.C. (Taiwan) Invention Patent No. I626625, published 2018. (in Chinese)
[48] Fischetti, M., “How Do Banknote Validators Recognize Counterfeits?” (驗鈔機如何認出假鈔?), translated by 鍾樹人, Scientific American (Traditional Chinese edition, 科學人雜誌), Yuan-Liou Publishing, No. 20, October 2003. (in Chinese)
[49] 朱昭蓉 & 錢迺文, “2018 International Banknote Conference” (2018年國際鈔券研討會), official overseas travel report, retrieved August 23, 2018, from https://report.nat.gov.tw/ReportFront/PageSystem/reportFileDownload/C10701146/001. (in Chinese)
[50] Weston, J., & Watkins, C., “Support Vector Machines for Multi-Class Pattern Recognition”, In: Proceedings of the Seventh European Symposium On Artificial Neural Networks, Bruges, Belgium, pp. 219-224, 1999.
[51] Krishnaiah, P. R., & Kanal, L. N., Classification, Pattern Recognition, and Reduction of Dimensionality., North-Holland Pub. Co., New York, 1982.
[52] Platt, J., Cristianini, N., & Shawe-Taylor, J., “Large margin DAGs for multiclass classification”, In: Advances in Neural Information Processing Systems, MIT Press, Cambridge, Massachusetts, pp. 547-553, 2000.
[53] Hsu, C. C., Wang, K. S., Chung, H. Y., & Chang, S. H., “A study of visual behavior of multidimensional scaling for kernel perceptron algorithm”, Neural Computing and Applications, Vol. 26, no. 3, pp. 679-691, 2015.
Advisor: Kuo-Shong Wang (王國雄)    Review Date: 2019-7-24
