單一與並列式集成特徵選取方法於多分類類別不平衡問題之研究;Comparison of single feature selection and ensemble feature selection in multi-class imbalanced classification


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/93183

Title:	單一與並列式集成特徵選取方法於多分類類別不平衡問題之研究; Comparison of single feature selection and ensemble feature selection in multi-class imbalanced classification
Author:	陳冠蓁; Chen, Kuan-Chen
Contributor:	Department of Information Management (資訊管理學系)
Keywords:	multiclass feature selection; class imbalance; ensemble learning; classification
Date:	2023-07-17
Uploaded:	2024-09-19 16:46:30 (UTC+8)
Publisher:	National Central University (國立中央大學)
Abstract:	With the spread of personal devices and embedded systems in recent years, data gathered over the network from around the world accumulates into high-dimensional datasets, and a single dataset can reach several petabytes. Such massive data opens new business opportunities, but its high dimensionality also causes problems for many companies: the volume requires more storage, and building data mining models on high-dimensional data lengthens training time and can degrade model performance. Feature selection, a widely used data preprocessing technique, avoids these problems by reducing the dimensionality of the data. This study therefore focuses on feature selection and aims to identify the best feature selection method for different high-dimensional datasets.
In classification, most existing feature selection studies address binary problems, yet multi-class problems also arise in the real world. In the literature on multi-class feature selection, few studies apply all three types of technique (filter, wrapper, and embedded) together, and none combines single feature selection methods into a parallel ensemble. This study uses ten high-dimensional, multi-class, class-imbalanced datasets and applies three types of single feature selection method, comprising six filter, five wrapper, and four embedded methods, together with parallel ensembles of them. To handle class imbalance, it applies SMOTE, a data-level method that oversamples the minority classes, to balance the sample distribution. Average accuracy, average area under the ROC curve (AUC), and computation time are recorded to determine which combination of feature selection methods performs best. The experimental results suggest using SMOTE for multi-class imbalanced datasets. With an SVM classifier, embedded feature selection using the union of Lasso and XGBoost, with SMOTE applied before feature selection, gives the highest average accuracy; embedded feature selection using the union of Lasso, RandomForest, and XGBoost, with feature selection applied before SMOTE, gives the highest average AUC.
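The union-style parallel ensemble mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the feature names, the importance scores, the top-k cutoff, and the helper names `top_k` and `parallel_ensemble` are all assumptions made for illustration, and the intersection mode is included only as a common alternative that the abstract does not mention.

```python
from typing import Dict, List, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the k features with the largest absolute score (ties broken by name)."""
    ranked = sorted(scores, key=lambda f: (-abs(scores[f]), f))
    return set(ranked[:k])


def parallel_ensemble(selections: List[Set[str]], mode: str = "union") -> Set[str]:
    """Combine feature subsets chosen independently by several selectors."""
    if not selections:
        return set()
    if mode == "union":
        # Keep a feature if any selector chose it.
        return set().union(*selections)
    if mode == "intersection":
        # Keep a feature only if every selector chose it (illustrative alternative).
        return set.intersection(*selections)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical importance scores standing in for two embedded selectors.
lasso_scores = {"f1": 0.9, "f2": 0.0, "f3": -0.4, "f4": 0.1}
xgb_scores = {"f1": 0.5, "f2": 0.3, "f3": 0.05, "f4": 0.15}

lasso_top = top_k(lasso_scores, 2)
xgb_top = top_k(xgb_scores, 2)

union_features = parallel_ensemble([lasso_top, xgb_top], mode="union")
common_features = parallel_ensemble([lasso_top, xgb_top], mode="intersection")
print(sorted(union_features), sorted(common_features))
```

In a full pipeline the combined subset would be used to filter the training data before fitting the classifier, with SMOTE applied either before or after this step, which is the ordering the abstract compares.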
Appears in collections:	[Graduate Institute of Information Management (資訊管理研究所)] Theses and Dissertations

