dc.description.abstract | In recent years, personal devices and embedded systems have become prevalent, and high-dimensional data is collected from all over the world through the Internet; a single dataset may reach several petabytes. This abundance of data creates many new business opportunities, but its high dimensionality also troubles many companies: the sheer volume of data demands more storage space, and data mining models built on high-dimensional data take a long time to train and may learn poorly. To avoid these problems, feature selection, a technique commonly used in data preprocessing, can be applied to reduce the data dimension. Feature selection is therefore the focus of this study, which explores the best feature selection methods for different high-dimensional datasets. Ensemble feature selection in high dimension, low sample size datasets: Parallel and serial combination approaches
In classification problems, most current research on feature selection targets binary classification, but multi-class classification problems must also be dealt with in the real world. In the literature on multi-class feature selection, few methods apply all three types of feature selection (filter, wrapper, and embedded), and no parallel ensemble feature selection technique has been combined with single multi-class feature selection methods.
This study applies single feature selection methods of all three types to ten high-dimensional imbalanced datasets: six filter methods, five wrapper methods, and four embedded methods. To address class imbalance, SMOTE is applied at the data level to balance the samples, and the average accuracy, average area under the ROC curve (AUC), and computing time are recorded.
The experimental results suggest using the SMOTE method for multi-class imbalanced datasets. Furthermore, with embedded feature selection and the SVM classifier, applying SMOTE first and then selecting features, the Lasso+XGBoost combination yields the highest average prediction accuracy; with the same classifier and embedded selection, selecting features first and then applying SMOTE, the union of Lasso+RandomForest+XGBoost yields the highest average area under the ROC curve. | en_US |
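The SMOTE-first serial combination described above can be sketched as follows. This is a minimal illustration under assumed inputs, not the study's implementation: the data is synthetic, the SMOTE step is a hand-rolled nearest-neighbour interpolation rather than the imbalanced-learn library, and scikit-learn's GradientBoostingClassifier stands in for XGBoost to avoid an extra dependency.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def smote_like(X_min, n_new, k=5):
    """Minimal SMOTE-style oversampling: synthesize n_new points by
    interpolating between a random minority sample and one of its
    k nearest minority-class neighbours."""
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # exclude self-matches
    nn = np.argsort(dist, axis=1)[:, :k]           # k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                   # interpolation factor
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Hypothetical imbalanced 3-class data (not the study's ten datasets).
X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           n_classes=3, n_clusters_per_class=1,
                           weights=[0.7, 0.2, 0.1], random_state=0)

# Step 1: SMOTE first -- bring every minority class up to the majority size.
classes, counts = np.unique(y, return_counts=True)
parts_X, parts_y = [X], [y]
for c in classes:
    deficit = counts.max() - (y == c).sum()
    if deficit > 0:
        parts_X.append(smote_like(X[y == c], deficit))
        parts_y.append(np.full(deficit, c))
X_bal, y_bal = np.vstack(parts_X), np.concatenate(parts_y)

# Step 2: then select features -- union of two embedded selectors
# (Lasso coefficients and tree-based importances).
lasso_sel = SelectFromModel(Lasso(alpha=0.01)).fit(X_bal, y_bal)
tree_sel = SelectFromModel(GradientBoostingClassifier(random_state=0)).fit(X_bal, y_bal)
mask = lasso_sel.get_support() | tree_sel.get_support()

# Step 3: train the final SVM on the selected feature subset.
clf = SVC().fit(X_bal[:, mask], y_bal)
```

Reversing steps 1 and 2 gives the select-first-then-SMOTE variant that the abstract reports as best by average AUC.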