Master's/Doctoral Thesis 108552001 Complete Metadata Record

DC Field / Language
DC.contributorDepartment of Computer Science and Information Engineering (In-service Master Program)zh_TW
DC.creator高璵勝zh_TW
DC.creatorYu-Sheng Kaoen_US
dc.date.accessioned2022-08-13T07:39:07Z
dc.date.available2022-08-13T07:39:07Z
dc.date.issued2022
dc.identifier.urihttp://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=108552001
dc.contributor.departmentDepartment of Computer Science and Information Engineering (In-service Master Program)zh_TW
DC.description國立中央大學zh_TW
DC.descriptionNational Central Universityen_US
dc.description.abstractIn the twenty-first century, with the advance of information technology, campus software and hardware facilities and teaching resources have been steadily enriched, and digital learning platforms have emerged with the support of technology, driving the integration of virtual and physical instruction. In recent years, catalyzed by the COVID-19 pandemic, the mode of instruction between teachers and students has shifted from physical classes to online teaching, gradually making online learning the norm. This has not only protected students' right to education during home quarantine, but has also made teaching interaction between teachers and students richer and more diverse. While students study, an online learning platform records learning-related activities, such as browsing course materials and taking online quizzes, in its logs; these logs form the students' learning history records on the platform and contain key information that influences learning performance. Therefore, if an effective method can be found to analyze the correlation between students' grades and their learning behaviors while using the online learning system, and to predict learning performance early, teachers can give early remedial guidance to students identified as high risk, intervene in their learning behaviors, and adjust teaching content appropriately to improve their learning outcomes. Many scholars currently judge whether students' learning performance is good or poor through data mining. However, when the students with poor performance constitute only a small minority of the samples, a traditional data mining classifier may classify every sample as a good performer. The main reason is that when a classifier is built from imbalanced data, the learning rules derived from the class with more samples are unfavorable to the class with few samples, so the probability that a poorly performing student is judged to perform well may rise sharply. For the imbalanced data problem, some studies suggest under-sampling, which reduces the number of samples in the majority class; however, deleting redundant data on a clustering basis may remove important information from the majority class. Other scholars therefore suggest over-sampling, which increases the number of samples in the minority class. The most commonly used method is the Synthetic Minority Oversampling Technique (SMOTE) proposed by Chawla et al., which randomly generates new minority class data points along the lines between neighboring minority class samples in the original data set, thereby increasing the number of minority class samples. In addition, Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li proposed the Adaptive Synthetic sampling approach (ADASYN) in 2008, which gives each minority class sample its own weight: the more majority class neighbors a minority class sample has, the higher its weight, and this weight determines how many synthetic samples are generated for that sample. Although SMOTE oversamples all minority samples, not all minority samples lack discriminability; only the minority samples mixed with majority samples are relatively hard to discriminate. After oversampling, minority samples near the class boundary mix with majority samples and easily introduce noise, and training on these boundary samples may cause majority class samples to be misclassified as minority class samples. Some studies therefore apply two data cleaning methods, Tomek Links and ENN, to remove overlapping data after sampling and thus improve the discriminability of minority samples. This study uses the learning history records of the Python programming course offered at National Central University in the first semester of academic year 110 (fall 2021) to predict students' performance on the final exam. In the serialized learning history data, the ratio of passing to failing students is highly skewed; training and prediction on such imbalanced data easily prevent the algorithms from learning effectively and lead to the accuracy paradox. To solve the imbalance of the learning history records and predict students' learning performance effectively at an early stage, this study applies over-sampling to balance the ratio between minority class and majority class samples, and then applies under-sampling to remove samples that overlap between classes, so that each sample's nearest neighbors belong to the same class, thereby improving the performance of the classification algorithms. Six algorithms, SVM, Logistic Regression, Random Forest, KNN, Naïve Bayes, and Decision Tree, are then used to predict learning performance, and their prediction results are compared. The study observes that, after processing with the different sampling methods, the data sets classified by each algorithm show clear improvements in Accuracy, Recall, F1-score, AUC, G-mean, and MCC compared with the unprocessed original data set. Processing imbalanced data with sampling methods effectively solves the classification problem of minority class samples, avoids the disastrous consequences that imbalance brings during training, and greatly improves the prediction accuracy of the algorithms.zh_TW
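The resampling procedure described in the abstract, over-sampling the minority class with SMOTE or ADASYN and then cleaning overlapping boundary samples with Tomek Links or ENN, can be illustrated with the open-source imbalanced-learn package. The thesis does not state which implementation it used, so the sketch below is only a minimal illustration under that assumption; the feature matrix X and the pass/fail labels y are synthetic placeholders rather than the actual learning-history data.

```python
# Minimal resampling sketch (assumes numpy and imbalanced-learn are installed).
# X and y are synthetic stand-ins for log-derived features and pass/fail labels.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.combine import SMOTETomek, SMOTEENN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 students x 10 behavior features
y = np.array([1] * 20 + [0] * 180)      # 1 = fail (minority), 0 = pass (majority)
print("original:", Counter(y))

# Over-sampling only: SMOTE interpolates new minority points between nearest
# minority neighbors; ADASYN synthesizes more points for minority samples that
# have many majority-class neighbors (i.e. gives them a higher weight).
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
X_ad, y_ad = ADASYN(random_state=42).fit_resample(X, y)

# Over-sampling followed by data cleaning: Tomek Links / ENN delete overlapping
# samples so that nearest neighbors tend to share the same class label.
X_st, y_st = SMOTETomek(random_state=42).fit_resample(X, y)
X_se, y_se = SMOTEENN(random_state=42).fit_resample(X, y)

print("SMOTE:", Counter(y_sm), "ADASYN:", Counter(y_ad))
print("SMOTE+Tomek:", Counter(y_st), "SMOTE+ENN:", Counter(y_se))
```

The combined SMOTETomek and SMOTEENN resamplers correspond to the over-sampling-then-cleaning strategy the abstract describes; the Counter output simply shows how each method changes the class distribution.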
dc.description.abstractWith the progress of information technology in the 21st century, campus software, hardware, and teaching resources have been steadily enriched, and e-learning platforms have emerged with the support of technology, promoting the integration of virtual and physical teaching. In recent years, due to the COVID-19 pandemic, the teaching mode between teachers and students has shifted from physical to online teaching, gradually making online learning the norm, which not only protected students' right to education during home quarantine but also made the teaching interaction between teachers and students richer and more diverse. While students learn, an online learning platform records learning-related activities, such as browsing teaching materials or taking online quizzes, in its logs; these logs constitute the students' learning history records on the platform and contain key information that affects learning results. Therefore, if an effective method can be found to analyze the correlation between students' academic performance and their learning behaviors while using the online learning system, and to predict learning results early, teachers can provide early remedial teaching to students identified as at risk, intervene in their learning behaviors, and properly adjust teaching content, thereby improving their learning results. Nowadays, most scholars identify students' learning results through data mining. However, when students with poor learning results form only a small minority of the samples, a traditional data mining classifier may classify all samples as good. The main reason is that when imbalanced data is used to construct a classifier, the learning rules derived from the majority class are unfavorable to the minority class, making it likely that a small number of students with poor learning results are judged as having good learning results. Some studies suggest under-sampling, which reduces the number of majority class samples, for the problem of imbalanced data; however, removing redundant data on a clustering basis may remove important information from the majority class. Other scholars therefore suggest over-sampling to increase the number of minority class samples. The most widely used method is the Synthetic Minority Oversampling Technique (SMOTE) proposed by Chawla et al., which randomly generates new minority class data points along the line segments between neighboring minority class samples in the original data set, thereby increasing the number of minority class samples. Additionally, Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li proposed the Adaptive Synthetic sampling approach (ADASYN) in 2008, which assigns each minority class sample its own weight: the more majority class neighbors a minority class sample has, the higher its weight, and this weight determines how many synthetic samples are generated for it. Although SMOTE oversamples all minority class samples, not all minority class samples lack discriminability; only those mixed with majority class samples are hard to discriminate. After oversampling, minority class samples close to the boundary mix with majority class samples, which easily produces noise, and training on these boundary samples may cause majority class samples to be misclassified as minority class samples. Therefore, two data cleaning methods, Tomek Links and ENN, have been used to remove overlapping data after sampling, so as to improve the discriminability of minority class samples. In this study, the learning history records of the Python programming course offered in the first semester of the 2021 academic year at National Central University were used to predict students' performance on the final exam. In the serialized learning history data, the proportions of passing and failing students are highly skewed; training and prediction on such imbalanced data tend to prevent the algorithms from learning effectively, leading to the accuracy paradox. To resolve the imbalance in the learning history records and effectively predict students' learning results at an early stage, this study applies over-sampling to balance the ratio of minority class to majority class samples, and then applies under-sampling to eliminate samples that overlap between classes, so that each sample's nearest neighbors belong to the same class, thereby improving the performance of the classification algorithms. Six algorithms, SVM, Logistic Regression, Random Forest, KNN, Naive Bayes, and Decision Tree, were then used to predict learning results, and their prediction results were compared. The experiments show that the data sets processed by the different sampling methods, after classification and prediction by each algorithm, achieve clear improvements in Accuracy, Recall, F1-score, AUC, G-mean, and MCC compared with the unprocessed original data sets. Processing imbalanced data with sampling methods effectively resolves the classification problem of minority class samples, avoids the disastrous consequences of imbalance during training, and significantly improves the prediction accuracy of the algorithms.en_US
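The modeling and evaluation stage summarized above, training SVM, Logistic Regression, Random Forest, KNN, Naive Bayes, and Decision Tree classifiers on resampled data and comparing Accuracy, Recall, F1-score, AUC, G-mean, and MCC, could be outlined as follows. This is an illustrative sketch only: the synthetic data, the 80/20 stratified split, the default hyperparameters, and the choice of SMOTE followed by ENN are assumptions made for the example, not the settings reported in the thesis.

```python
# Illustrative training/evaluation sketch with scikit-learn and imbalanced-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, recall_score, f1_score,
                             roc_auc_score, matthews_corrcoef)
from imblearn.metrics import geometric_mean_score
from imblearn.combine import SMOTEENN

# Synthetic imbalanced data: class 1 (~10%) stands in for the failing students.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1],
                           random_state=42)

# Split first, then resample only the training fold so the test set keeps the
# original imbalanced class distribution.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_tr, y_tr)

models = {
    "SVM": SVC(probability=True, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_res, y_res)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]   # probability of the minority class
    print(f"{name:20s}"
          f" Acc={accuracy_score(y_te, pred):.3f}"
          f" Recall={recall_score(y_te, pred):.3f}"
          f" F1={f1_score(y_te, pred):.3f}"
          f" AUC={roc_auc_score(y_te, proba):.3f}"
          f" G-mean={geometric_mean_score(y_te, pred):.3f}"
          f" MCC={matthews_corrcoef(y_te, pred):.3f}")
```

Resampling is applied to the training fold only, so the held-out test set keeps the original imbalanced distribution that the metrics are meant to reflect; the thesis compares several sampling strategies, whereas this sketch uses a single combination for brevity.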
DC.subjectOn-line Learningzh_TW
DC.subjectImbalanced datazh_TW
DC.subjectOver-samplingzh_TW
DC.subjectUnder-samplingzh_TW
DC.subjectlearning history recordzh_TW
DC.subjectlearning performance predictionzh_TW
DC.subjectOn-line Learningen_US
DC.subjectImbalanced dataen_US
DC.subjectOver-samplingen_US
DC.subjectUnder-samplingen_US
DC.subjectlearning history recorden_US
DC.subjectlearning performance predictionen_US
DC.titleEarly Prediction of Learning Results through Modeling and Sampling with Imbalanced Data and Machine Learningzh_TW
dc.language.isozh-TWzh-TW
DC.titleEarly Prediction of Learning Results through Modeling and Sampling with Imbalanced Data and Machine Learning.en_US
DC.typeMaster's/Doctoral thesiszh_TW
DC.typethesisen_US
DC.publisherNational Central Universityen_US
