Abstract (English)
With the progress of information technology in the 21st century, the software, hardware, and teaching resources available in schools have been steadily enriched, and e-learning platforms have emerged that integrate virtual and physical instruction. In recent years, the COVID-19 pandemic shifted teaching from physical classrooms to online instruction, gradually making online learning the norm. This not only preserved students' right to education while they were under home quarantine, but also made teacher-student interaction richer and more diverse. While students learn, an online learning platform records learning-related activities in its logs, such as viewing teaching materials or taking online quizzes; these logs constitute the students' learning process records on the platform and contain key information that affects learning outcomes. Therefore, if an effective method can be devised to analyze the correlation between students' learning behavior on an online learning system and their academic performance, and to predict learning outcomes, teachers can provide early remedial instruction to students identified as at risk, intervene in their learning behavior, and adjust the teaching content accordingly, thereby improving learning outcomes. Most scholars currently identify students' learning outcomes through data mining. However, when a traditional data-mining classifier is trained on data in which students with poor outcomes form a small minority class, it may classify all samples as good.
The main reason is that when a classifier is constructed from imbalanced data, its learning rules are dominated by the majority class and are unfavorable to the minority class, so the few students with poor learning outcomes may be judged as performing well. Some studies address imbalanced data with under-sampling, which reduces the number of majority-class samples; however, removing too much data, even on a cluster basis, may discard important information in the majority class. Other scholars therefore suggest over-sampling to increase the number of minority-class samples. The Synthetic Minority Oversampling Technique (SMOTE) proposed by Chawla et al. (2002) is the most widely used: it synthesizes new minority-class samples by randomly interpolating between a minority-class sample and one of its minority-class nearest neighbors. In addition, He et al. (2008) proposed the Adaptive Synthetic Sampling Approach (ADASYN), which assigns a weight to every minority-class sample: the more majority-class neighbors a minority-class sample has, the higher its weight, and the weights determine how many synthetic samples are generated for each minority-class sample. SMOTE oversamples all minority-class samples, but not all of them are equally informative; those mixed in among majority-class samples are the hardest to discriminate. After oversampling, minority-class samples near the class boundary become interleaved with majority-class samples, which easily produces noise; if such boundary samples are used for training, majority-class samples may be misclassified as minority class.
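As a rough illustration of SMOTE's interpolation step (a minimal sketch, not the implementation used in this study; the sample points and parameter names are hypothetical):

```python
import random

def smote(minority, k=3, n_new=4, seed=42):
    """Minimal SMOTE sketch: synthesize a new point by interpolating
    between a minority sample and one of its k nearest minority
    neighbours. ADASYN differs only in how many points each sample
    gets: samples with more majority-class neighbours get more."""
    rng = random.Random(seed)
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x, excluding x itself
        neighbours = sorted((p for p in minority if p != x),
                            key=lambda p: dist(p, x))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, k=2, n_new=3)
```

Because each synthetic point lies on the segment between two existing minority samples, the new points stay inside the region the minority class already occupies.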
Therefore, two data-cleaning methods, Tomek Links and the Edited Nearest Neighbours rule (ENN), are applied after sampling to remove overlapping data, so as to improve the separability of the minority class.
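A Tomek link is a pair of opposite-class samples that are each other's nearest neighbour; removing the majority-class member of each link cleans the class boundary. A minimal sketch of this idea, with made-up 1-D data (not the study's dataset):

```python
def nearest(i, points):
    """Index of the nearest other point (squared Euclidean distance)."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: d(points[i], points[j]))

def remove_tomek_links(points, labels, majority=0):
    """Drop the majority-class member of every Tomek link: a pair of
    opposite-class points that are mutual nearest neighbours."""
    drop = set()
    for i in range(len(points)):
        j = nearest(i, points)
        if labels[i] != labels[j] and nearest(j, points) == i:
            drop.add(i if labels[i] == majority else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]

# majority class 0 at 0.0, 1.0, 2.1; minority class 1 at 2.0, 5.0:
# the pair (2.1, 2.0) is a Tomek link, so 2.1 is removed
pts = [(0.0,), (1.0,), (2.1,), (2.0,), (5.0,)]
lbl = [0, 0, 0, 1, 1]
clean_pts, clean_lbl = remove_tomek_links(pts, lbl)
```

ENN works in the same spirit but removes any sample whose class disagrees with the majority of its k nearest neighbours, so it cleans more aggressively than Tomek-link removal.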
In this study, learning records from the Python programming course in the first semester of the 2021 academic year at National Central University were used to predict performance on the final exam. In the serialized learning-process data, the proportions of students who passed and who failed were highly disparate. Training and predicting on such imbalanced data can prevent an algorithm from learning effectively and leads to the accuracy paradox. In this study, the imbalanced data are over-sampled to balance the proportion of minority-class and majority-class samples, so that students' learning outcomes can be predicted effectively from the learning process records. In addition, overlapping samples between classes are removed by under-sampling, so that nearest-neighbor samples belong to the same class, which improves the performance of the classification algorithms. Six algorithms, SVM, Logistic Regression, Random Forest, KNN, Naive Bayes, and Decision Tree, were then used to predict learning outcomes, and their predictions were compared. The results show that, compared with the raw data, the data sets processed by the different sampling methods yield significantly better Accuracy, Recall, F1-score, AUC, G-mean, and MCC for every algorithm. Processing the imbalanced data with these sampling methods effectively solves the minority-class classification problem, avoids degenerate training, and significantly improves prediction accuracy.
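For reference, G-mean and MCC, which are less common than Accuracy or F1-score, can be computed directly from a binary confusion matrix. The counts below are illustrative only, not results from this study:

```python
import math

def gmean_mcc(tp, fp, fn, tn):
    """G-mean = sqrt(sensitivity * specificity);
    MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    sensitivity = tp / (tp + fn)   # recall on the positive (minority) class
    specificity = tn / (tn + fp)   # recall on the negative (majority) class
    gmean = math.sqrt(sensitivity * specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return gmean, mcc

# hypothetical counts: 8 failing students caught, 2 missed, 5 false alarms
g, m = gmean_mcc(tp=8, fp=5, fn=2, tn=85)
```

Both metrics penalize a classifier that ignores the minority class: a model predicting "pass" for everyone scores high Accuracy on imbalanced data but gets a G-mean and MCC of zero, which is why they are preferred here over Accuracy alone.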
References
Ali, M., Khattak, A. M., Ali, Z., Hayat, B., Idrees, M., Pervez, Z., Rizwan, K., Sung, T.-E., & Kim, K.-I. (2021). Estimation and interpretation of machine learning models with customized surrogate model. Electronics, 10(23), 3045.
Altınçay, H., & Ergün, C. (2004). Clustering based under-sampling for improving speaker verification decisions using AdaBoost. Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR).
Barandela, R., Sánchez, J. S., Garcıa, V., & Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3), 849-851.
Barros, T. M., Souza Neto, P. A., Silva, I., & Guedes, L. A. (2019). Predictive models for imbalanced data: a school dropout perspective. Education Sciences, 9(4), 275.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
Chen, Y., Hsieh, H., & Chen, N. (2003). Dynamic constructing decision rules from learning portfolio to support adaptive instruction. Institute of Information & Computing Machinery, 6(3), 11-24.
Drummond, C., & Holte, R. C. (2003). C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. Workshop on learning from imbalanced datasets II.
Estabrooks, A., Jo, T., & Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Computational intelligence, 20(1), 18-36.
Farquad, M. A. H., & Bose, I. (2012). Preprocessing unbalanced data using support vector machine. Decision Support Systems, 53(1), 226-233.
Galpert, D., Del Río, S., Herrera, F., Ancede-Gallardo, E., Antunes, A., & Agüero-Chapin, G. (2015). An effective big data supervised imbalanced classification approach for ortholog detection in related yeast species. BioMed research international, 2015.
Ganganwar, V. (2012). An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4), 42-47.
Ge, S., Ye, J., & He, M. (2019). Prediction model of user purchase behavior based on deep forest. Computer Science, 46(09), 190-194.
Guo, H., & Viktor, H. L. (2004). Learning from imbalanced data sets with boosting and data generation: the databoost-im approach. ACM Sigkdd Explorations Newsletter, 6(1), 30-39.
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert systems with applications, 73, 220-239.
Hasib, K. M., Iqbal, M., Shah, F. M., Mahmud, J. A., Popel, M. H., Showrov, M., Hossain, I., Ahmed, S., & Rahman, O. (2020). A survey of methods for managing the classification and solution of data imbalance problem. arXiv preprint arXiv:2012.11870.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence).
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9), 1263-1284.
Kang, Q., Chen, X., Li, S., & Zhou, M. (2016). A noise-filtered under-sampling scheme for imbalanced classification. IEEE transactions on cybernetics, 47(12), 4263-4274.
Karakoulas, G., & Shawe-Taylor, J. (1998). Optimizing classifiers for imbalanced training sets. Advances in neural information processing systems, 11.
Khalilia, M., Chakraborty, S., & Popescu, M. (2011). Predicting disease risks from highly imbalanced data using random forest. BMC medical informatics and decision making, 11(1), 1-13.
Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced data sets: One-sided sampling. Proceedings of the fourteenth international conference on machine learning.
Li, D.-C., Chen, C.-C., Chang, C.-J., & Lin, W.-K. (2012). A tree-based-trend-diffusion prediction procedure for small sample sets in the early stages of manufacturing systems. Expert Systems with Applications, 39(1), 1575-1581.
Li, D.-C., Lin, L.-S., & Peng, L.-J. (2014). Improving learning accuracy by using synthetic samples for small datasets with non-linear attribute dependency. Decision Support Systems, 59, 286-295.
Li, D.-C., Liu, C.-W., & Chen, W.-C. (2012). A multi-model approach to determine early manufacturing parameters for small-data-set prediction. International journal of production research, 50(23), 6679-6690.
Li, S., & Liu, T. (2021). Performance prediction for higher education students using deep learning. Complexity, 2021.
Liu, X.-Y., Wu, J., & Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
Lu, O. H., Huang, A. Y., Kuo, C.-Y., Chen, I. Y., & Yang, S. J. (2020). Sequence Pattern Mining for the Identification of Reading Behavior based on SQ3R Reading Strategy.
Mani, I., & Zhang, I. (2003). kNN approach to unbalanced data distributions: a case study involving information extraction. Proceedings of workshop on learning from imbalanced datasets.
Pan, J., Sheng, W., & Dey, S. (2019). Order matters at fanatics recommending sequentially ordered products by LSTM embedded with Word2Vec. arXiv preprint arXiv:1911.09818.
Patil, A. P., Ganesan, K., & Kanavalli, A. (2017). Effective deep learning model to predict student grade point averages. 2017 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC).
Qing, Z., Zeng, Q., Wang, H., Liu, Y., Xiong, T., & Zhang, S. (2022). ADASYN-LOF Algorithm for Imbalanced Tornado Samples. Atmosphere, 13(4), 544.
Shelke, M. S., Deshmukh, P. R., & Shandilya, V. K. (2017). A review on imbalanced data handling using undersampling and oversampling technique. Int. J. Recent Trends Eng. Res, 3(4), 444-449.
Sun, Y., Wong, A. K., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International journal of pattern recognition and artificial intelligence, 23(04), 687-719.
Wang, J., Zhao, C., He, S., Gu, Y., Alfarraj, O., & Abugabah, A. (2022). LogUAD: Log Unsupervised Anomaly Detection Based on Word2Vec. Comput. Syst. Sci. Eng., 41(3), 1207-1222.
Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics(3), 408-421.
Yen, S.-J., & Lee, Y.-S. (2006). Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In Intelligent Control and Automation (pp. 731-740). Springer.
Yoon, K., & Kwek, S. (2007). A data reduction approach for resolving the imbalanced data issue in functional genomics. Neural Computing and Applications, 16(3), 295-306.
Zhang, H., & Wang, Z. (2011). A normal distribution-based over-sampling approach to imbalanced data classification. International conference on advanced data mining and applications.
Zhang, Y.-P., Zhang, L.-N., & Wang, Y.-C. (2010). Cluster-based majority under-sampling approaches for class imbalance learning. 2010 2nd IEEE International Conference on Information and Financial Engineering.
江羿臻, & 林正昌. (2014). 應用決策樹探討中學生學習成就的相關因素 [Applying Decision Tree to Investigate High School Students′ Learning Achievement Factors]. 教育心理學報, 45(3), 303-327.
胡詠翔. (2019). 大規模開放線上課程學習分析促進科技學科教學知識之研究 [Applying Learning Analytics to Enhance the Technological Pedagogical Content Knowledge of Teachers Teaching Massive Open Online Courses]. 教學實踐與創新, 2(1), 77-114.