Abstract (English)
With the progress of information technology in the 21st century, the software, hardware, and teaching resources available in schools have been steadily enriched, and e-learning platforms have emerged that integrate virtual and physical instruction. In recent years, the COVID-19 pandemic shifted teaching from physical classrooms to online instruction, gradually making online learning the norm. This not only preserved students' right to education while they were under home quarantine, but also made teacher-student interaction richer and more diverse. While students learn, an online learning platform records learning-related activities in its logs, such as viewing teaching materials or taking online quizzes; these logs constitute the students' learning process records on the platform and contain key information that affects learning outcomes. Therefore, if an effective method can be devised to analyze the correlation between students' learning behavior on an online learning system and their academic performance, and to predict learning outcomes, teachers can provide early remedial instruction to students identified as at risk, intervene in their learning behavior, and adjust the teaching content accordingly, thereby improving learning outcomes. Most scholars currently identify students' learning outcomes through data mining. However, when a traditional data-mining classifier is trained on data in which students with poor outcomes form a small minority class, it may classify all samples as good.
The main reason is that when a classifier is constructed from imbalanced data, its learning rules are dominated by the majority class and are unfavorable to the minority class, so the few students with poor learning outcomes may be judged as performing well. Some studies address imbalanced data with under-sampling, which reduces the number of majority-class samples; however, removing too much data, even on a cluster basis, may discard important information in the majority class. Other scholars therefore suggest over-sampling to increase the number of minority-class samples. The Synthetic Minority Oversampling Technique (SMOTE) proposed by Chawla et al. (2002) is the most widely used: it synthesizes new minority-class samples by randomly interpolating between a minority-class sample and one of its minority-class nearest neighbors. In addition, He et al. (2008) proposed the Adaptive Synthetic Sampling Approach (ADASYN), which assigns a weight to every minority-class sample: the more majority-class neighbors a minority-class sample has, the higher its weight, and the weights determine how many synthetic samples are generated for each minority-class sample. SMOTE oversamples all minority-class samples, but not all of them are equally informative; those mixed in among majority-class samples are the hardest to discriminate. After oversampling, minority-class samples near the class boundary become interleaved with majority-class samples, which easily produces noise; if such boundary samples are used for training, majority-class samples may be misclassified as minority class.
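As a rough illustration of SMOTE's interpolation step (a minimal sketch, not the implementation used in this study; the sample points and parameter names are hypothetical):

```python
import random

def smote(minority, k=3, n_new=4, seed=42):
    """Minimal SMOTE sketch: synthesize a new point by interpolating
    between a minority sample and one of its k nearest minority
    neighbours. ADASYN differs only in how many points each sample
    gets: samples with more majority-class neighbours get more."""
    rng = random.Random(seed)
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x, excluding x itself
        neighbours = sorted((p for p in minority if p != x),
                            key=lambda p: dist(p, x))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, k=2, n_new=3)
```

Because each synthetic point lies on the segment between two existing minority samples, the new points stay inside the region the minority class already occupies.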
Therefore, two data-cleaning methods, Tomek Links and the Edited Nearest Neighbours rule (ENN), are applied after sampling to remove overlapping data, so as to improve the separability of the minority class.
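A Tomek link is a pair of opposite-class samples that are each other's nearest neighbour; removing the majority-class member of each link cleans the class boundary. A minimal sketch of this idea, with made-up 1-D data (not the study's dataset):

```python
def nearest(i, points):
    """Index of the nearest other point (squared Euclidean distance)."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: d(points[i], points[j]))

def remove_tomek_links(points, labels, majority=0):
    """Drop the majority-class member of every Tomek link: a pair of
    opposite-class points that are mutual nearest neighbours."""
    drop = set()
    for i in range(len(points)):
        j = nearest(i, points)
        if labels[i] != labels[j] and nearest(j, points) == i:
            drop.add(i if labels[i] == majority else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]

# majority class 0 at 0.0, 1.0, 2.1; minority class 1 at 2.0, 5.0:
# the pair (2.1, 2.0) is a Tomek link, so 2.1 is removed
pts = [(0.0,), (1.0,), (2.1,), (2.0,), (5.0,)]
lbl = [0, 0, 0, 1, 1]
clean_pts, clean_lbl = remove_tomek_links(pts, lbl)
```

ENN works in the same spirit but removes any sample whose class disagrees with the majority of its k nearest neighbours, so it cleans more aggressively than Tomek-link removal.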
In this study, learning records from the Python programming course in the first semester of the 2021 academic year at National Central University were used to predict performance on the final exam. In the serialized learning-process data, the proportions of students who passed and who failed were highly disparate. Training and predicting on such imbalanced data can prevent an algorithm from learning effectively and leads to the accuracy paradox. In this study, the imbalanced data are over-sampled to balance the proportion of minority-class and majority-class samples, so that students' learning outcomes can be predicted effectively from the learning process records. In addition, overlapping samples between classes are removed by under-sampling, so that nearest-neighbor samples belong to the same class, which improves the performance of the classification algorithms. Six algorithms, SVM, Logistic Regression, Random Forest, KNN, Naive Bayes, and Decision Tree, were then used to predict learning outcomes, and their predictions were compared. The results show that, compared with the raw data, the data sets processed by the different sampling methods yield significantly better Accuracy, Recall, F1-score, AUC, G-mean, and MCC for every algorithm. Processing the imbalanced data with these sampling methods effectively solves the minority-class classification problem, avoids degenerate training, and significantly improves prediction accuracy.
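For reference, G-mean and MCC, which are less common than Accuracy or F1-score, can be computed directly from a binary confusion matrix. The counts below are illustrative only, not results from this study:

```python
import math

def gmean_mcc(tp, fp, fn, tn):
    """G-mean = sqrt(sensitivity * specificity);
    MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    sensitivity = tp / (tp + fn)   # recall on the positive (minority) class
    specificity = tn / (tn + fp)   # recall on the negative (majority) class
    gmean = math.sqrt(sensitivity * specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return gmean, mcc

# hypothetical counts: 8 failing students caught, 2 missed, 5 false alarms
g, m = gmean_mcc(tp=8, fp=5, fn=2, tn=85)
```

Both metrics penalize a classifier that ignores the minority class: a model predicting "pass" for everyone scores high Accuracy on imbalanced data but gets a G-mean and MCC of zero, which is why they are preferred here over Accuracy alone.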
References
Ali, M., Khattak, A. M., Ali, Z., Hayat, B., Idrees, M., Pervez, Z., Rizwan, K., Sung, T.-E., & Kim, K.-I. (2021). Estimation and interpretation of machine learning models with customized surrogate model. Electronics, 10(23), 3045.
Altınçay, H., & Ergün, C. (2004). Clustering based under-sampling for improving speaker verification decisions using AdaBoost. Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR).
Barandela, R., Sánchez, J. S., Garcıa, V., & Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3), 849-851.
Barros, T. M., Souza Neto, P. A., Silva, I., & Guedes, L. A. (2019). Predictive models for imbalanced data: a school dropout perspective. Education Sciences, 9(4), 275.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
Chen, Y., Hsieh, H., & Chen, N. (2003). Dynamic constructing decision rules from learning portfolio to support adaptive instruction. Institute of Information & Computing Machinery, 6(3), 11-24.
Drummond, C., & Holte, R. C. (2003). C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. Workshop on learning from imbalanced datasets II.
Estabrooks, A., Jo, T., & Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Computational intelligence, 20(1), 18-36.
Farquad, M. A. H., & Bose, I. (2012). Preprocessing unbalanced data using support vector machine. Decision Support Systems, 53(1), 226-233.
Galpert, D., Del Río, S., Herrera, F., Ancede-Gallardo, E., Antunes, A., & Agüero-Chapin, G. (2015). An effective big data supervised imbalanced classification approach for ortholog detection in related yeast species. BioMed research international, 2015.
Ganganwar, V. (2012). An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4), 42-47.
Ge, S., Ye, J., & He, M. (2019). Prediction model of user purchase behavior based on deep forest. Computer Science, 46(09), 190-194.
Guo, H., & Viktor, H. L. (2004). Learning from imbalanced data sets with boosting and data generation: the databoost-im approach. ACM Sigkdd Explorations Newsletter, 6(1), 30-39.
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert systems with applications, 73, 220-239.
Hasib, K. M., Iqbal, M., Shah, F. M., Mahmud, J. A., Popel, M. H., Showrov, M., Hossain, I., Ahmed, S., & Rahman, O. (2020). A survey of methods for managing the classification and solution of data imbalance problem. arXiv preprint arXiv:2012.11870.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence).
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9), 1263-1284.
Kang, Q., Chen, X., Li, S., & Zhou, M. (2016). A noise-filtered under-sampling scheme for imbalanced classification. IEEE transactions on cybernetics, 47(12), 4263-4274.
Karakoulas, G., & Shawe-Taylor, J. (1998). Optimizing classifiers for imbalanced training sets. Advances in neural information processing systems, 11.
Khalilia, M., Chakraborty, S., & Popescu, M. (2011). Predicting disease risks from highly imbalanced data using random forest. BMC medical informatics and decision making, 11(1), 1-13.
Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced data sets: One-sided sampling. Proceedings of the fourteenth international conference on machine learning.
Li, D.-C., Chen, C.-C., Chang, C.-J., & Lin, W.-K. (2012). A tree-based-trend-diffusion prediction procedure for small sample sets in the early stages of manufacturing systems. Expert Systems with Applications, 39(1), 1575-1581.
Li, D.-C., Lin, L.-S., & Peng, L.-J. (2014). Improving learning accuracy by using synthetic samples for small datasets with non-linear attribute dependency. Decision Support Systems, 59, 286-295.
Li, D.-C., Liu, C.-W., & Chen, W.-C. (2012). A multi-model approach to determine early manufacturing parameters for small-data-set prediction. International journal of production research, 50(23), 6679-6690.
Li, S., & Liu, T. (2021). Performance prediction for higher education students using deep learning. Complexity, 2021.
Liu, X.-Y., Wu, J., & Zhou, Z.-H. (2008). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
Lu, O. H., Huang, A. Y., Kuo, C.-Y., Chen, I. Y., & Yang, S. J. (2020). Sequence Pattern Mining for the Identification of Reading Behavior based on SQ3R Reading Strategy.
Mani, I., & Zhang, I. (2003). kNN approach to unbalanced data distributions: a case study involving information extraction. Proceedings of workshop on learning from imbalanced datasets.
Pan, J., Sheng, W., & Dey, S. (2019). Order matters at fanatics recommending sequentially ordered products by LSTM embedded with Word2Vec. arXiv preprint arXiv:1911.09818.
Patil, A. P., Ganesan, K., & Kanavalli, A. (2017). Effective deep learning model to predict student grade point averages. 2017 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC).
Qing, Z., Zeng, Q., Wang, H., Liu, Y., Xiong, T., & Zhang, S. (2022). ADASYN-LOF Algorithm for Imbalanced Tornado Samples. Atmosphere, 13(4), 544.
Shelke, M. S., Deshmukh, P. R., & Shandilya, V. K. (2017). A review on imbalanced data handling using undersampling and oversampling technique. Int. J. Recent Trends Eng. Res, 3(4), 444-449.
Sun, Y., Wong, A. K., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International journal of pattern recognition and artificial intelligence, 23(04), 687-719.
Wang, J., Zhao, C., He, S., Gu, Y., Alfarraj, O., & Abugabah, A. (2022). LogUAD: Log Unsupervised Anomaly Detection Based on Word2Vec. Comput. Syst. Sci. Eng., 41(3), 1207-1222.
Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics(3), 408-421.
Yen, S.-J., & Lee, Y.-S. (2006). Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In Intelligent Control and Automation (pp. 731-740). Springer.
Yoon, K., & Kwek, S. (2007). A data reduction approach for resolving the imbalanced data issue in functional genomics. Neural Computing and Applications, 16(3), 295-306.
Zhang, H., & Wang, Z. (2011). A normal distribution-based over-sampling approach to imbalanced data classification. International conference on advanced data mining and applications.
Zhang, Y.-P., Zhang, L.-N., & Wang, Y.-C. (2010). Cluster-based majority under-sampling approaches for class imbalance learning. 2010 2nd IEEE International Conference on Information and Financial Engineering.
江羿臻, & 林正昌. (2014). 應用決策樹探討中學生學習成就的相關因素 [Applying Decision Tree to Investigate High School Students′ Learning Achievement Factors]. 教育心理學報, 45(3), 303-327.
胡詠翔. (2019). 大規模開放線上課程學習分析促進科技學科教學知識之研究 [Applying Learning Analytics to Enhance the Technological Pedagogical Content Knowledge of Teachers Teaching Massive Open Online Courses]. 教學實踐與創新, 2(1), 77-114.