Abstract (English)
In the financial, telecommunications, and medical industries, classification problems are ubiquitous. For example, a bank may predict a depositor's credit rating from input variables such as age, annual income, education, and repayment history, where the response is qualitative. More and more deep learning models are being developed for such purposes, reflecting the importance of classification problems. On the other hand, as data sizes grow rapidly while computing resources remain limited, various data reduction methods have been proposed. In this thesis, we use the concept of data reduction to develop a classification predictor. We illustrate the proposed method through simulations and real examples.
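To make the idea concrete, the sketch below illustrates the general workflow the abstract describes: reduce a large data set to a small subset, then fit a classifier on the subset only. This is a minimal illustration, not the thesis's actual method; uniform random subsampling and a nearest-centroid classifier are stand-ins chosen for simplicity, and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class "big" data: class 0 centered at -1, class 1 at +1.
n, p = 10_000, 4
y_full = rng.integers(0, 2, size=n)
X_full = rng.normal(size=(n, p)) + (2 * y_full[:, None] - 1)

# Data reduction step: keep only a small uniform subsample.
m = 200
idx = rng.choice(n, size=m, replace=False)
X_sub, y_sub = X_full[idx], y_full[idx]

# Fit a simple nearest-centroid classifier on the reduced data alone.
centroids = np.array([X_sub[y_sub == c].mean(axis=0) for c in (0, 1)])

def predict(X):
    # Assign each point to the class whose centroid is closest.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Evaluate the subset-trained predictor on the full data set.
acc = (predict(X_full) == y_full).mean()
print(f"accuracy on full data: {acc:.3f}")
```

Despite training on only 2% of the observations, the classifier recovers the well-separated class structure; more sophisticated reduction schemes (e.g., the subdata-selection methods in the references) aim to choose the subset more informatively than uniform sampling does.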
References
Chenlu Shi and Boxin Tang (2021). Model-robust subdata selection for big data, Journal of Statistical Theory and Practice, 15(82).
Elizabeth D. Schifano, Jing Wu, Chun Wang, Jun Yan, and Ming-Hui Chen (2016). Online updating of statistical inference in the big data setting, Technometrics, 58(3), 393–403.
Erchin Serpedin, Thomas Chen, and Dinesh Rajan (2012). Mathematical Foundations for Signal Processing, Communications, and Networking, CRC Press, 381–385.
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013). An Introduction to Statistical Learning: with Applications in R, Springer, New York, NY.
HaiYing Wang, Min Yang, and John Stufken (2018). Information-based optimal subdata selection for big data linear regression, Journal of the American Statistical Association, 114(525), 393–405.
HaiYing Wang, Rong Zhu, and Ping Ma (2018). Optimal subsampling for large sample logistic regression, Journal of the American Statistical Association, 113(522), 829–844.
Leo Breiman (2001). Random forests, Machine Learning, 45, 5–32.
Lin Wang, Jake Elmstedt, Weng Kee Wong, and Hongquan Xu (2021). Orthogonal subsampling for big data linear regression, Annals of Applied Statistics, 15(3), 1273–1290.
Nan Lin and Ruibin Xi (2011). Aggregated estimating equation estimation, Statistics and Its Interface, 4(1), 73–83.
Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan (2006). Sampling algorithms for l2 regression and applications, SODA '06: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1127–1136.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition), Springer-Verlag.
V. Roshan Joseph and Akhil Vakayil (2021). SPlit: an optimal method for data splitting, Technometrics, 64(2), 166–176.
V. Roshan Joseph and Simon Mak (2021). Supervised compression of big data, Statistical Analysis and Data Mining, 14(3), 217–229.
William Fithian and Trevor Hastie (2014). Local case-control sampling: efficient subsampling in imbalanced data sets, Annals of Statistics, 42(5), 1693–1724.
Yaqiong Yao and HaiYing Wang (2020). A review on optimal subsampling methods for massive datasets, Journal of Data Science, 19(1), 151–172.
Yaqiong Yao and Ying Wang (2021). A selective review on statistical techniques for big data, Modern Statistical Methods for Health Research, 223–245.
Zizhu Fan, Yong Xu, and David Zhang (2011). Local linear discriminant analysis framework using sample neighbors, IEEE Transactions on Neural Networks, 22(7), 1119–1132.