Thesis/Dissertation 109453032: Detailed Record




Author 王耘 (Yun Wang)   Department Department of Information Management, In-service Master's Program
Thesis title 應用機器學習建立單位健保欠費催繳後繳納預測模型
(Using Machine Learning to build a Prediction Model for NHI Premium Payment after Arrear Reminder of Insured Units)
Full-text access Never open to the public
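Abstract (Chinese) To keep National Health Insurance (NHI) sustainable, everyone who qualifies for coverage should enroll and bear the obligation to pay premiums. The enrollment rate has reached 99.9%, yet the premium collection rate is lower, which creates an unfair situation in which some insured parties are covered but do not pay. Arrears should therefore be handled actively: with limited administrative funds, resources should be used effectively to maximize premium recovery and to make every insured party honor the obligation to pay premiums once enrolled.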
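  This study therefore aims to use machine learning to identify precisely the targets for which measures to raise the premium collection rate can be applied effectively. The study subject is the arrears data of insured units for premium year 2019 from the Northern Division of the National Health Insurance Administration (NHIA), which serves as the training dataset for the prediction model. Models are built both without dimension reduction and with dimension reduction by feature selection (information gain, genetic algorithm). The analysis uses 22 dimensions: 3 arrears features, 13 unit features, and 6 features of the person in charge. Single classifiers (CART decision tree, multilayer perceptron, support vector machine) and ensemble learning methods (random forest, Bagging, and AdaBoost) are then each used to build a model that predicts whether an insured unit pays its NHI arrears after a reminder.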
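  The model predicts whether an insured unit, after receiving an arrears reminder, pays within one year after the grace period. Based on the model, the study proposes an improved, more targeted reminder strategy: for arrears predicted to remain unpaid within that period whose reminder would originally have been sent by ordinary mail, the reminder is sent directly by double registered mail instead. This not only saves the postage for ordinary mail but, more importantly, brings the delivery of the double registered mail forward by at least four months, which speeds up the subsequent administrative enforcement process and helps secure priority in repayment. Strengthening the pre-enforcement reminder process for such cases thus raises the probability of recovering the arrears claim and lets the agency act early to secure repayment from the insured unit in arrears.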
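  Comparing the classifiers by area under the ROC curve (AUC) and model building time, random forest performs best, followed in order by Boosting with CART, Bagging with CART, and the single CART classifier, which shows that ensemble learning is indeed more effective than a single classifier. For the random forest model, the AUC reaches 0.974 whether dimensions are not reduced, reduced by information gain, or reduced by genetic algorithm, indicating excellent discrimination, and a t-test finds no significant difference among the three. The multilayer perceptron and support vector machine, constrained by the relatively large dataset in this study, have comparatively lower AUC values and longer model building times, so they perform worse on this dataset.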
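  To further verify how well each model predicts data from a new year, the arrears data of insured units for premium months January and February 2020, observed until the premium payment grace period of April 15, 2021, are used as the test dataset. Among the random forest models, the one with dimensions reduced by information gain achieves the best AUC, 0.828, which still indicates good discrimination. Although this is only slightly higher than the 0.827 obtained without dimension reduction, feature selection reduces the dimensions, which both saves storage space and makes model building somewhat faster, so it is the classification model with the best overall benefit. It is hoped that the results can serve the NHIA as a basis for selecting cases for early arrears monitoring, improve premium recovery, and contribute substantially to the sustainable development of NHI.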
Abstract (English) To ensure the sustainability of National Health Insurance (NHI), all citizens who meet the insurance qualifications should be insured and pay premiums. The universal coverage rate has reached 99.9%, but the premium collection rate is lower. The issue of arrears should therefore be dealt with actively, making effective use of resources under limited administrative funds, maximizing the recovery of arrears, and urging the insured to assume the obligation to pay premiums.
This research therefore aims to use machine learning to accurately identify the targets for which strategies to increase the premium collection rate can be implemented effectively. The research subject is the 2019 arrears data of insured units of the Northern Division of the NHIA, which serves as the training dataset for the prediction model. Models are built without dimension reduction and with dimension reduction by feature selection (information gain, genetic algorithm), and the analysis uses 22 dimensions: 3 features of the arrears, 13 features of the insured unit, and 6 features of the person in charge. Single classifiers (CART decision tree, multilayer perceptron, and support vector machine) and ensemble learning methods (random forest, Bagging, and AdaBoost) are then used to build the prediction model for NHI premium payment after arrear reminder of insured units.
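As an illustration of the information gain criterion named above, the following minimal Python sketch ranks categorical features by how much they reduce the entropy of the payment label. It is not the thesis's implementation (the appendices indicate the experiments were run in Weka); the feature names `has_prior_arrears` and `unit_size` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Label entropy minus the weighted label entropy after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: did the unit pay after the reminder?
paid = ["yes", "yes", "no", "no", "yes", "no"]
features = {
    # Hypothetical feature that separates the two classes perfectly.
    "has_prior_arrears": ["no", "no", "yes", "yes", "no", "yes"],
    # Hypothetical feature that carries little information about payment.
    "unit_size": ["small", "large", "small", "large", "small", "large"],
}

gains = {name: information_gain(values, paid) for name, values in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking)
```

Dimension reduction by information gain then amounts to keeping only the top-ranked features; how many to keep is a tuning choice that this sketch does not address.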
The classifier model predicts whether an insured unit will pay its premium within one year after the grace period following an arrear reminder. To target reminders more accurately, we propose an improvement strategy for units predicted not to pay within that period: send the arrear reminder by double registered mail instead of ordinary mail. This strategy not only saves the postage for ordinary mail but, more importantly, delivers the double registered mail at least 4 months earlier, so the subsequent administrative enforcement process is accelerated and priority of compensation is secured, which increases the probability of recovering the arrears.
Comparing the AUC value and the model building time of each classifier shows that random forest performs best, followed by Boosting combined with CART, Bagging combined with CART, and CART alone. That is, ensemble learning is indeed better than a single classifier. For the random forest model, the AUC value reaches 0.974 whether or not the dimensions are reduced, which indicates excellent discrimination, and the t-test shows no significant difference. By contrast, the multilayer perceptron and support vector machine perform relatively poorly because of the large size of the dataset.
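For readers unfamiliar with the AUC metric used in this comparison, the sketch below computes it from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels and the scores of `model_a` and `model_b` are hypothetical and are not results from the thesis.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = paid within the observation window) and model scores.
y_true = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
model_b = [0.6, 0.3, 0.7, 0.65, 0.4, 0.2]

auc_a = roc_auc(y_true, model_a)
auc_b = roc_auc(y_true, model_b)
print(f"model_a AUC = {auc_a:.3f}, model_b AUC = {auc_b:.3f}")
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; on a large dataset a sort-based computation or a library routine would be the practical choice.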
To verify prediction performance on new data, the arrears data of January and February 2020 are used as the test dataset. Among the random forest models, the one using information gain performs best with an AUC of 0.828, only slightly higher than the AUC of 0.827 without dimension reduction. However, because feature selection by information gain reduces the dimensions, it both reduces storage space and builds models relatively quickly. Overall, the random forest model with information gain is the best classification prediction model. The results of the study can be provided to the NHIA as a basis for monitoring arrears to improve the premium collection rate.
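Keywords (Chinese) ★ 健保欠費 (NHI premium arrears)
★ 投保單位 (insured unit)
★ 機器學習 (machine learning)
★ 分類預測 (classification prediction)
★ 特徵選取 (feature selection)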
Keywords (English) ★ Arrear of NHI Premium
★ Insured Unit
★ Machine Learning
★ Classification Prediction
★ Feature Selection
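Table of contents Abstract (Chinese)
Abstract
Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
Chapter 2 Literature Review
2.1 NHI Arrears of Insured Units
2.2 Other Financial Risks
2.3 Other Classification Problems
2.4 Machine Learning: Feature Selection
2.4.1 Information Gain
2.4.2 Genetic Algorithm
2.5 Machine Learning: Building Prediction Models
2.5.1 Single Classifiers
2.5.1.1 Decision Tree (DT)
2.5.1.2 Artificial Neural Network (ANN)
2.5.1.3 Support Vector Machine (SVM)
2.5.2 Ensemble Learning
2.5.2.1 Bagging
2.5.2.2 Boosting
2.5.2.3 Random Forest (RF)
Chapter 3 Research Method
3.1 Research Method and Framework
3.2 Data Source
3.2.1 Class Labeling
3.2.2 Variable Processing
3.3 Method Parameters
Chapter 4 Results and Analysis
4.1 Data Preprocessing
4.1.1 Outlier Removal
4.1.2 Standardization
4.1.3 Normalization
4.2 Model Evaluation Metrics
4.3 Building the Prediction Models
4.3.1 Without Dimension Reduction
4.3.2 Dimension Reduction by Information Gain
4.3.3 Dimension Reduction by Genetic Algorithm
4.4 Independent-Samples t-Test
4.5 Model Validation
4.6 Summary
Chapter 5 Conclusions and Recommendations
5.1 Conclusions
5.2 Contributions
5.3 Future Research Directions and Recommendations
References
Appendix 1 Weka Output: Building the Prediction Models
Appendix 2 Weka Output: AUC of Random Forest (10-Fold Cross-Validation)
Appendix 3 Weka Output: Model Validation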
Advisor 蔡志豐   Approval date 2022-4-13

For questions about this thesis, contact the Extension Services Section of the National Central University Library, TEL: (03)422-7151 ext. 57407, or by e-mail.