應用機器學習建立單位健保欠費催繳後繳納預測模型;Using Machine Learning to build a Prediction Model for NHI Premium Payment after Arrear Reminder of Insured Units


Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/88344

Title:	應用機器學習建立單位健保欠費催繳後繳納預測模型;Using Machine Learning to build a Prediction Model for NHI Premium Payment after Arrear Reminder of Insured Units
Authors:	王耘;Wang, Yun
Contributors:	Executive Master Program, Department of Information Management (資訊管理學系在職專班)
Keywords:	Arrear of NHI Premium;Insured Unit;Machine Learning;Classification Prediction;Feature Selection
Date:	2022-04-13
Issue Date:	2022-07-13 23:29:00 (UTC+8)
Publisher:	National Central University (國立中央大學)
Abstract:	To ensure the sustainability of National Health Insurance (NHI), everyone who meets the eligibility requirements should enroll and pay premiums. The coverage rate has reached 99.9%, but the premium collection rate is lower, which creates the unfair situation of people being insured without paying. Arrears should therefore be handled actively, so that limited administrative resources are used effectively to maximize the recovery of premiums and to urge the insured to meet their obligation to pay. This study aims to use machine learning to identify accurately the targets for which measures to raise the premium collection rate can be applied effectively. The training dataset for the prediction model is the arrear data of insured units for premium year 2019 (ROC year 108) from the Northern Division of the National Health Insurance Administration (NHIA).
Models are built both without dimension reduction and with dimension reduction by feature selection (information gain and a genetic algorithm). The analysis uses 22 features: 3 arrear features, 13 insured-unit features, and 6 features of the person in charge. Single classifiers (CART decision tree, multi-layer perceptron, and support vector machine) and ensemble learning methods (random forest, Bagging, and AdaBoost) are then used to build the prediction model for NHI premium payment after an arrear reminder to insured units. The model predicts whether an insured unit that has received an arrear reminder will pay within one year after the grace period. To target reminders more accurately, we propose an improvement strategy: for arrears predicted to remain unpaid within that period, where the reminder would originally have been sent by ordinary mail, send it directly by double registered mail instead. This not only saves the postage for ordinary mail but, more importantly, brings delivery by double registered mail forward by at least four months. The subsequent administrative enforcement process is thereby accelerated, which helps secure priority of repayment and raises the probability of recovering the arrears. Comparing the classifiers by AUC (area under the ROC curve) and model building time, random forest performs best, followed by Boosting combined with CART, Bagging combined with CART, and the single CART classifier, which indicates that ensemble learning outperforms a single classifier. For the random forest model, the AUC reaches 0.974 with no dimension reduction, with information gain, and with the genetic algorithm, indicating excellent discrimination, and a t-test shows no significant difference among the three. The multi-layer perceptron and support vector machine, by contrast, yield relatively poor AUC values and take longer to build, which the study attributes to the large size of the dataset.
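The abstract names information gain as one of its feature-selection criteria. As an illustration only, the following minimal sketch computes information gain for categorical features against a binary paid/unpaid label and ranks the features by it. The feature names (`prior_arrears`, `unit_size`) and the toy records are hypothetical and are not taken from the thesis, and the thesis may have used a different tool or discretization.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Label entropy minus the weighted label entropy after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records: 1 = paid within the period, 0 = not paid.
paid = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    # Assumed feature: whether the unit had earlier arrears.
    "prior_arrears": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    # Assumed feature: a coarse size band for the insured unit.
    "unit_size": ["small", "large", "small", "large", "small", "large", "small", "large"],
}

# Rank features from most to least informative about the label.
ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], paid),
    reverse=True,
)
print(ranking)
```

In this toy data the first feature separates the two classes completely and the second carries no information, so a selection step that keeps only high-gain features would drop the second one.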
To verify how the models perform on data from a new year, the arrear data of insured units for the premium months of January and February 2020 (with the observation period running through the premium payment grace period date of April 15, 2021) are used as the test dataset. Among the random forest models, the one using information gain for dimension reduction performs best, with an AUC of 0.828, which still indicates good discrimination. This is only slightly higher than the AUC of 0.827 obtained without dimension reduction, but because feature selection reduces the number of dimensions, it also reduces storage space and makes model building somewhat faster. Overall, the random forest model with information gain is the best classification prediction model. The results can be provided to NHIA as a basis for selecting cases for early arrear monitoring, so as to improve the premium collection rate and contribute to the sustainability of NHI.
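The comparisons above rest on AUC measured on held-out data. As a hedged illustration of what that number means, the sketch below computes AUC directly from its pairwise definition: the probability that a randomly chosen paying unit receives a higher predicted score than a randomly chosen non-paying unit, with ties counted as one half. The scores and labels are hypothetical, and this is not the thesis's evaluation code.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test-set labels (1 = paid, 0 = not paid) and predicted
# probabilities of payment from some already trained classifier.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(auc(scores, labels))
```

The pairwise loop is quadratic in the number of records, which is fine for illustration; on a dataset of realistic size a rank-based formulation or a library routine would be the practical choice.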
Appears in Collections:	[Executive Master of Information Management] Electronic Thesis & Dissertation

Files in This Item:

File	Description	Size	Format
index.html		0Kb	HTML	174
