Thesis 107453022: Full Metadata Record

DC field | Value | Language
DC.contributor | 資訊管理學系 | zh_TW
DC.creator | 陳美君 | zh_TW
DC.creator | Mei-Chun Chen | en_US
dc.date.accessioned | 2020-7-3T07:39:07Z |
dc.date.available | 2020-7-3T07:39:07Z |
dc.date.issued | 2020 |
dc.identifier.uri | http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=107453022 |
dc.contributor.department | 資訊管理學系 | zh_TW
DC.description | 國立中央大學 | zh_TW
DC.description | National Central University | en_US
dc.description.abstract | 臺灣醫療品質享譽國際,全民納保的健康保險制度在1995年開辦後,龐大的就醫資料為臺灣醫療發展在世界上建立影響力的基石。然而,大量的就醫需求造成醫療費用不斷上漲,使全民健保制度長期運作受到重大挑戰。為使醫療資源更加妥善利用,衛生福利部積極修訂支付制度,因此,以疾病分類碼為基礎的申報制度和費用給付密切相關,編碼的適當性、正確性及完整性成為醫療給付的重要關鍵。 為了解決醫療提供者於臨床醫學疾病名詞多樣性與複雜度,及病歷文字非結構化資料必須運用人力閱讀及理解方能正確分類診斷的困境,本研究應用自然語言處理N-Gram和TF-IDF技術從去識別化的真實病歷資料提取文字特徵向量,搭配機器學習建構四個預測分類模型:SVM、MLP、GBDT與LightGBM,使用交叉驗證減少模型偏差的狀況,評估模型的方式使用Confusion Matrix的Accuracy、Precision、Recall、F1 Score和AUC來檢驗模型的分類效果並比較與分析。並透過三個實驗設計探討臨床醫學上常見的類別不平衡、醫學名詞還原和醫師個人撰寫風格差異問題。 最後結果顯示LightGBM的預測結果優於其他模型,尤其在訓練時間有出色的表現。類別平衡處理有助於提高分類器效果。醫學名詞縮寫具有獨特性,有助於分類判斷。疾病為專有的醫學名詞,雖然醫師表達方式不同,但並不影響對同一疾病的描述方式,不同科別的醫師撰寫病歷風格不影響分類模型結果。 | zh_TW
dc.description.abstract | Taiwan's quality of medical care is internationally renowned. The National Health Insurance (NHI) program, launched in 1995, has accumulated millions of citizens' electronic medical records (EMRs), which have become a foundation for the development of medical technology in Taiwan. However, NHI is costly to operate, and rapidly rising medical expenditure threatens its long-term sustainability. To use resources more efficiently, the NHI Administration has defined several systems to ensure that resources are used appropriately, one of which is ICD-10-CM/PCS coding; correct ICD-10-CM/PCS codes are therefore key to NHI reimbursement. To address the complexity of medical terminology, this research applies the NLP techniques N-gram and TF-IDF to real, de-identified EMRs and builds four classification models, SVM, MLP, GBDT, and LightGBM, with cross-validation. The four models are compared and analyzed using confusion-matrix metrics (accuracy, precision, recall, and F1 score) and AUC. In addition, three experiments examine the effects of class imbalance, abbreviation restoration, and physicians' personal writing styles. The results show that LightGBM outperforms the other models, particularly in training time, and that balancing the originally imbalanced training set during preprocessing improves classifier performance. Medical abbreviations, unlike general-purpose ones, are distinctive and therefore help classification. Because diseases are named by proper medical terms, different physicians may phrase their notes differently yet describe the same disease in the same way; the features selected in training remain the same, so writing style does not affect the model or its results. | en_US
DC.subject | 疾病分類 | zh_TW
DC.subject | 自然語言處理 | zh_TW
DC.subject | 機器學習 | zh_TW
DC.subject | LightGBM | zh_TW
DC.subject | Disease Classification | en_US
DC.subject | Natural Language Processing | en_US
DC.subject | Machine Learning | en_US
DC.subject | LightGBM | en_US
DC.title | 應用自然語言處理與機器學習於疾病分類編碼之探討 | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.title | Natural Language Processing and Machine Learning Techniques for Disease Classification of Medical Records | en_US
DC.type | 博碩士論文 | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US
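
The feature-extraction step named in the abstract (N-gram plus TF-IDF) can be illustrated with a minimal, standard-library-only sketch. This record does not give the thesis's tokenizer, n-gram range, or TF-IDF formula, so everything here is an assumption: the function names `word_ngrams` and `tfidf_vectors` are hypothetical, whitespace tokenization and the 1-to-2 n-gram range are illustrative choices, the smoothed IDF is one common variant, and the three notes are made up, not thesis data.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return lowercase word n-grams for every n in [n_min, n_max]."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(docs, n_min=1, n_max=2):
    """Return (vocabulary, dense TF-IDF vectors), one vector per document."""
    doc_grams = [word_ngrams(d, n_min, n_max) for d in docs]
    vocab = sorted({g for grams in doc_grams for g in grams})
    n_docs = len(docs)
    # Document frequency: how many documents contain each n-gram.
    df = Counter(g for grams in doc_grams for g in set(grams))
    # Smoothed IDF; an assumed variant, not necessarily the thesis's formula.
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1 for g in vocab}
    vectors = []
    for grams in doc_grams:
        counts = Counter(grams)
        total = len(grams) or 1  # avoid division by zero for empty documents
        vectors.append([counts[g] / total * idf[g] for g in vocab])
    return vocab, vectors


# Hypothetical, made-up notes for illustration only.
notes = [
    "acute myocardial infarction with chest pain",
    "type 2 diabetes mellitus without complication",
    "chest pain rule out myocardial infarction",
]
vocab, vectors = tfidf_vectors(notes)
```

Each row of `vectors` is a feature vector that a classifier such as the SVM, MLP, GBDT, or LightGBM models named in the abstract could consume; the classification, cross-validation, and class-balancing steps are not shown here.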
