dc.description.abstract | Taiwan’s medical and health care quality ranks among the best in the world. Millions of citizens’ electronic medical records (EMRs) can be collected through the National Health Insurance (NHI) program, which was founded in 1995, and these EMRs have become the basis for the evolution of medical technology in Taiwan. Although the NHI works well, it requires substantial funding to operate, and rapidly rising costs across all areas of medical need make its situation even worse. To overcome this problem and improve resource efficiency, the NHI Administration has defined many systems to ensure that resources are used correctly, and one of these systems is ICD-10-CM/PCS. Correct ICD-10-CM/PCS coding is the key to NHI benefits.
To address the complexity of medical terminology, this research applied the NLP techniques of N-grams and TF-IDF to real, de-identified EMRs. SVM, MLP, GBDT, and LightGBM models were then constructed with cross-validation, and the four models were compared and analyzed in terms of accuracy, precision, recall, F1 score, and AUC, based on the confusion matrix. In addition, three experiments were designed to examine the impact of personal writing style, the skew of terminology across different subjects, and the need for abbreviation restoration.
The results reveal that LightGBM performs better than the other models, especially in training time, and that the classification models perform better when the originally imbalanced training set is balanced during preprocessing. Unlike general abbreviations used in everyday language, medical abbreviations can contribute to the model because of their uniqueness. Disease names are proper nouns; although different doctors may describe the same disease differently because of personal writing styles, the features selected by the trained model remain the same, so writing style has no influence on the model or its results. | en_US |
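
The N-gram and TF-IDF feature extraction mentioned in the abstract can be illustrated with a minimal sketch. This is not the implementation used in the research: the function names `word_ngrams` and `tfidf`, the whitespace tokenization, the choice of unigrams plus bigrams, the unsmoothed IDF formula, and the three sample notes are all assumptions made for illustration only.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max (hypothetical helper)."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(docs, n_max=2):
    """Map each document to a dict of {n-gram: TF-IDF weight}.

    Assumed weighting: tf = count / total n-grams in the document,
    idf = log(N / document frequency), with no smoothing.
    """
    doc_grams = [word_ngrams(d, n_max) for d in docs]

    # Document frequency: in how many documents each n-gram appears.
    df = Counter()
    for grams in doc_grams:
        df.update(set(grams))

    n_docs = len(docs)
    vectors = []
    for grams in doc_grams:
        counts = Counter(grams)
        total = len(grams)
        vec = {}
        for gram, count in counts.items():
            vec[gram] = (count / total) * math.log(n_docs / df[gram])
        vectors.append(vec)
    return vectors


# Invented example notes, not real EMR data.
notes = [
    "acute myocardial infarction with chest pain",
    "type 2 diabetes mellitus without complication",
    "chest pain ruled out myocardial infarction",
]
vectors = tfidf(notes)

# Show the three highest-weighted n-grams of the first note.
for gram, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1])[:3]:
    print(f"{gram}: {weight:.4f}")
```

Under this weighting, an n-gram that occurs in every document receives a weight of zero, while terms confined to a few documents are weighted more heavily. The resulting sparse vectors are the kind of input a classifier such as SVM or LightGBM could be trained on.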