Evaluating our system on the test dataset of the second task of the natural language challenge held by the i2b2 center in 2014, we found that the system based on conditional random fields obtains an F-score of 88.27%. The configuration with added syntactic rules reaches an F-score of 89.74%, an improvement of 1.47 percentage points, and adding post-processing gives our current best F-score of 91.74%, a further improvement of 2 percentage points.
The electronic medical records of patients provide detailed health information, and because disease risk factors affect the course of a patient's illness, they are an important target for medical text mining. Coronary artery disease was the leading cause of death from 2012 to 2013, so detecting the risk factors of heart disease and tracking their progression over sets of longitudinal records is helpful for recognizing and preventing heart disease. Risk factors are presented as named entities, parts of sentences, tabular entries, and multi-sentence expressions in medical records; therefore, it is difficult to detect them using a single approach.
In this paper, we present a hybrid approach to this task by developing three systems based on the conditional random fields (CRF) model, each of which targets one of three major risk factor categories: disease, medication, and smoker. To recognize risk factors not found by our CRF-based systems, we formulate syntactic rules based on physiological indicators and risk factor keywords. To track patient progression longitudinally, we also use a maximum entropy model to label the identified risk factor mentions with tags that describe their relation to the document creation time.
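As an illustration of what a syntactic rule based on a physiological indicator might look like, the following is a minimal sketch and not the rule set used in this work. The pattern, the function name `find_high_bp`, and the 140/90 mmHg thresholds are all assumptions chosen for the example.

```python
import re

# Hypothetical pattern: a blood-pressure keyword followed, within a few
# non-digit characters, by a "systolic/diastolic" reading.
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\b[^0-9\n]{0,15}(\d{2,3})\s*/\s*(\d{2,3})",
    re.IGNORECASE,
)


def find_high_bp(text, systolic_limit=140, diastolic_limit=90):
    """Return (matched text, systolic, diastolic) for readings above the limits.

    The default limits are assumed values for illustration only.
    """
    hits = []
    for match in BP_PATTERN.finditer(text):
        systolic, diastolic = int(match.group(1)), int(match.group(2))
        if systolic > systolic_limit or diastolic > diastolic_limit:
            hits.append((match.group(0), systolic, diastolic))
    return hits


if __name__ == "__main__":
    note = "BP: 150/95 today. Blood pressure was 120/80 last visit."
    for span, systolic, diastolic in find_high_bp(note):
        print(span, systolic, diastolic)
```

A rule of this kind would only complement a statistical tagger: it catches numeric indicators that a sequence labeler trained on entity mentions may miss, at the cost of hand-tuned patterns and thresholds.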
Our experimental results show that our CRF-based systems achieve an F-score of 88.27% on the i2b2 2014 Track 2 test dataset. Adding the syntactic rules improves the F-score by 1.47 percentage points, to 89.74%. Finally, combining this system with post-processing yields an F-score of 91.74%, a further improvement of 2 percentage points.