Abstract: | Multi-label text classification aims to automatically assign one or more predefined category labels to a text. Common applications include sentiment analysis, topic detection, and news classification. We propose the Label Graph Convolutions Enhanced Hypergraph Attention Networks (LGC-HyperGAT) model, in which a hypergraph attention network models the relationships between words and sentences in the text, a label graph convolutional network captures the implicit correlations among the category labels, and the two networks are then connected to predict the labels of the text. We experiment on two datasets. (1) Chinese healthcare dataset (HealthDoc): we crawled health-related news, articles, and blogs from the web. After the text was preprocessed, three undergraduate students were trained to annotate the category labels manually. The dataset contains 2,724 documents with an average length of 1,096.91 words. There are 9 category labels: disease, health protection, mental health, treatment, examination, ingredient, caution, drug, and elder. The total number of labels is 8,731, an average of 3.21 labels per document. (2) Chinese depression dataset (PsychPark): this dataset was collected from the PsychPark website (http://www.psychpark.org). The texts are users' descriptions of their mental health conditions, which doctors then classified with multiple labels according to the psychological problems described. The dataset contains 2,831 texts with an average length of 247.89 words and 4,425 labels in total across 21 categories, an average of 1.56 labels per document. In the experiments, our proposed LGC-HyperGAT model achieved the best Macro-F1 scores of 0.725 on HealthDoc and 0.35 on PsychPark, outperforming the related models (CNN, LSTM, Bi-LSTM, FastText, BERT, Graph-CNN, TextGCN, Text-Level-GNN, and HyperGAT). Error analysis indicates that the implicit features learned by the label classifier can effectively improve the performance of multi-label text classification. |