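Abstract: | Paraphrase generation has long been an important task in natural language processing. Its goal is to preserve the meaning of a sentence while changing its syntactic structure. Approaches to the task fall broadly into three categories: supervised, semi-supervised, and unsupervised learning. Supervised learning has already produced notable results and performs well on every evaluation metric. Semi-supervised and unsupervised learning are still at the research stage, so relatively few studies address them. For this reason, this study explores an unsupervised approach. Some work on paraphrase generation also studies controlled generation, whose main purpose is to keep certain important words in a sentence so that its meaning does not change. Take the sentence “Trump has a dog.” Here “Trump” and “dog” cannot be changed: replacing “Trump” with “Hillary” would change the meaning of the whole sentence. Controlled generation can be achieved in several ways: some methods use syntactic structure, and others rely on the help of a model. To study controlled generation in an unsupervised model, we modified the architecture of the Transformer model by adding the concept of named entities. The reason is that prior studies indicate that words carrying named entities are usually the words in a sentence that cannot be replaced. In our experiments we treat words with named-entity tags as irreplaceable. Because we want the model to focus on these words during training, we added an embedding of the named-entity tags to the input layer and combined it with positional information. From the experimental results, we propose a way to judge whether named entities are effectively preserved: we compute named-entity recall to determine whether the words carrying named entities are correctly recalled. Our results show that our recall is better than the baseline model's. We also compare against the baseline model's main evaluation metric, iBLEU. iBLEU is an extension of BLEU, which mainly measures how well the generated sentence preserves the meaning of the target sentence; iBLEU is BLEU with a penalty mechanism. In our results, the vast majority of iBLEU scores are better than the baseline model's. This indirectly suggests that named entities have a potential influence on the model. ;Paraphrase generation is one of the important tasks in natural language processing (NLP). Its purpose is to retain the same meaning while using a different syntactic structure. Methods for this task can be classified as supervised, semi-supervised, or unsupervised learning. Supervised learning has produced several promising results and achieved good performance on various metrics. Semi-supervised and unsupervised learning are still at the research stage, so there is not much research on them. For this reason, this research explores unsupervised paraphrasing. In addition, for both supervised and unsupervised paraphrase generation, some researchers are exploring ways to control the generation. The main purpose is to preserve the important words in a sentence so that its meaning does not change. For example, in the sentence “Trump has a dog”, “Trump” and “dog” are words that cannot be replaced. If “Trump” is replaced with “Hillary Clinton”, the meaning of the entire sentence changes.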
There are several ways to control the generation: some use syntactic structure to achieve controllable generation, and others modify the model. In our research, we modified the structure of the Transformer model by introducing the concept of the named entity (NE). The reason is that words with NE tags are usually irreplaceable in a sentence, so in this study we treat words with NE tags as irreplaceable. In the training phase, we therefore expect the model to learn the words with NE tags, and we combine the embedding of the NE tags with the position encoding and the input token embedding for model training. From the experimental results, we propose a method to judge whether entities are effectively retained: we calculate the recall of NEs. Our recall score is better than that of the baseline model. We also compare on the baseline model's main evaluation metric, iBLEU. iBLEU is an extension of BLEU, which mainly measures how well the generated sentence retains the meaning of the target sentence; iBLEU is BLEU with a penalty mechanism. In our results, iBLEU scores are better than the baseline's. This also suggests that our method of using NE constraints has a potential influence. |
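
The abstract names two metrics, NE recall and iBLEU, without giving formulas. The sketch below shows one plausible reading and is not the authors' implementation. It assumes NE recall is the fraction of named entities from the source sentence that reappear in the generated paraphrase, matched by case-insensitive substring. For iBLEU it uses the commonly cited form, which rewards BLEU against the target and penalizes BLEU against the source. The function names, the matching rule, and the default `alpha = 0.9` are all assumptions made for illustration.

```python
from typing import Iterable


def ne_recall(source_entities: Iterable[str], paraphrase: str) -> float:
    """Fraction of source named entities that reappear in the paraphrase.

    Assumption: case-insensitive substring matching, a simplification
    that can over-match (e.g. "Ann" inside "Annual").
    """
    entities = [e.lower() for e in source_entities]
    if not entities:
        # Assumption: with nothing to preserve, recall counts as perfect.
        return 1.0
    text = paraphrase.lower()
    kept = sum(1 for e in entities if e in text)
    return kept / len(entities)


def ibleu(bleu_vs_target: float, bleu_vs_source: float, alpha: float = 0.9) -> float:
    """iBLEU from two precomputed BLEU scores.

    alpha weights similarity to the target; (1 - alpha) weights the
    penalty for similarity to the source. alpha = 0.9 is an assumed default.
    """
    return alpha * bleu_vs_target - (1 - alpha) * bleu_vs_source


if __name__ == "__main__":
    # The abstract's example: "Trump" must survive paraphrasing.
    print(ne_recall(["Trump"], "A dog belongs to Trump."))
    print(ne_recall(["Trump"], "Hillary owns a dog."))
    print(ibleu(bleu_vs_target=0.5, bleu_vs_source=0.5))
```

Under this reading, a paraphrase that swaps “Trump” for “Hillary” gets an NE recall of zero even if it is otherwise fluent. This is the failure that the NE constraint is meant to prevent.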