中文文章級別人物關係擷取之研究;Research on Document-Level Person Relation Extraction in Chinese

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/95832

Title:	中文文章級別人物關係擷取之研究; Research on Document-Level Person Relation Extraction in Chinese
Author:	洪閔昭; Hung, Min-Chao
Contributor:	Department of Computer Science and Information Engineering (資訊工程學系)
Keywords:	Relation Extraction; Document-level Relation Extraction; Named Entity Recognition; Joint Entity and Relation Extraction
Date:	2024-08-20
Upload time:	2024-10-09 17:19:09 (UTC+8)
Publisher:	National Central University (國立中央大學)
Abstract:	This study aims to build a joint entity and relation extraction framework that can be applied to real-world web data. Existing datasets typically come from a single source, such as Wikipedia, so models trained on them struggle to generalize to diverse web content. These datasets also focus mainly on the sentence level, although relations that span sentences and paragraphs are more common in real applications, and document-level relation extraction remains comparatively under-studied. To address the shortage of Chinese datasets, we use advanced large language models (LLMs) to assist with annotation and advance research on Chinese relation extraction. We propose a general-purpose generative annotation pipeline that uses LLMs such as Gemini and GPT-3.5 to help label unlabeled document-level content, which saves substantial labor and time while improving annotation efficiency and accuracy. We use Common Crawl data as the source for the annotated dataset, yielding a more general-purpose dataset and addressing the single-source limitation of traditional datasets.
Thanks to the improved capabilities of LLMs, we experimented with feeding longer documents into the models, and roughly 30% of the relations extracted were cross-sentence relations. To offset the blind spots of any single model, we adopted a cross-validation approach to make the annotations more credible. We also introduced an entity augmentation method that compensates for insufficient entity-pair sampling when a model faces a large number of entities, improving the overall completeness of the annotations. Finally, we fine-tuned pre-trained models with fewer parameters on our annotated dataset and evaluated their performance on real web data. This fine-tuning both tests the quality of the annotated dataset and improves the models' adaptability to real web environments. Overall, the study contributes new technical methods and offers new ideas and resources for future research on relation extraction and named entity recognition. We hope these methods and resources will be applied and validated more broadly on diverse web data, promoting further development of the field.
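Appears in collections:	[Graduate Institute of Computer Science and Information Engineering (資訊工程研究所)] Master's and Doctoral Theses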

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	25

All items in NCUIR are protected by original copyright.
