The motivation of this study is to build a joint entity and relation extraction framework that can be applied to real-world web data. Existing datasets typically come from a single source, such as Wikipedia, so models trained on them struggle to generalize to diverse web content. In addition, existing datasets focus mainly on the sentence level, whereas relations that span sentences and paragraphs are more common in real applications, and research on document-level relation extraction remains comparatively limited. To address the shortage of Chinese datasets, we use advanced large language models to assist with data annotation, advancing research on Chinese relation extraction.

Our study proposes a general-purpose generative annotation pipeline that uses large language models such as Gemini and GPT-3.5 to help annotate unlabeled document-level content. This saves substantial labor and time and improves annotation efficiency and accuracy. We use Common Crawl data as the source for the annotated dataset, building a more general dataset and addressing the single-source limitation of traditional datasets.
Thanks to the enhanced capabilities of large language models (LLMs), we also experimented with feeding longer documents into the models, which did yield cross-sentence relations (approximately 30%).

To address the blind spots of any single model, we adopted a cross-validation approach to improve the credibility of the annotations. We also introduced an entity augmentation method to compensate for the insufficient sampling of entity pairs that occurs when a model faces a large number of entities, improving the overall completeness of the annotations.

Finally, we fine-tuned pre-trained models with fewer parameters on our annotated dataset and evaluated their performance on real web data. This fine-tuning both tests the quality of the annotated dataset and improves the models' adaptability to real web environments.

Overall, our study introduces new technical methods and provides new ideas and resources for future research on relation extraction and named entity recognition. We expect these methods and resources to be applied and validated more broadly on diverse web data, promoting further development of the field.
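
The abstract does not spell out how the cross-validation or the entity augmentation is implemented, so the following is only a minimal sketch of one possible reading. It assumes that cross-validation means keeping the triples on which two annotating models agree, and that entity augmentation means enumerating every ordered entity pair in batches so that no pair goes unqueried. The names `cross_validate`, `entity_pair_batches`, and `batch_size`, as well as the toy triples, are hypothetical and do not come from the study.

```python
from itertools import permutations

# A relation triple: (head entity, relation label, tail entity).
Triple = tuple[str, str, str]


def cross_validate(annotations_a, annotations_b):
    """Keep only the triples produced by both annotators (assumed agreement rule)."""
    return set(annotations_a) & set(annotations_b)


def entity_pair_batches(entities, batch_size):
    """Yield every ordered (head, tail) pair of distinct entities in fixed-size batches.

    Duplicated entity mentions are collapsed first, preserving their order.
    """
    unique_entities = list(dict.fromkeys(entities))
    pairs = list(permutations(unique_entities, 2))
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]


if __name__ == "__main__":
    # Toy annotations standing in for the output of two different LLMs.
    model_a = [("A", "located_in", "B"), ("A", "part_of", "C")]
    model_b = [("A", "located_in", "B"), ("C", "part_of", "A")]
    print(cross_validate(model_a, model_b))

    # Each batch could be sent to a model as a separate relation query.
    for batch in entity_pair_batches(["A", "B", "C"], batch_size=4):
        print(batch)
```

A strict intersection favors precision over recall; a looser agreement rule, or a different batching strategy, would fit the same description equally well.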