Multi-Task Neural Sequence Labeling for Zero-shot Cross-Lingual Boilerplate Removal


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86446

Title:	Multi-Task Neural Sequence Labeling for Zero-shot Cross-Lingual Boilerplate Removal
Author:	Wu, Yu-Hao (吳昱豪)
Contributor:	Department of Computer Science and Information Engineering
Keywords:	Sequence Labeling; Boilerplate Removal; Multi-task Learning; Information Extraction
Date:	2021-07-08
Upload time:	2021-12-07 12:50:48 (UTC+8)
Publisher:	National Central University
Abstract:	Web pages usually contain many kinds of information. Components such as navigation bars, banners, link lists, and footer copyright notices are shared across many pages of the same site and are usually of little interest to users. This mixing of main content with boilerplate makes applications such as information retrieval harder. The task of extracting the main content of a web page, or removing the irrelevant parts, is called "Boilerplate Removal"; the common approach is to classify each web page component as either Boilerplate or Main Content. Most earlier work relies on numerous hand-crafted, domain-specific features (text, DOM tree, or page-structure features) fed to traditional machine learning models. More recent deep learning methods use only HTML tag and text content information; BoilerNet, for example, achieves impressive scores for both boilerplate and content on the CleanEval dataset. However, we observed that BoilerNet can only be applied to web pages in a single language, which does not match the multilingual environment of the real web.
In this thesis, we explore the use of tag embeddings and propose two auxiliary tasks within a multi-task learning framework. These extend the existing state-of-the-art boilerplate removal model into a multilingual model that can handle arbitrary web pages, with no restriction on domain or language. Our method achieves the highest score to date on the CleanEval dataset. For evaluation we adopt Macro F1, which better reflects real-world performance on the boilerplate removal task, and we validate the cross-lingual ability of our method with four different zero-shot experiments. In every experiment we ran, the proposed multi-task model achieves state-of-the-art results.
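The role of Macro F1 mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binary labeling (0 for boilerplate, 1 for main content) and uses hypothetical function names; the thesis's exact evaluation procedure may differ. The example shows a degenerate model that labels everything as boilerplate: accuracy looks high because boilerplate dominates, while Macro F1, which averages the per-class F1 scores with equal weight, exposes the failure on main content.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention (an assumption here): F1 is 0 when the class is never
    # predicted and never present.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             classes: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy page: 8 boilerplate nodes (0) and 2 main-content nodes (1).
y_true = [0] * 8 + [1] * 2
# A degenerate model that predicts "boilerplate" for every node.
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")                   # 0.800
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")   # 0.444
```

Here the boilerplate class scores F1 = 8/9 and the content class scores 0, so the macro average is 4/9 even though 80% of the nodes are labeled correctly.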
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations
