Cross-Language Encyclopedia Article Linking Using Category-based Embedding and English Title Occurrence with Edit Distance

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/72637

Title:	基於分類系統建立文章表示向量應用於跨語言線上百科連結; Cross-Language Encyclopedia Article Linking Using Category-based Embedding and English Title Occurrence with Edit Distance
Author:	Wang, Hung-Hsiang (王泓翔)
Contributor:	Department of Computer Science and Information Engineering
Keywords:	online encyclopedia; cross-language; linking; article representation vector; Wikipedia; Baidu Baike
Date:	2016-10-26
Upload time:	2017-01-23 17:09:21 (UTC+8)
Publisher:	National Central University
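Abstract:	Since Wikipedia went online, it has fundamentally changed how people learn new things. English Wikipedia, the first edition to launch and backed by a dominant language, became the richest and most widely used edition, with several times as many articles as the other editions. English Wikipedia has six times as many articles as Chinese Wikipedia, for example, even though Chinese has the most speakers of any language in the world. We attribute this imbalance to three causes: (1) Chinese Wikipedia launched later and is less well known; (2) most Chinese speakers live in China, where the Great Firewall blocks access to Chinese Wikipedia; and (3) China tends to foster its domestic online encyclopedia, Baidu Baike. The shortage of cross-language links is another major issue in Wikipedia: according to published statistics, only 2.3% of English Wikipedia articles have an internal cross-language link to Chinese, and only about 60% of Chinese articles link to English [1]. The same phenomenon occurs in other languages, such as Japanese [4] and German [5]. The lack of cross-language links hinders global knowledge sharing, including analysis of article viewpoints, cultural exchange, and information dissemination, as well as cross-language research (question answering, information retrieval, machine translation). Besides Wikipedia, there are other language-specific online encyclopedias similar to Baidu Baike, such as Enciclopedia Libre in Spanish, Wikiweise in German, and WikiZnanie in Russian. Our goal is therefore to build cross-language links across online encyclopedias, which both enriches non-English articles and increases the number of cross-language links. We target English Wikipedia and Baidu Baike. We first select candidate entries to narrow the search space, then use a support vector machine (SVM) to judge whether an English article and a Chinese article are corresponding cross-language articles. The SVM features are machine translation, text similarity, hypernyms, and a category-based article vector. On two datasets we obtain a mean reciprocal rank (MRR) of 0.8019 (+0.477) and 0.6824 (+0.157), respectively. Because our linking method does not rely on language-specific characteristics, it can easily be transferred to other languages, and we hope to apply it to more pairs of online encyclopedias in the future. Moreover, the distinctive feature of this work, the category-based article vector, expresses an article's categories as a vector, and we hope it can be applied in other domains to achieve simpler and more effective results.

Our goal is to link corresponding articles from English Wikipedia to Chinese Baidu Baike, a task called cross-language article linking (CLAL). According to the statistics of Wang et al. (2014) [1], only 2.3% of English Wikipedia articles link to their Chinese versions, whereas about 60% of Chinese articles link to their English versions. Moreover, there is a large gap between the numbers of articles in the two editions. Because of this imbalance in article counts and the lack of internal cross-language links between Wikipedia editions, CLAL has become a major issue. Without cross-language links, many applications are hindered, for example global knowledge sharing, cross-language information retrieval, machine translation, and machine understanding. Fortunately, there are other wiki-like online encyclopedias, most of them specific to one language or region, such as Enciclopedia Libre in Spanish, WikiZnanie in Russian, and Wikiweise in German. Linking them to Wikipedia can address both the imbalance and the shortage of internal cross-language links. We target English Wikipedia and Chinese Baidu Baike. Our approach has two stages: candidate selection and link prediction. Candidate selection reduces the search space from the entire encyclopedia to 10 candidates. We then employ LIBSVM as the machine learning tool to predict whether two articles should be linked. The SVM features include title similarity and matching, English title occurrence, hypernym translation, and category-based embedding (CBE), which we designed specifically for this thesis. In our experiments, we reach a mean reciprocal rank (MRR) of 0.8019 and 0.6824 on the two datasets, both better than the baseline (+0.477 and +0.157). Because our approach does not rely on language-dependent features, it is flexible enough to suit other wiki-like encyclopedias. We expect that our core design, CBE, can be applied to a variety of domains to achieve good results with little effort.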
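The abstract reports its results as mean reciprocal rank (平均倒數排名, MRR) over ranked candidate lists. The following is a minimal sketch of how that metric is generally computed, not the thesis's own evaluation code. The function names and the toy article identifiers are hypothetical, and it assumes each English article has exactly one correct Baidu Baike article and scores 0 when that article is missing from the candidate list.

```python
from typing import Hashable, Sequence, Tuple


def reciprocal_rank(ranked: Sequence[Hashable], gold: Hashable) -> float:
    """Return 1/rank of the gold article in the ranked list, or 0.0 if absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    queries: Sequence[Tuple[Sequence[Hashable], Hashable]]
) -> float:
    """Average the reciprocal rank over all (ranked candidates, gold) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in queries) / len(queries)


# Hypothetical toy data: each entry is (ranked candidate IDs, correct ID).
queries = [
    (["b1", "b2", "b3"], "b1"),  # gold at rank 1 -> 1.0
    (["b4", "b5", "b6"], "b5"),  # gold at rank 2 -> 0.5
    (["b7", "b8", "b9"], "b0"),  # gold not among candidates -> 0.0
]
print(mean_reciprocal_rank(queries))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Under this definition, an MRR of 0.8019 means the correct article is, on average, ranked close to the top of the candidate list.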
Appears in collections:	[Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	599

All items in NCUIR are protected by original copyright.
