Abstract: Since Wikipedia went online, it has fundamentally changed how people learn. The English Wikipedia, launched first and backed by a dominant world language, has become the richest and most widely used edition, with several times as many articles as the other language versions; for example, English Wikipedia has six times as many articles as Chinese Wikipedia, even though Chinese has the largest number of speakers in the world. We see three reasons for this imbalance: first, Chinese Wikipedia launched later and is less well known; second, most Chinese speakers live in China, where the Great Firewall blocks access to Chinese Wikipedia; third, China tends to promote its domestic online encyclopedia, Baidu Baike. The shortage of cross-language links is another major issue in Wikipedia: only 2.3% of English Wikipedia articles have inter-language links to their Chinese counterparts, and only about 60% of Chinese articles link to English [1]. The same phenomenon appears in other languages such as Japanese [4] and German [5]. The lack of cross-language links hinders global knowledge sharing, for example argument analysis, cultural exchange, information dissemination, and cross-language research (question answering, information retrieval, machine translation). Besides Wikipedia, there are other language-specific online encyclopedias similar to Baidu Baike: Enciclopedia Libre in Spanish, Wikiweise in German, and WikiZnanie in Russian. Building cross-language links across online encyclopedias therefore became our goal; it both enriches non-English articles and adds cross-language links. We target English Wikipedia and Baidu Baike: candidate selection narrows the search space, and a support vector machine decides whether an English article and a Chinese article correspond to each other. The SVM uses machine translation, text similarity, hypernyms, and category-based article embeddings as features, and achieves mean reciprocal ranks of 0.8019 (+0.477) and 0.6824 (+0.157) on two datasets. Because our linking method does not rely on language-specific properties, it transfers easily to other languages, and we hope to apply it to more online encyclopedias in the future. Moreover, our signature feature, the category-based article embedding, expresses an article's categories as a vector, and we expect it to be useful in other domains as a simple and effective technique.

Our goal is to link corresponding articles from English Wikipedia to Chinese Baidu Baike, a task called "cross-language article linking" (CLAL). According to the statistics of Wang et al. (2014) [1], only 2.3% of English Wikipedia articles link to their Chinese counterparts, while about 60% of Chinese articles link to English, and the number of articles in the two editions differs enormously. Because of this imbalance and the lack of inter-language links between Wikipedia editions, CLAL has become a major issue. Without cross-language links, many tasks suffer, for example global knowledge sharing, cross-language information retrieval, machine translation, and machine understanding. Fortunately, there are other wiki-like online encyclopedias, most of them specific to one language, such as Enciclopedia Libre in Spanish, WikiZnanie in Russian, and Wikiweise in German. We can make good use of them by linking them to Wikipedia, alleviating both the article imbalance and the missing inter-language links. We target English Wikipedia and Chinese Baidu Baike. Our approach has two stages: candidate selection and link prediction. Candidate selection reduces the search space from the entire encyclopedia to 10 candidates. We then employ LIBSVM to predict whether two articles should be linked. The SVM features include title similarity and matching, English title occurrence, hypernym translation, and category-based embedding (CBE), which we designed specifically for this thesis. In the experiments we reach mean reciprocal ranks of 0.8019 and 0.6824 on the two datasets, both better than the baseline (+0.477 and +0.157). Because our approach does not rely on language-dependent features, it can be adapted to other wiki-like encyclopedias. We also expect that our core design, CBE, can be applied to a variety of domains to produce simple and effective results.
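To make the two-stage pipeline concrete, the sketch below shows a minimal link-prediction and ranking step under stated assumptions: the four scalar features, their values, and the use of scikit-learn's SVC (a wrapper around LIBSVM) are illustrative placeholders, not the thesis implementation.

```python
# Minimal sketch, assuming per-pair scalar features and scikit-learn's SVC
# (which wraps LIBSVM) as a stand-in for the LIBSVM setup described above.
import numpy as np
from sklearn.svm import SVC

# Each candidate pair (English Wikipedia article, Baidu Baike article) is a
# feature vector; the four columns are placeholders for title similarity,
# English title occurrence, hypernym translation match, and cosine similarity
# of category-based embeddings (CBE).
X_train = np.array([
    [0.92, 1.0, 1.0, 0.88],   # a corresponding pair
    [0.10, 0.0, 0.0, 0.15],   # an unrelated pair
    [0.85, 1.0, 0.0, 0.74],
    [0.20, 0.0, 1.0, 0.30],
])
y_train = np.array([1, 0, 1, 0])  # 1 = articles correspond, 0 = they do not

model = SVC(kernel="rbf")
model.fit(X_train, y_train)

def rank_candidates(candidate_features):
    """Rank the ~10 selected candidates for one English article by SVM score."""
    scores = model.decision_function(np.asarray(candidate_features))
    return np.argsort(-scores)  # candidate indices, best first

def mean_reciprocal_rank(rankings, gold_indices):
    """MRR: average of 1 / (rank of the gold candidate) over all queries."""
    reciprocal_ranks = []
    for ranking, gold in zip(rankings, gold_indices):
        rank = int(np.where(ranking == gold)[0][0]) + 1
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Toy usage: one English article with three Baidu Baike candidates,
# where candidate 0 is the gold cross-language link.
candidates = [[0.90, 1.0, 1.0, 0.80],
              [0.30, 0.0, 0.0, 0.20],
              [0.55, 0.0, 1.0, 0.40]]
ranking = rank_candidates(candidates)
print(mean_reciprocal_rank([ranking], [0]))
```

Ranking by the SVM decision score (rather than only a hard linked/not-linked label) is what allows the result to be evaluated with mean reciprocal rank, as reported above.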