dc.description.abstract | Our goal is to link corresponding articles in English Wikipedia to those in the Chinese encyclopedia Baidu Baike, a task called “cross-language article linking” (CLAL). According to statistics from Wang et al. (2014) [1], only 2.3% of English Wikipedia articles link to their Chinese versions, whereas the corresponding figure for Chinese articles is 60%. Moreover, there is a tremendous gap between the numbers of articles in the two versions.
Because of the unbalanced numbers of articles and the lack of internal cross-language links between different Wikipedia versions, CLAL has become a major issue. Without cross-language links, many applications are not possible, for example global knowledge sharing, cross-language information retrieval, machine translation, and machine understanding.
Fortunately, there are other wiki-like online encyclopedias, most of them exclusive to a single language, such as “Enciclopedia Libre” in Spanish, “WikiZnanie” in Russian, and “Wikiweise” in German. We can make good use of them by linking them to Wikipedia, which addresses both the imbalance in article numbers and the lack of internal cross-language links.
We target English Wikipedia and Chinese Baidu Baike. Our approach has two stages: candidate selection and link prediction. Candidate selection reduces the search space from the entire encyclopedia to 10 candidates. We employ LIBSVM as the machine learning tool to predict whether two articles should be linked. The SVM features include title similarity and matching, English title occurrence, hypernym translation, and category-based embedding (CBE), which we specially designed in this thesis.
In the experiments, we reach 0.8019 and 0.6824 on the two datasets, both better than the baseline (+0.477 and +0.157). Because our approach does not rely on language-dependent features, it is flexible enough to suit other wiki-like encyclopedias. We expect that our core design, CBE, can be applied to a variety of domains and yield good results easily. | en_US |