Cross-Language Encyclopedia Article Linking Using Category-based Embedding and English Title Occurrence with Edit Distance

Name

王泓翔 (Hung-Hsiang Wang)

Department

Department of Computer Science and Information Engineering

Thesis Title

基於分類系統建立文章表示向量應用於跨語言線上百科連結
(Cross-Language Encyclopedia Article Linking Using Category-based Embedding and English Title Occurrence with Edit Distance)

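This electronic thesis is authorized for immediate open access.
Once open access takes effect, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.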

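Abstract (translated from the Chinese)

Since its launch, Wikipedia has fundamentally changed how people acquire new knowledge. English Wikipedia, the first edition to go online and written in a dominant language, has become the richest and most widely used edition, with several times as many articles as other editions. Chinese Wikipedia is one example: English Wikipedia has six times as many articles, even though Chinese has the largest number of speakers of any language in the world. We see three reasons for this imbalance: (1) Chinese Wikipedia launched later and is less well known; (2) most Chinese speakers live in China, where the Great Firewall blocks access to Chinese Wikipedia; and (3) China tends to support its domestic online encyclopedia, Baidu Baike.
In addition, the shortage of cross-language links is a major issue in Wikipedia. According to published statistics, only 2.3% of English Wikipedia articles have an internal cross-language link to Chinese, and only about 60% of Chinese articles link to English [1]. The same phenomenon occurs in other languages, such as Japanese [4] and German [5]. The lack of cross-language links hinders global knowledge sharing, for example the analysis of article viewpoints, cultural exchange, and information dissemination, as well as cross-language research (question answering, information retrieval, machine translation).
As noted above, besides Wikipedia there are online encyclopedias like Baidu Baike that serve a specific language: Enciclopedia Libre in Spanish, Wikiweise in German, WikiZnanie in Russian, and others. Our goal is therefore to build cross-language links across online encyclopedias, which both enriches non-English content and increases the number of cross-language links.
We target English Wikipedia and Baidu Baike. Candidate selection first narrows the search space, and a support vector machine (SVM) then decides whether a Chinese article and an English article are corresponding cross-language articles. The SVM uses machine translation, text similarity, hypernyms, and category-based article vectors as features. On two datasets, the method achieves mean reciprocal ranks (MRR) of 0.8019 (+0.477) and 0.6824 (+0.157).
Because our linking method does not depend on language-specific characteristics, it can be readily adapted to other languages, and we hope it can be applied to more pairs of online encyclopedias in the future. Furthermore, the distinctive feature of this work, the category-based article vector, expresses an article's categories as a vector; we hope it can be applied in other domains to achieve simpler and more effective results.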
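The abstract reports its results as mean reciprocal rank. As a minimal sketch of how that metric is computed, assuming each query article has exactly one gold counterpart and the system returns a ranked candidate list (the function names here are hypothetical, not taken from the thesis):

```python
from typing import Hashable, Sequence


def reciprocal_rank(ranked: Sequence[Hashable], gold: Hashable) -> float:
    """Return 1/rank of the gold item, or 0.0 if it is not in the list."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]], golds: Sequence[Hashable]
) -> float:
    """Average the reciprocal rank over all queries.

    Assumption: one gold answer per query; a query whose gold answer
    is missing from its candidate list contributes 0.
    """
    if len(rankings) != len(golds):
        raise ValueError("rankings and golds must have the same length")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, golds))
    return total / len(rankings)
```

Under this definition, a score of 0.8019 means the correct article is, on average, ranked close to the top of the candidate list, although the exact evaluation protocol used in the thesis may differ in details such as how missing candidates are handled.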

Abstract (English)

Our goal is to link corresponding articles from English Wikipedia to the Chinese encyclopedia Baidu Baike, a task called cross-language article linking (CLAL). According to statistics from Wang et al. (2014) [1], only 2.3% of English Wikipedia articles link to their Chinese versions, whereas 60% of Chinese articles link to their English versions. Moreover, there is a large gap in the number of articles between the two editions.
Because of the imbalance in article counts and the lack of internal cross-language links between Wikipedia language versions, CLAL has become a major issue. Without cross-language links, many tasks are hindered, for example global knowledge sharing, cross-language information retrieval, machine translation, and machine understanding.
Fortunately, there are other wiki-like online encyclopedias, most of which serve a single language, such as Enciclopedia Libre in Spanish, WikiZnanie in Russian, and Wikiweise in German. Linking them to Wikipedia can help address both the imbalance and the shortage of internal cross-language links.
We target English Wikipedia and Chinese Baidu Baike. Our approach has two stages: candidate selection and link prediction. Candidate selection reduces the search space from the entire encyclopedia to 10 candidates. We use LIBSVM as the machine learning tool to predict whether two articles should be linked. The SVM features include title similarity and matching, English title occurrence, hypernym translation, and category-based embedding (CBE), which we designed specifically for this thesis.
In the experiments, our approach reaches 0.8019 and 0.6824 on the two datasets, improving on the baseline by 0.477 and 0.157, respectively. Because the approach does not rely on language-dependent features, it is flexible enough to suit other wiki-like encyclopedias. We expect that our core design, CBE, can be applied to a variety of domains to obtain good results simply.
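The abstract and title name an "English title occurrence with edit distance" feature but do not define it here. The following is only an illustrative sketch under an assumed definition: the feature fires when the English title appears among the Latin-script words of the Chinese article text, tolerating a small Levenshtein distance [30]. The function names, the tokenization, and the `max_ratio` threshold are all hypothetical and may not match the thesis.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def english_title_occurrence(title: str, text: str, max_ratio: float = 0.2) -> bool:
    """Assumed feature: does `title` occur (approximately) in `text`?

    Only ASCII alphanumeric runs are compared, so Chinese characters in
    the article body are ignored. `max_ratio` is a hypothetical tolerance:
    the allowed edit distance as a fraction of the title length.
    """
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    title_words = re.findall(r"[A-Za-z0-9]+", title.lower())
    n = len(title_words)
    if n == 0 or len(words) < n:
        return False
    target = " ".join(title_words)
    budget = int(max_ratio * len(target))
    # Slide a window with as many words as the title over the text.
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        if levenshtein(window, target) <= budget:
            return True
    return False
```

For instance, `english_title_occurrence("Barack Obama", "贝拉克·奥巴马（Barak Obama），美国第44任总统")` returns `True` under these assumptions, because the misspelled "Barak Obama" is within one edit of the title. In the pipeline described above, a value like this would be one input among several to the SVM.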

Keywords

★ online encyclopedia
★ cross-language
★ linking
★ article representation vector
★ Wikipedia
★ Baidu Baike

Table of Contents

Chinese Abstract p.i
Abstract p.ii
Acknowledgments p.iv
Contents p.v
List of Figures p.viii
List of Tables p.ix
1 Introduction p.1
1.1 Motivation p.1
1.1.1 Knowledge in Online Encyclopedia p.2
1.1.2 Situation of Online Encyclopedia p.2
1.2 Problem Description p.4
1.3 Research Objective p.5
1.4 Thesis Organization p.6
2 Related Work p.7
2.1 Cross-Language Article Linking p.7
2.1.1 CLAL across Wikipedia Language Versions p.8
2.1.2 CLAL between Wikipedia and Baidu Baike p.9
2.1.3 Other Knowledge Base Linking Tasks p.10
2.2 Document Representation Method p.11
2.2.1 Term-based Representation p.11
2.2.2 Topic-based Representation p.11
2.2.3 Term-based learning Representation p.11
2.2.4 Knowledge base Representation p.12
3 Methodology p.15
3.1 Problem Definition p.15
3.2 Article in Online Encyclopediae p.16
3.3 Proposed Method p.21
3.3.1 Category-Based Embedding (CBE) p.21
3.3.2 English Title Occurrence with Edit Distance (ETO_ED) p.21
3.3.3 Framework p.21
3.4 Candidate Selection p.24
3.5 Link Prediction p.25
3.5.1 Category-Based Embedding (CBE) p.25
3.5.2 English Title Occurrence with Edit Distance (ETO_ED) p.29
3.5.3 Title Matching and Title Similarity (Baseline) p.30
3.5.4 Hypernym Translation (HT) p.31
3.6 Classifier p.32
4 Experiment p.33
4.1 Dataset Description p.33
4.1.1 Description of English Wikipedia p.33
4.1.2 Description of Baidu Baike p.33
4.2 Golden Standard Dataset p.36
4.3 Dataset Preprocessing p.37
4.4 Experiment Setup p.38
4.4.1 Baseline Method Setup p.38
4.4.2 Category-Based Embedding Setup p.39
4.5 Experiment Results p.40
4.5.1 Result of CLAL p.40
4.5.2 Result of Category-Based Embeddings p.41
5 Conclusion p.44
5.1 Summary and Contribution p.44
5.2 Future Work p.45
Bibliography p.46

References

[1] Y.-C. Wang, C.-K. Wu and R. T.-H. Tsai, “Cross-language and Cross-encyclopedia Article Linking Using Mixed-language Topic Model and Hypernym Translation,” Proceedings of ACL, pp. 586-591, 2014.
[2] M. Jiang, “The business and politics of search engines: A comparative study of Baidu and Google’s search results of Internet events in China,” New Media & Society, vol. 16, no. 2, pp. 212-233, 2014.
[3] Z. Wang, J. Li, Z. Wang and J. Tang, “Cross-lingual knowledge linking across wiki knowledge bases,” Proceedings of the 21st international conference on World Wide Web, ACM, pp. 459-468, 2012.
[4] P. Sorg and P. Cimiano, “Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach -,” Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence, pp. 49-54, 2008.
[5] J.-H. Oh, D. Kawahara, K. Uchimoto, J. Kazama and K. Torisawa, “Enriching multilingual language resources by discovering missing cross-language links in Wikipedia,” Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, pp. 322-328, 2008.
[6] Z. Wang, J. Li and J. Tang, “Boosting Cross-Lingual Knowledge Linking via Concept Annotation,” Proceedings of IJCAI, 2013.
[7] S. F. Adafre and M. de Rijke, “Finding similar sentences across multiple languages in Wikipedia,” EACL ’06 Workshop on New Text, Wikis and Blogs and Other Dynamic Text Sources, 2006.
[8] K. Kishida, “Technical issues of cross-language information retrieval: a review,” Information Processing & Management, vol. 41, no. 3, pp. 433-455, 2005.
[9] V. Jijkoun and M. de Rijke, “Overview of the WiQA Task at CLEF 2006,” Workshop of the Cross-Language Evaluation Forum for European Languages, pp. 265-274, 2006.
[10] P. Schönhofen, A. Benczúr, I. Bíró and K. Csalogány, “Cross-Language Retrieval with Wikipedia,” Advances in Multilingual and Multimodal Information Retrieval, Lecture Notes in Computer Science, vol. 5152, pp. 72-79, 2008.
[11] M. Potthast, B. Stein, and M. Anderka, “A wikipedia-based multilingual retrieval model,” Proceedings of the IR research, 30th European conference on Advances in information retrieval, ECIR’08, pp. 522-530, Berlin, Heidelberg, 2008.
[12] R. C. Bunescu and M. Pasca, “Using encyclopedic knowledge for named entity disambiguation,” European Chapter of the Association for Computational Linguistics (EACL), 2006.
[13] M. Erdmann, K. Nakayama, T. Hara and S. Nishio, “A bilingual dictionary extracted from the Wikipedia link structure,” Database Systems for Advanced Applications, pp. 686-689, Springer Berlin Heidelberg, 2008.
[14] V. Ng, “Supervised noun phrase coreference research: The first fifteen years,” Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1396-1411, 2010.
[15] W. Wentland, J. Knopp, C. Silberer, and M. Hartung, “Building a multilingual lexical resource for named entity disambiguation, translation and transliteration,” Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), Marrakech, Morocco, May 2008.
[16] P. McNamee, J. Mayfield, D. Lawrie, D. W. Oard and D. S. Doermann, “Cross-Language Entity Linking,” IJCNLP, pp. 255-263, 2011.
[17] H. Ji and R. Grishman, “Knowledge base population: Successful approaches and challenges,” Association for Computational Linguistics, 2011.
[18] K. Spärck Jones, “A Statistical Interpretation of Term Specificity and Its Application in Retrieval,” Journal of Documentation, vol. 28, pp. 11-21, 1972.
[19] K. Spärck Jones, “Index term weighting,” Information Storage and Retrieval, vol. 9, no. 11, pp. 619-633, 1973.
[20] G. Salton, E. A. Fox, and H. Wu, “Extended Boolean information retrieval,” Communications of the ACM, vol. 26, pp. 1022-1036, 1983.
[21] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[22] C. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala, “Latent semantic indexing: A probabilistic analysis,” Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. ACM, 1998.
[23] T. Hofmann, “Probabilistic latent semantic indexing,” Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
[24] D. M. Blei, A. Y. Ng and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[25] Y. Bengio, R. Ducharme, P. Vincent and C. Jauvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003.
[26] Z. Wang, J. Zhang, J. Feng and Z. Chen, “Knowledge Graph and Text Jointly Embedding,” Proceedings of EMNLP, Citeseer, pp. 1591-1601, 2014.
[27] Y. Lin, Z. Liu, M. Sun, Y. Liu and X. Zhu, “Learning Entity and Relation Embeddings for Knowledge Graph Completion,” Proceedings of AAAI, pp. 2181-2187, 2015.
[28] Z. Hu, P. Huang, Y. Deng, Y. Gao and E. P. Xing, “Entity Hierarchy Embedding,” Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), vol. 1, pp. 1292-1300, 2015.
[29] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado and J. Dean, “Distributed representations of words and phrases and their compositionality,” Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
[30] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, vol. 10, no. 8, pp. 707-710, February 1966.
[31] M. Beaulieu, M. Gatford, X. Huang, S. Robertson, S. Walker, and P. Williams, “Okapi at TREC-5,” Proceedings of the Fifth Text REtrieval Conference (TREC-5), pp. 143-166, 1997.
[32] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, article 27, 2011.
[33] Y.-C. Wang, C.-K. Wu and R. T.-H. Tsai, “Cross-language article linking with different knowledge bases using bilingual topic model and translation features,” Knowledge-Based Systems, vol. 111, pp. 228-236, Elsevier, 2016.

Advisor

蔡宗翰 (Richard Tzong-Han Tsai)

Approval Date

2016-10-26
