A Similarity-based Method to Retrieve Bilingual Documents from the Theses and Dissertation Database

Name

張保擏 (Hendy Sulistio)

Department

Department of Information Management

Thesis Title

A Similarity-based Method to Retrieve Bilingual Documents from the Theses and Dissertation Database

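This electronic thesis is authorized for immediate open access.
Electronic full texts that have reached their open-access date are licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid breaking the law.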

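Abstract (Chinese)

The volume of electronic documents has grown enormously, and network technology lets users share information and knowledge independently. Documents are also written in many different languages. This situation leads us to develop a method that can retrieve documents precisely and overcome the language barrier.
In this study, we develop a similarity-based method for retrieving bilingual scientific documents from the theses and dissertations system. We compute the similarity of bilingual documents (Chinese and English). Building a retrieval system that can overcome the language barrier is a challenging task.
Each scientific document in our study is divided into four fields: title, keywords, abstract, and cited references. We use a different algorithm to compute the similarity of each field. The results show that our method can retrieve bilingual documents accurately.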

Abstract (English)

The quantity of electronic documents has grown tremendously, and internet technology enables users to share information and knowledge independently. The languages in which documents are written also vary. This phenomenon has led us to develop a methodology that can retrieve documents precisely and overcome the language barrier.
In this research, we develop a similarity-based methodology to retrieve bilingual scientific documents from the Theses and Dissertation System. We compute the similarity of bilingual documents (Chinese and English). Integrating a retrieval system with the ability to overcome the language barrier is a challenging task.
Every scientific document in our research is divided into four fields: Title, Keyword, Abstract, and Cited Reference. We use a different technique to compute the similarity of each field. Our results show that the methodology is able to retrieve bilingual documents accurately.
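
The field-wise idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the thesis's actual method: the abstract does not say which technique is used for which field, so this sketch uses cosine similarity over term counts for Title and Abstract, Jaccard overlap for Keyword and Cited Reference, and equal field weights. It also assumes both documents have already been brought into one language and tokenized (for example, after a translation step). The names `cosine`, `jaccard`, `MEASURES`, `WEIGHTS`, and `document_similarity` are hypothetical.

```python
import math
from collections import Counter

# The four fields named in the abstract.
FIELDS = ("title", "keywords", "abstract", "references")


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard overlap between two item lists, treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Assumed pairing of fields and measures (illustrative only).
MEASURES = {
    "title": cosine,
    "keywords": jaccard,
    "abstract": cosine,
    "references": jaccard,
}

# Assumed equal weights; the abstract does not state any weighting.
WEIGHTS = {field: 0.25 for field in FIELDS}


def document_similarity(doc_a, doc_b, weights=WEIGHTS):
    """Weighted average of per-field similarities for two documents.

    Each document is a dict mapping a field name to a list of tokens
    (or reference strings); missing fields count as empty.
    """
    total = sum(weights.values())
    score = sum(
        weights[f] * MEASURES[f](doc_a.get(f, []), doc_b.get(f, []))
        for f in FIELDS
    )
    return score / total if total else 0.0
```

Under these assumptions, two documents with identical fields score close to 1 and two documents with no shared tokens or references score 0, so candidate documents can be ranked by `document_similarity` against a query document.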

Keywords (English)

★ Bilingual
★ Similarity-based
★ Text Mining

Table of Contents

Chapter 1 Introduction
1.1. Research Background
1.2. Research Motivation
1.3. Research Purpose
1.4. Research Flow
1.5. Thesis Structure
Chapter 2 Literature Review
2.1. Document Preprocessing
2.1.1 Document Preprocessing (Chinese Documents)
2.1.2 Document Preprocessing (English Documents)
2.2. Translation of Documents
2.2.1 Statistical Machine Translation
2.2.2 Bilingual Comparable Text Corpora
2.3. Document Matching
2.4. Related Technology and Method
2.4.1 Information Retrieval
Chapter 3 Methodology
3.1. Translation Process
3.2. Similarity Based Method
Chapter 4 Experiment
4.1. Experiment Environment and Data
4.2. Experiment Design
4.3. Experiment Result
Chapter 5 Conclusion
5.1. Discussion
5.2. Future Research
References

Advisor

陳彥良 (Yen-Liang Chen)

Approval Date

2009-7-21
