Segment Similarity Based on Text Similarity: A Case Study of Four Gospels

Name

Han-Wen Chi (紀涵文)

Department

Department of Information Management

Thesis Title

Segment Similarity Based on Text Similarity: A Case Study of Four Gospels
(以文本相似度為基礎的段落相似度分析：聖經四福音書之案例研究)

Files

Full text in the system: never open to the public

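Abstract (Chinese)

Text mining applies data mining techniques to the textual content of documents, using the analysis to uncover relationships among words for classification, comparison, and discrimination. Over the past decade, the rise of search engines has put text mining techniques to more effective use and created new business value. As the Internet evolves, the accumulation of online data has accelerated the development of search engines and overturned long-standing assumptions about data retrieval.
Text similarity assigns weights (or distances) between textual units, computes how similar those units are, and aggregates and compares the results to obtain information, perform classification, or make binary judgments. Applying this approach to a large number of article passages yields valuable information.
This study proposes a new similarity matching method. We treat any run of consecutive text in a document as a segment, score the segment against the other sentences, and use the ranking and distribution of the scores to find similar target segments within the same document. Using the four Gospels of the Bible as a case study, we demonstrate how the algorithm operates and the results it is expected to produce, and we compare the expected results under different parameter settings.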
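
The weighting-and-scoring idea behind text similarity can be illustrated with a minimal sketch. The record does not state which weighting scheme the thesis uses, so this example assumes smoothed TF-IDF weights with cosine similarity; the function names and the toy sentences are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase, split on whitespace, and strip simple punctuation.
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]

def tfidf_vectors(sentences):
    # Build one sparse TF-IDF vector (a dict) per sentence.
    # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]

def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

# Toy sentences for illustration only (not quotations from any text).
sentences = [
    "the blind man called out by the road",
    "a blind man sat by the road and called out",
    "the fishermen left their nets",
]
vecs = tfidf_vectors(sentences)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first pair shares most of its words and so scores higher than the second pair, which shares only a common function word.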

Abstract (English)

Text mining is the analysis of document text using data mining techniques. Its main purpose is to uncover the relationships among pieces of text and, through that analysis, to support classification, comparison, and discrimination. Over the past decade, search engines have emerged, and text mining techniques have been applied more effectively to create new business value. With the ever-changing Internet, the accumulation of information online has accelerated the development of search engines and has greatly changed data retrieval.
Text similarity measures the degree of similarity between pieces of text by weighting (distance). The resulting scores are used to obtain information, to classify, or to make binary judgments, and analyzing a large quantity of articles in this way reveals valuable information.
In this research, we propose a new method of similarity calculation. We treat any run of consecutive sentences in a document as a segment, compare this segment with other sentences to obtain scores, and find the similar target segment in the same document from the rank and distribution of the scores. We use the four Gospels of the Bible as a case study, which demonstrates the operation of the algorithm and the expected results.
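
The segment search described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's algorithm: it assumes plain bag-of-words cosine scores, a fixed score threshold, a `max_gap` tolerance for grouping nearby verses, and ranking candidates by total score. All function names, parameters, and toy verses are hypothetical.

```python
import math
from collections import Counter

def bag(text):
    # Bag-of-words counts over lowercase whitespace tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def score_verses(verses, start, end):
    # Score every verse against the query segment verses[start:end].
    # Verses inside the query itself get 0.0 so they are never returned.
    query = bag(" ".join(verses[start:end]))
    return [0.0 if start <= i < end else cosine(query, bag(v))
            for i, v in enumerate(verses)]

def candidate_segments(scores, threshold, max_gap=1):
    # Group indices scoring >= threshold into runs, bridging up to
    # max_gap low-scoring verses between two high-scoring ones.
    segments, current = [], None
    for i, s in enumerate(scores):
        if s < threshold:
            continue
        if current is not None and i - current[1] <= max_gap + 1:
            current[1] = i
        else:
            if current is not None:
                segments.append(tuple(current))
            current = [i, i]
    if current is not None:
        segments.append(tuple(current))
    return segments

def best_segment(scores, segments):
    # Assumption: rank candidates by the total score over their span.
    return max(segments,
               key=lambda seg: sum(scores[seg[0]:seg[1] + 1]),
               default=None)

# Toy verses for illustration only (not quotations from any text).
verses = [
    "a blind man sat by the road",
    "he called out for mercy",
    "the fishermen mended their nets",
    "a storm rose over the lake",
    "a blind man was sitting by the road",
    "he called out loudly for mercy",
    "the crowd went up the mountain",
]
scores = score_verses(verses, 0, 2)
segments = candidate_segments(scores, threshold=0.4)
print(segments, best_segment(scores, segments))
```

Ranking by total score favors longer runs of matching verses; ranking by mean score would instead favor short, very close matches. Which is appropriate depends on the parameters being compared.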

Keywords (Chinese)

★ Text Similarity
★ Segment Similarity
★ Bible Verses

Keywords (English)

★ Text Similarity
★ Segment Similarity
★ Bible
★ Latent Semantic Analysis (LSA)

Table of Contents

1. Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
2. Literature Review
2.1 Data Preprocessing
2.2 Feature Selection
2.3 Vector Construction
2.4 Dimensionality Reduction
2.5 Similarity Computation
2.6 Sentence Similarity
2.7 Word Similarity
3. Research Method
3.1 Research Data
3.2 Stage 1: Data Preprocessing
3.3 Stage 2: Similarity Computation
3.4 Stage 3: Computing Similarity to the Input Segment
3.5 Stage 4: Finding Verse Clusters as Candidate Segments
3.6 Stage 5: Obtaining the Target Segment
4. Case Studies
4.1 Case 1: Matthew 10:1-16
4.2 Case 2: Mark 10:46-52
4.3 Validation Metrics
5. Conclusion
6. References

Advisor

Yen-Liang Chen (陳彥良)

Approval Date

2017-7-25
