Inter-Document Difference-Based Summarization with an Implementation for Chinese Summaries

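Name

黃慶杰 (Ching-Jie Huang)

Department

Department of Information Management

Thesis title

Inter-Document Difference-Based Summarization with an Implementation for Chinese Summaries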

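This electronic thesis is authorized for immediate open access.
Electronic full texts that have reached their open-access date are licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid breaking the law.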

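Abstract (Chinese)

This study proposes implementing multi-document summarization by summarizing on the basis of differences between documents, in contrast to implementing it with a single-layer architecture. The approach addresses two problems: summary sentences being drawn from only a few sub-concept topics or a single one, and, when a single topic is tracked, summary sentences being taken from related sentences in unrelated documents. Both single-document and multi-document summarization are realized with an unsupervised, graph-based, extractive method, and the semantic term network used in the method is built from the latest Wikipedia dataset. Building on single-document summarization, the multi-document method uses the sentence-position feature and selects the first sentence from each document in turn. Processing the documents in different orders yields two kinds of concept summary, topic development and topic focus, which gives document summaries a wider range of applications. The experiments examine how the new Wikipedia dataset used for the term network affects summary quality and find that updating the dataset has no significant effect on the study's parameter values. The proposed method was applied to English summarization on DUC 2002. Compared with the other participants, single-document summarization ranked above the middle and multi-document summarization stayed in the middle. For Chinese summarization, a news dataset from BBC Chinese was used. Because a title is the text that conveys a document's topic, this study treats it as the document's concept topic and computes the similarity between concept topics and a query topic to examine the effect of topic tracking. The method was applied to both topic-focused and topic-developing news. The results show that topic-focus summaries concentrate on the main topic, whereas topic-development summaries effectively extract the sub-topic concepts across documents.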
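
The selection step described in this abstract, taking one sentence at a time from each document, can be sketched as a round-robin over per-document summaries. This is a minimal illustration and not the thesis's implementation: the function name `interleave_summaries`, the input layout (each document's summary as a list of sentences already ranked best-first), and the duplicate check are all assumptions. How the thesis chooses its document orders is not stated in this record, so the sketch simply takes the order as a parameter.

```python
from typing import Dict, List, Sequence


def interleave_summaries(
    doc_summaries: Dict[str, List[str]],
    order: Sequence[str],
    max_sentences: int,
) -> List[str]:
    """Pick sentences round-robin from per-document summaries.

    doc_summaries maps a document id to its single-document summary,
    a list of sentences ranked best-first (hypothetical layout).
    order is the sequence in which documents are visited on each pass.
    """
    picked: List[str] = []
    seen = set()
    rank = 0  # which position of each summary the current pass reads
    longest = max((len(doc_summaries[d]) for d in order), default=0)
    while len(picked) < max_sentences and rank < longest:
        for doc_id in order:
            sentences = doc_summaries[doc_id]
            if rank >= len(sentences):
                continue  # this document's summary is exhausted
            sentence = sentences[rank]
            if sentence in seen:
                continue  # skip exact duplicates across documents
            seen.add(sentence)
            picked.append(sentence)
            if len(picked) == max_sentences:
                break
        rank += 1
    return picked


summaries = {
    "doc_a": ["A1", "A2"],
    "doc_b": ["B1", "B2"],
    "doc_c": ["C1"],
}
print(interleave_summaries(summaries, ["doc_a", "doc_b", "doc_c"], 4))
print(interleave_summaries(summaries, ["doc_c", "doc_b", "doc_a"], 2))
```

Changing only the `order` argument changes which documents contribute when the summary length is tight, which is one plausible reading of how different processing orders could lead to different kinds of summary.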

Abstract (English)

This study proposes an inter-document-based approach to multi-document summarization that differs from a single-layer architecture. The method addresses two problems: summaries composed of sentences from only one or a few sub-concepts, and, during topic tracking, summaries that extract related sentences from unrelated documents. The system applies unsupervised, graph-based extractive summarization, and the semantic relationships between terms are derived from the latest Wikipedia dataset. Multi-document summarization uses the sentence-position feature from feature-based summarization, choosing the first sentence of each single-document summary. Processing the documents in different orders yields two kinds of concept summary: topic development and topic focus. The experiments show that the new Wikipedia dataset has no significant effect on the parameters. On the DUC 2002 dataset, the proposed method ranked above the middle among the participants for single-document summarization and in the middle for multi-document summarization. On BBC Chinese news, topic-focus summaries tended toward the primary concept, whereas topic-development summaries tended toward sub-concepts. Topic tracking was examined by computing the similarity between document titles, because a title consists of the words that convey the content. In the experiment, this approach effectively identified related documents.
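
The title-based topic tracking mentioned at the end of this abstract can be illustrated with a cosine similarity between term-count vectors. The sketch below is an assumption-laden example rather than the thesis's system: the function names, the whitespace tokenization, and the threshold values are hypothetical, and Chinese titles would first need word segmentation before they could be vectorized this way.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def related_documents(query: str, titles: Dict[str, str], threshold: float) -> List[str]:
    """Return ids of documents whose title is similar enough to the query.

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    query_vec = Counter(query.lower().split())
    related = []
    for doc_id, title in titles.items():
        title_vec = Counter(title.lower().split())
        if cosine_similarity(query_vec, title_vec) >= threshold:
            related.append(doc_id)
    return related


titles = {
    "d1": "earthquake relief effort continues",
    "d2": "stock market closes higher",
    "d3": "earthquake damage report",
}
print(related_documents("earthquake relief", titles, 0.5))
```

With the stricter threshold only the title sharing both query terms passes. Lowering the threshold admits titles that share a single term, so the threshold controls how aggressively unrelated documents are filtered out.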

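Keywords (Chinese)

★ Inter-document difference
★ Sentence position
★ Extractive summarization
★ Multi-document summarization
★ Chinese summarization
★ Topic tracking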

Keywords (English)

★ Inter-document based
★ Sentence position
★ Extractive summarization
★ Multi-document summarization
★ Chinese summarization
★ Topic tracking

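Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Thesis Organization
2. Literature Review
2-1 Automatic Document Summarization
2-2 From Single-Document to Multi-Document Summarization
2-3 Sentence-Feature Summarization Methods
2-3-1 Document Title
2-3-2 Sentence Length
2-3-3 Sentence Position
2-3-4 Numerical Data
2-3-5 Topic Words
2-3-6 Summary
2-4 Graph-Based Summarization Methods
2-5 Link Analysis Methods
2-5-1 Degree and Strength
2-5-2 K-core
2-5-3 Locality Index
2-5-4 PageRank
2-6 Normalized Google Distance
2-7 Cosine Similarity
2-8 Combined Ranking Methods
2-9 Relationship Between Document Content and Title
2-10 Chinese Word Segmentation
2-11 Chinese Wikipedia Dataset
3. System Design and Architecture
3-1 System Concept and Workflow
3-2 System Environment Setup
3-2-1 Jetty
3-2-2 Solr Full-Text Search System
3-2-3 Wikipedia Dataset
3-3 Single-Document Summarization System
3-3-1 Dataset Preprocessing
3-3-1-1 Raw Dataset Processing
3-3-1-2 Part-of-Speech Combinations
3-3-1-3 Term Length
3-3-2 Building the Term Network
3-3-2-1 Wikipedia Search Result Counts for Terms
3-3-2-2 Building Links Between Terms
3-3-2-3 Finding Key Term Groups
3-3-3 Building the Sentence Network
3-3-3-1 Converting Sentences to Vectors
3-3-3-2 Building the Sentence Network Matrix
3-3-4 Node Scoring
3-3-4-1 Degree
3-3-4-2 Strength
3-3-4-3 K-core
3-3-4-4 Locality Index
3-3-4-5 PageRank
3-3-5 Sentence Ranking and Sentence Selection
3-4 Multi-Document Summarization System
3-4-1 Inter-Document Preprocessing
3-4-2 Extracting Summary Sentences
3-5 Topic Tracking
3-5-1 Title Sentence Vectors
3-5-2 Title Sentence Similarity
4. Experimental Design and Results
4-1 Experimental Environment
4-2 Datasets
4-3 Evaluation Method
4-4 Experimental Results and Discussion
4-4-1 Experiment 1: Effect of Different Wikipedia Search Result Count Threshold Ranges on Summary Quality
4-4-2 Experiment 2: NGD Threshold for Key Terms
4-4-3 Experiment 3: Effect of the Cosine Similarity Threshold on Summary Quality
4-4-4 Experiment 4: Single-Document Summary Quality
4-4-5 Experiment 5: Quality of Inter-Document Difference-Based Multi-Document Summaries
4-4-6 Experiment 6: Summary Quality Evaluation of the Proposed System
4-4-7 Experiment 7: Chinese Multi-Document Summarization Results
4-4-8 Experiment 8: Topic-Tracking Filtering to Find Related Documents
5. Conclusions and Future Research Directions
5-1 Conclusions
5-2 Future Research
References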
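
Sections 2-6 and 3-3-2 of this outline refer to the Normalized Google Distance (NGD) and to Wikipedia search result counts for terms. As a sketch of how such counts become a distance between two terms, the function below implements the standard NGD formula of Cilibrasi and Vitányi, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). Treating f as the number of Wikipedia pages that contain a term and N as the total number of indexed pages is an assumption about the thesis's setup. The function name `ngd` and the handling of terms that never co-occur are illustrative choices.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance computed from hit counts.

    fx, fy: number of pages containing term x / term y
    fxy:    number of pages containing both terms
    n:      total number of indexed pages (must exceed min(fx, fy))
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur in at least one page")
    if n <= min(fx, fy):
        raise ValueError("n must be larger than the smaller term count")
    if fxy <= 0:
        return math.inf  # terms never co-occur: treat as maximally distant
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Two terms that always appear together have distance 0.
print(ngd(100, 100, 100, 1000))
```

A term network like the one named in section 3-3-2 could then link two terms whenever their distance falls below a chosen threshold, which is the kind of parameter Experiment 2 in the outline appears to study.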

Advisor

林熙禎 (She-Jen Lin)

Review date

2016-7-25
