Extracting Multi-Document Summaries with a Sentence-Network Clustering Framework

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/65669

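Title:	Extracting Multi-Document Summaries with a Sentence-Network Clustering Framework (以文句網路分群架構萃取多文件摘要)
Author:	黃嘉偉; Huang, Jia-Wei
Contributor:	Department of Information Management
Keywords:	Text mining; Graph-based network; Clustering method; Multi-document summarization
Date:	2014-07-15
Upload time:	2014-10-15 17:07:55 (UTC+8)
Publisher:	National Central University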
Abstract:	With the rapid development of information technology in recent years, the number of electronic documents has grown sharply. Extracting the important sentences of documents into a summary helps readers absorb their content quickly instead of spending excessive time reading. Traditional extractive methods, however, judge a sentence only by whether it contains important terms; they lack higher-level concepts such as topics, and the extracted sentences do not give a comprehensive account of the news event as a whole. This study extracts multi-document summaries with a graph-based summarization method, one of the indicator representation approaches, which represent a document by smaller fragments; this study uses sentences. After a graph network of relations is built from these sentences, clustering and several link analysis methods are used to score the nodes, the cluster weight is incorporated into each score, and the selected nodes form the summary. The experiments test system performance on the DUC 2002 and TAC 2010 datasets and measure summary quality with ROUGE. The results show that the proposed method reaches reasonable quality across different summarization tasks. On DUC 2002, the ROUGE-1 scores for 50-word and 100-word multi-document summaries reach 0.2996 and 0.3412, close to the performance of that year's participants, and the ROUGE-1 score for 200-word summaries is 0.4559, a mid-level result. On the first part of the TAC 2010 Guided Summarization task, the ROUGE-1 score reaches 0.3513, higher than all of that year's participants, and the ROUGE-2 score reaches 0.0707, also a mid-level result.
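The pipeline outlined in the abstract (sentence graph, clustering, link analysis, cluster-weighted scoring, sentence selection) can be illustrated with a minimal Python sketch. The abstract does not specify the similarity measure, the clustering algorithm, the link analysis variants, or the cluster weighting, so every concrete choice here is an assumption made for illustration: bag-of-words cosine similarity for edges, connected components as a stand-in for clustering, a weighted PageRank as the link analysis step, and relative cluster size as the cluster weight. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Lowercased alphanumeric tokens; a real system would likely also stem and drop stop words.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_graph(sentences, threshold=0.1):
    # Assumed edge rule: connect two sentences when their similarity reaches the threshold.
    vecs = [Counter(tokenize(s)) for s in sentences]
    n = len(sentences)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(vecs[i], vecs[j])
            if sim >= threshold:
                w[i][j] = w[j][i] = sim
    return w


def pagerank(w, damping=0.85, iters=50):
    # Weighted PageRank by power iteration; isolated nodes keep only the teleport share.
    n = len(w)
    rank = [1.0 / n] * n
    out = [sum(row) for row in w]
    for _ in range(iters):
        rank = [
            (1 - damping) / n
            + damping * sum(rank[j] * w[j][i] / out[j] for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return rank


def connected_components(w):
    # Stand-in for the clustering step: each connected component is one cluster.
    n = len(w)
    label = [-1] * n
    current = 0
    for start in range(n):
        if label[start] != -1:
            continue
        label[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if w[u][v] > 0 and label[v] == -1:
                    label[v] = current
                    stack.append(v)
        current += 1
    return label


def summarize(sentences, max_words=100, threshold=0.1):
    if not sentences:
        return []
    w = build_graph(sentences, threshold)
    rank = pagerank(w)
    label = connected_components(w)
    size = Counter(label)
    n = len(sentences)
    # Assumed cluster weight: the fraction of all sentences that fall in the node's cluster.
    score = [rank[i] * size[label[i]] / n for i in range(n)]
    chosen, used = [], 0
    for i in sorted(range(n), key=lambda k: score[k], reverse=True):
        length = len(sentences[i].split())
        if used + length <= max_words:
            chosen.append(i)
            used += length
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `summarize(sentences, max_words=50)` or `max_words=100` mirrors the word budgets of the DUC 2002 tasks mentioned in the abstract. The greedy selection under a word budget is likewise an assumption, since the abstract does not describe how the summary length is enforced.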
Appears in collections:	[Graduate Institute of Information Management] Theses and Dissertations

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	439

All items in NCUIR are protected by the original copyright.
