部落格意見檢索系統之設計-部落格內文之擷取與不相關部落格之過濾; Blog Post Extraction and Irrelevant Blog Filtering for Opinion Search Engine


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/9786

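Title:	部落格意見檢索系統之設計-部落格內文之擷取與不相關部落格之過濾;Blog Post Extraction and Irrelevant Blog Filtering for Opinion Search Engine
Author:	楊萍華;Ping-hua Yang
Contributor:	Graduate Institute of Computer Science and Information Engineering (資訊工程研究所)
Keywords:	內文擷取;意見檢索;部落格;blog post extract;opinion retrieval
Date:	2009-07-23
Upload time:	2009-09-22 11:56:09 (UTC+8)
Publisher:	National Central University Library (國立中央大學圖書館)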
Abstract:	The blogosphere is the community formed by blogs, and the share of blogs among the 100 most popular websites has been growing year by year. Blog posts cover diverse topics and contain both objective facts and subjective opinions. In the past, users who needed information on a specific topic could turn to television, newspapers and magazines, or search engines, but these channels cost more time and yield more limited information. In this thesis we therefore combine blogs with search engines to build a blog opinion retrieval system that presents the public's objective and subjective opinions on a given topic and makes finding opinions quick and convenient. The system returns opinions in two ways and periodically refreshes the blog pages for each topic so that users can keep up with the latest opinions. The first is an online system that returns opinions quickly from a small number of pages on fixed domains. The second is a background system that increases the number of opinions by collecting a large number of blog pages: it queries several blog search engines without restricting the blog domain.
Because the collected pages come from heterogeneous sites, writing extraction patterns by hand for each site is impractical, so we use machine learning to extract the post content block. The large set of returned pages also contains many non-blog pages, which degrade extraction performance, so we train a blog/non-blog page classifier that reaches an F-measure of 90.7%. Filtering out non-blog pages raises post extraction performance by more than roughly 10% in F-measure. In addition, the content blocks and non-content blocks within a blog page are unevenly distributed (imbalanced data), and we apply several methods to handle this. Finally, to filter out posts with low relevance to the topic, we add expanded topic words to the original filtering method, which improves the F-measure by about 61%.
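Note on the metric: every result in the abstract is reported as an F-measure. The abstract does not say which variant was used, so the sketch below assumes the common balanced form, F = 2PR / (P + R), where P is precision and R is recall. The function name `f_measure` and the labels "blog" and "other" are hypothetical, chosen only for illustration; this is not the thesis's evaluation code.

```python
def f_measure(y_true, y_pred, positive="blog"):
    """Balanced F-measure (F1) for one positive class.

    Assumption: the thesis reports F1; the abstract does not confirm this.
    """
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for five crawled pages.
truth = ["blog", "blog", "blog", "other", "other"]
predicted = ["blog", "blog", "other", "blog", "other"]
score = f_measure(truth, predicted)
```

With these made-up labels there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F-measure is also 2/3.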
Appears in collections:	[Graduate Institute of Computer Science and Information Engineering] Master's and Doctoral Theses


All items in NCUIR are protected by the original copyright.
