A Study of Blog Opinion Retrieval and Summarization; Opinion and Sentiment Analysis from Blogosphere for Social Mining

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/57357

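Title:	A Study of Blog Opinion Retrieval and Summarization; Opinion and Sentiment Analysis from Blogosphere for Social Mining
Author:	張嘉惠
Contributor:	Department of Computer Science and Information Engineering, National Central University
Keywords:	Information Science -- Software
Date:	2009-09-01
Upload time:	2012-10-01 15:19:06 (UTC+8)
Publisher:	National Science Council, Executive Yuan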
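Abstract:	In the blogosphere, which embodies the Web 2.0 concept, much of the content expresses users' personal opinions. Although these opinions are not established knowledge or truth, they are quite useful information. For example, other people's opinions can serve as a reference when buying a product; customer feedback can guide improvements in product design; celebrities often pay attention to their reputation online; and businesses want to know how their products are likely to sell. Thanks to the popularity of personal blogs, such information, previously hard to obtain, now lets us offer opinion surveys on topics like those above. The goal of this project is to build an On-Line Opinion Summarization System based on opinionated blog posts. For a user-given topic, it presents opinion summaries at different levels through synchronous and asynchronous transmission modes, and it provides statistical charts on opinion holders so that the person querying can understand where the opinions come from. The first year of the project mainly implements the overall system prototype and presents the first-level opinion summary. We will use existing blog search engines (Google Blog Search, Technorati, etc.) to find blog posts containing the query topic, extract the actual body text from each web page by stripping away the noise, and then perform opinion retrieval and polarity classification. In addition, to obtain positive and negative training data efficiently, we will supplement existing sentiment lexicons with wrapper techniques that collect positive and negative reviews from opinion-sharing sites (epinions, Amazon, RateitAll, Complaints, etc.). The second year mainly carries out aspect analysis and delivers the results to the client automatically over a background connection through an AJAX interface, presenting the second-level opinion summary. Beyond opinion summaries, the system will look more deeply into blog authors' background data (gender, nationality, etc.) to present information along different dimensions for exploring topic trends. For authors who do not provide this information, we will predict their background from a trained model. The third year centers on cloud computing, managing the system's data and users' history records effectively so that they can be queried and used more efficiently later. We will also extend the system to Chinese blogs, integrating our laboratory's earlier techniques for Chinese text segmentation and unknown-word handling and applying them to Chinese opinion extraction and summarization. We expect to run experiments on a wide range of popular Chinese and English topics (3C products, politics, social events, etc.), using traditional information retrieval metrics (precision, recall, mean average precision) to evaluate effectiveness and system response time to evaluate efficiency. ; Blogs, a representative Web 2.0 service, provide a platform for users to publish personal opinions. Collecting such opinions from the public can yield what is called “crowd wisdom”. For instance, individuals may refer to others’ opinions when buying a specific product; manufacturers may improve their merchandise based on customer feedback; celebrities are constantly concerned about their reputation on the web; and businesses would like to estimate a product's sales. Thanks to blogs, opinions that were hard to obtain in the past can now be easily harvested from the blogosphere. The goal of this study is to build an on-line opinion summarization system through opinion and sentiment analysis of blog posts. We plan to present “instant” opinion excerpts from blogs at an overview level, and fine-grained aspect summarization through an asynchronous connection, since the latter service requires more computation time.
We will also analyze the sources of opinions by collecting blog author profiles, which lets us provide statistical charts that can yield interesting information for those browsing the results. In the first year, we will implement the first-level opinion summarization, which involves opinion retrieval and polarity identification techniques. We will employ existing blog search engines (Google Blog Search, Technorati, etc.) to retrieve blog posts relevant to the given topics and then extract the genuine content by removing site templates and advertisements. To enable supervised sentiment classification, we will take advantage of existing opinion-sharing web sites (e.g., epinions, Amazon, RateitAll, and Complaints) and apply wrapper induction to automatically extract positive and negative reviews as training data. In the second year, we will implement the second-level opinion summarization module, which adopts aspect analysis techniques (e.g., LDA) to investigate the latent topics and displays the result via an asynchronous connection (AJAX). In addition to opinion summaries, our system also provides statistics on the opinion sources (blog authors) along various dimensions, e.g., sex, age, time, and location. For blogs lacking author profiles, we will predict the profiles with a model built from known authors. In the third year, we will deploy our framework using cloud computing to save time in data collection and query processing. We will also extend our system to Chinese documents by integrating our previous work, including the Chinese word segmentation and unknown word extraction modules, into our on-line opinion summarization system. For system evaluation, we will employ a set of entity-based and event-based query topics. Traditional information retrieval metrics (e.g., precision, recall, and mean average precision) and response time will serve as measures of effectiveness and efficiency, respectively. ; Research period: 9808 ~ 9907 (ROC calendar, i.e., August 2009 to July 2010)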
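Relation:	Science and Technology Policy Research and Information Center, National Applied Research Laboratories
Appears in collections:	[Department of Computer Science and Information Engineering] Research Projects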
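
The abstract names precision, recall, and mean average precision as the effectiveness measures. The following is a minimal sketch of how those three metrics are commonly computed for ranked retrieval results. It is not taken from the project itself; the function names and the document identifiers in the usage note are illustrative assumptions.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall for a single query.

    `retrieved` is the list of returned document ids (hypothetical ids),
    `relevant` is the set of ids judged relevant for the query.
    """
    unique = set(retrieved)
    hits = len(unique & relevant)
    precision = hits / len(unique) if unique else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one ranked list.

    Sums precision@k at every rank k where a relevant document appears,
    then divides by the total number of relevant documents.
    Assumes `ranked` contains no duplicate ids.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(runs: Sequence[Tuple[List[str], Set[str]]]) -> float:
    """Mean of average precision over several (ranked list, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, with the ranked list `["d1", "d2", "d3", "d4"]` and the relevant set `{"d1", "d3", "d5"}`, precision is 0.5, recall is 2/3, and average precision is (1 + 2/3) / 3, roughly 0.556. Dividing by the number of relevant documents, rather than the number retrieved, penalizes relevant documents that the system never returns.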

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	341

All items in NCUIR are protected by original copyright.
