Name: 周崇光 (Chung-Kuang Chou)
Department: Department of Communication Engineering
Thesis title: 線上RSS新聞資料流中主題性事件監測機制之設計與實作 (A Topic-based Event Monitor on Online RSS News Streams)
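- The author has agreed to make this electronic thesis openly available immediately.
- The released electronic full text is licensed only for users to search, read, and print for personal, non-profit academic research purposes.
- Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.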
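Abstract (Chinese) Online news services today commonly provide Really Simple Syndication (RSS) channels for users to subscribe to. Faced with so many RSS channels, however, users need a way to select and obtain the information they want efficiently, and this is the main challenge in designing an intelligent web information retrieval service.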
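This study takes RSS news streams as its news source and designs an RSS news data synchronization mechanism that operates between a local news database and remote RSS documents; the system then automatically monitors related news based on keywords the user sets in advance. Two monitoring mechanisms are proposed: Clustering Based on only Temporal Information (CBTI) and Time-Constrained TF-IDF Schemes (TCTIS). CBTI first uses the K-Means algorithm to cluster the items of a single RSS channel by publication time. It then uses the centroid time of each resulting cluster to establish cluster relationships across channels, and based on these relationships the system merges the clusters from different channels into a single result for the user to view. TCTIS, in contrast, performs news topic detection and tracking with an incremental TF-IDF/IWF model: the system notifies the user when it detects a new topic and keeps tracking reports related to existing topics, so that the user can later retrieve past coverage of those topics.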
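As a rough illustration of the CBTI idea described above, the following minimal sketch clusters each channel's publication timestamps with a one-dimensional K-Means and then joins clusters from different channels whose centroid times lie close together. It is not the thesis's implementation: the function names, the deterministic initialisation, the chaining rule, and the `max_gap` parameter are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def kmeans_1d(times: List[float], k: int, iters: int = 50) -> List[List[float]]:
    """Cluster timestamps (in seconds) into at most k groups with Lloyd's algorithm."""
    pts = sorted(times)
    k = min(k, len(pts))
    # Assumed initialisation: k evenly spaced points taken from the sorted list.
    centroids = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    clusters: List[List[float]] = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for t in pts:
            nearest = min(range(k), key=lambda c: abs(t - centroids[c]))
            clusters[nearest].append(t)
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return [c for c in clusters if c]


def merge_channels(
    channels: Dict[str, List[float]], k: int, max_gap: float
) -> List[List[Tuple[str, float]]]:
    """Cluster every channel on its own, then join clusters across channels
    when consecutive centroid times differ by at most max_gap seconds
    (hypothetical matching rule)."""
    entries = []
    for name, times in channels.items():
        for cluster in kmeans_1d(times, k):
            entries.append((sum(cluster) / len(cluster), name, cluster))
    entries.sort(key=lambda entry: entry[0])  # order clusters by centroid time

    merged: List[List[Tuple[str, float]]] = []
    last_centroid = None
    for c_time, name, cluster in entries:
        items = [(name, t) for t in cluster]
        if last_centroid is not None and c_time - last_centroid <= max_gap:
            merged[-1].extend(items)
        else:
            merged.append(items)
        last_centroid = c_time
    return merged


channels = {"a": [0, 10, 20, 1000, 1010], "b": [5, 15, 1005]}
merged = merge_channels(channels, k=2, max_gap=60)
```

Here the two bursts of activity (around t=10 and t=1005) each end up as one merged group that contains items from both channels. Because only timestamps are used, items that are close in time but unrelated in content would be grouped as well.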
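However, news texts are frequently corrected and adjusted, which makes the synchronization mechanism between the local database and remote RSS documents difficult to design. This study proposes cross-checking the content strings of the four sub-tags of an RSS item (title, description, link, and publication time) to determine which of two items is newer, thereby improving the reliability of the collected data. Furthermore, because RSS news items are short texts, the traditional incremental TF-IDF/IWF model does not cluster well when used for topic-based event monitoring on RSS news streams. This study therefore proposes a topic detection and tracking mechanism that takes time into account (the TCTIS mechanism) to strengthen the incremental TF-IDF/IWF model's topic detection and tracking on RSS news streams.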
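The cross-check on the four sub-tags mentioned above might look like the following minimal sketch. The record gives no decision rules, so the ones used here are assumptions: two items count as versions of the same story when their links match or when both title and description match, and the publication time then decides which version is newer. The `Item` class and `classify` function are hypothetical names; the sketch also assumes both date strings carry a time zone.

```python
from dataclasses import dataclass
from email.utils import parsedate_to_datetime


@dataclass(frozen=True)
class Item:
    title: str
    description: str
    link: str
    pub_date: str  # RFC 822 style date string, as used by RSS 2.0 <pubDate>


def classify(local: Item, remote: Item) -> str:
    """Return 'same', 'updated', 'stale', or 'different' for a (local, remote) pair."""
    if local == remote:
        return "same"
    same_link = local.link == remote.link
    same_text = (local.title == remote.title
                 and local.description == remote.description)
    # Assumed rule: otherwise the two items are unrelated stories.
    if not (same_link or same_text):
        return "different"
    local_time = parsedate_to_datetime(local.pub_date)
    remote_time = parsedate_to_datetime(remote.pub_date)
    # 'updated' means the remote copy should replace the local one.
    return "updated" if remote_time >= local_time else "stale"


old = Item("Typhoon nears Taiwan", "Storm expected Friday.",
           "http://example.com/n/1", "Mon, 26 Jul 2010 08:00:00 GMT")
new = Item("Typhoon nears Taiwan (update)", "Storm expected Saturday.",
           "http://example.com/n/1", "Mon, 26 Jul 2010 09:30:00 GMT")
other = Item("Obama visits Ohio", "The president spoke on jobs.",
             "http://example.com/n/2", "Mon, 26 Jul 2010 09:00:00 GMT")
```

With these sample items, `classify(old, new)` reports that the remote copy supersedes the local one, while `classify(old, other)` treats the pair as unrelated.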
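Finally, this study reports the problems encountered in practice when collecting RSS documents, as a reference for researchers interested in RSS who are collecting RSS document data or developing RSS reader software.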
Abstract (English) Online news providers now offer subscriptions to Really Simple Syndication (RSS) channels. Users who subscribe to many RSS channels, however, find it awkward to locate and follow interesting news items that are dispersed across separate channels. How to select and acquire the wanted information efficiently is a significant challenge in designing an intelligent news information retrieval system.
This thesis uses RSS news streams as news sources and proposes a news data synchronization mechanism that keeps the local news database in step with the remote RSS documents. The system then automatically monitors related news according to keywords the user has given in advance. Specifically, the proposal includes two complementary monitoring schemes: Clustering Based on only Temporal Information (CBTI) and Time-Constrained TF-IDF Schemes (TCTIS). CBTI uses the K-Means algorithm to cluster the RSS news items of each channel by their temporal information. It then uses the centroid time of each cluster in each channel to find the temporal relationship with clusters in the other channels. Finally, CBTI uses this relationship to construct a merged channel for the user to read. TCTIS, on the other hand, uses the incremental TF-IDF/IWF model for topic detection and tracking. When a news item reporting a new topic is detected, the mechanism notifies the user of the event and continues to track news items related to old topics, gathering all related items so that the user can read them later in an efficient and friendly way.
However, because news texts change frequently, the design of the news data synchronization mechanism further considers four specific sub-tags of each RSS item, namely title, description, link, and publication time, and compares every pair of items to discern their relation, for example which of the two is newer or whether both are the same. In addition, because an RSS news item is itself a short text, the clustering results of the traditional incremental TF-IDF/IWF model are not good enough. To cope with this problem, TCTIS improves performance by additionally taking the temporal factor into consideration.
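To make the role of the temporal factor concrete, here is a minimal sketch of online topic detection and tracking with an incremental TF-IDF model and a time window. It is an illustration under stated assumptions, not the TCTIS algorithm itself: the class name, the tokenizer, the IDF formula, the similarity threshold, and the window length are all hypothetical, and the IWF variant is not covered.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


class TimeConstrainedTracker:
    """Assign each incoming item to a recent topic or open a new one."""

    def __init__(self, threshold: float = 0.2, window: float = 86400.0):
        self.threshold = threshold   # assumed minimum cosine similarity
        self.window = window         # assumed maximum topic age in seconds
        self.doc_count = 0
        self.doc_freq: Counter = Counter()
        self.topics: List[dict] = []

    def _weights(self, counts: Counter) -> Dict[str, float]:
        # Incremental TF-IDF: document frequencies grow as items arrive.
        return {t: tf * math.log((self.doc_count + 1) / self.doc_freq[t])
                for t, tf in counts.items()}

    @staticmethod
    def _cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    def add(self, text: str, timestamp: float) -> Tuple[int, bool]:
        """Return (topic_id, is_new_topic) for one news item."""
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        self.doc_count += 1
        self.doc_freq.update(counts.keys())
        vec = self._weights(counts)

        best_id, best_sim = None, 0.0
        for topic_id, topic in enumerate(self.topics):
            # Time constraint: ignore topics not updated within the window.
            if timestamp - topic["last_seen"] > self.window:
                continue
            sim = self._cosine(vec, self._weights(topic["terms"]))
            if sim > best_sim:
                best_id, best_sim = topic_id, sim

        if best_id is not None and best_sim >= self.threshold:
            topic = self.topics[best_id]      # tracking: extend an old topic
            topic["terms"].update(counts)
            topic["last_seen"] = timestamp
            return best_id, False
        self.topics.append({"terms": Counter(counts), "last_seen": timestamp})
        return len(self.topics) - 1, True     # detection: a new topic


tracker = TimeConstrainedTracker(threshold=0.2, window=86400.0)
results = [
    tracker.add("Typhoon approaches Taiwan east coast", 0),
    tracker.add("Typhoon hits Taiwan east coast", 3600),
    tracker.add("Taiwan stocks close higher on export data", 7200),
    tracker.add("Typhoon approaches Taiwan east coast", 10 * 86400),
]
```

In this run the second item joins the first topic, the third opens a new topic, and the fourth opens another new topic even though its text repeats the first item, because the earlier topic is outside the time window. A caller would send a notification whenever `is_new_topic` is true.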
Furthermore, this study lists several practical points regarding RSS news gathering and RSS reader software development that interested researchers may find worth noting.
關鍵字(中) ★ 新聞資料流
★ 主題偵測
★ 主題追蹤
★ 新聞監測
★ RSS關鍵字(英) ★ topic tracking
★ topic detection
★ news stream
★ RSS
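★ news monitoring
Table of contents
Abstract (Chinese)
Abstract
Acknowledgements
Chapter 1. Introduction
1.1 Research Motivation and Objectives
1.2 Challenges
1.3 Really Simple Syndication (RSS) Document Format
1.4 The Relationship Between RSS and Users
1.5 Characteristics of RSS News Streams
1.6 Content and Contributions of This Study
Chapter 2. Related Work and Preliminaries
2.1 Topic Detection and Tracking
2.2 Text Representation
2.2.1 TF-IDF
2.2.2 Incremental TF-IDF Model
2.2.3 Incremental TF-IWF Model
2.2.4 Text Similarity Comparison
2.2.5 Similarity Comparison for Short Texts
2.3 RSS Data Updates
Chapter 3. System Architecture
3.1 Architecture Overview
3.2 Synchronizing the Local Database with Remote RSS Documents
3.2.1 Downloading
3.2.2 Parsing
3.2.3 Updating
3.3 Monitoring Mechanism: CBTI
3.3.1 Keyword Filtering
3.3.2 Clustering Within a Single RSS Channel
3.3.3 Matching Clusters Across Multiple RSS Channels
3.4 Monitoring Mechanism: TCTIS
3.4.1 Online Topic Detection and Tracking
Chapter 4. Experiments
4.1 Datasets
4.1.1 Keyword “Taiwan”
4.1.2 Keyword “Obama”
4.1.3 Manually Labeled Topics
4.2 Evaluation Criteria
4.3 Monitoring Mechanism: CBTI
4.3.1 Keyword “Taiwan”
4.3.2 Keyword “Obama”
4.3.3 Choice of the Parameter w
4.4 Monitoring Mechanism: TCTIS
4.4.1 Text Preprocessing
4.4.2 Keyword “Taiwan”
4.4.3 Keyword “Obama”
4.4.4 Recommended Parameter Values
4.5 Comparison
Chapter 5. Future Work
5.1 RSS-Related Issues
5.1.1 Update Mechanism
5.1.2 Whether Time Information Should Use an Item's Publication Time String or Its Arrival Time
5.1.3 Automatically Detecting the Correct Time Zone
5.2 Topic Detection and Tracking
5.2.1 Parameters
5.2.2 Multi-Topic Model
5.2.3 Determining Whether a News Item Reports on a Specific Event
5.2.4 Semantics of Monitored Keywords (Keyword Semantics)
5.2.5 Named Entities
5.2.6 Synonyms and Near-Synonyms
5.2.7 Latent Dirichlet Allocation
Chapter 6. Conclusion
References
Appendix: System Demonstration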
Advisor: 胡誌麟 (Chih-Lin Hu)
Approval date: 2010-7-27