dc.description.abstract | Online news providers now offer subscriptions to Really Simple Syndication (RSS) channels. Users who subscribe to many RSS channels, however, find it awkward to locate and follow, at the same time, interesting news items that are dispersed across separate channels. Selecting and acquiring the wanted information efficiently is therefore a significant challenge in designing an intelligent news information retrieval system.
This thesis uses RSS news streams as news sources and proposes a news data synchronization mechanism that keeps the local news database synchronized with the remote RSS documents. The proposed mechanism can then automatically monitor related news in response to users’ predefined keywords. Specifically, the proposal includes two complementary monitoring schemes: the Clustering Based on only Temporal Information (CBTI) scheme and the Time-Constrained TF-IDF (TCTIS) scheme. CBTI uses the K-Means algorithm to cluster the RSS news items in each channel according to their temporal information. It then uses the centroid time of each cluster in each channel to find the temporal relationships among clusters across multiple channels. Finally, CBTI uses these relationships to construct a merged channel for the user to read. TCTIS, on the other hand, uses the incremental TF-IDF/IWF model to perform topic detection and tracking. When a news item reporting a new topic is detected, the mechanism can notify users of the event, and it continually tracks news items related to existing topics, thereby gathering all related items so that users can later read them in an efficient and friendly way.
However, because news texts change frequently, the design of the news data synchronization mechanism further considers four specific labels inside the news content and compares every pair of items to discern their relation, for example, whether one is newer than the other or both are the same. In addition, because an RSS news item is itself a short text, the clustering results based on the traditional incremental TF-IDF/IWF model are not good enough. To cope with this problem, TCTIS improves performance by additionally taking the temporal factor into account.
Furthermore, this study lists several practical points regarding RSS news gathering and RSS reader software development, which should be of interest to researchers working in this area.
| en_US |