Abstract: | With the rise of social media, users increasingly express their positions, comment on opinions, and share posts on these platforms. Social media emphasizes the immediate spread of information, so posts arrive as a continuous stream, and it becomes a major challenge for users to quickly identify the currently popular topics and the events that other users care about within such a large volume of information. Topic Detection and Tracking (TDT) on social media has therefore become a popular research area. Traditional TDT research mainly targets highly structured articles such as news articles; this research takes Facebook as the research platform and applies TDT to short posts on public fan pages.
The purpose of this research is to let users grasp the events under a topic more quickly. Using data visualization, the designed framework is examined against the five capabilities that a TDT system should have (story segmentation, first story detection, cluster detection, tracking, and story link detection) through real news examples, and its commercial uses are explained. The system workflow is divided into three stages. Data collection and extraction: post information from public fan pages is retrieved through the Facebook Graph API, and posts are mapped to specific topics by keyword matching. Data analysis: Incremental TF-DF extracts the core feature terms of each post while avoiding an excessively high term dimension; k-medoids document clustering, combined with an algorithm that adaptively determines the number of clusters, then groups the posts automatically to identify events. Data presentation: cluster analysis and data visualization techniques present the analysis results at scale. |
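
The data analysis stage can be illustrated with a minimal sketch. It is not the thesis implementation, and several parts are assumptions: each post is assumed to be already reduced to a set of core terms (the Incremental TF-DF step is not reproduced), Jaccard distance between term sets is an assumed dissimilarity, and the mean silhouette score stands in for the adaptive cluster-number algorithm, which the abstract does not specify. All function names and the sample posts are hypothetical.

```python
def jaccard_distance(a, b):
    """Assumed dissimilarity between two posts given as sets of terms."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def assign(dist, medoids):
    """Label every post with the index of its nearest medoid."""
    return [min(medoids, key=lambda m: dist[i][m]) for i in range(len(dist))]

def k_medoids(dist, k, max_iter=50):
    """Alternating k-medoids on a precomputed distance matrix (requires k <= n)."""
    n = len(dist)
    medoids = [0]
    while len(medoids) < k:
        # Deterministic farthest-first initialisation (an assumption, not from the source).
        candidates = (i for i in range(n) if i not in medoids)
        medoids.append(max(candidates, key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if labels[i] == m]
            # The new medoid is the member with the smallest total distance to its cluster.
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members), default=m)
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

def silhouette(dist, labels):
    """Mean silhouette score; posts in singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(dist[i][j] for j in own) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

def choose_k(dist, max_k):
    """Stand-in for adaptive cluster-number selection: keep the k with the best silhouette."""
    best = None
    for k in range(2, max_k + 1):
        medoids, labels = k_medoids(dist, k)
        score = silhouette(dist, labels)
        if best is None or score > best[0]:
            best = (score, k, medoids, labels)
    return best[1], best[2], best[3]

# Hypothetical posts, each already reduced to its core terms.
posts = [
    {"election", "vote", "candidate"},
    {"election", "vote", "poll"},
    {"election", "candidate", "poll"},
    {"typhoon", "rain", "flood"},
    {"typhoon", "rain", "wind"},
    {"typhoon", "flood", "wind"},
]
dist = [[jaccard_distance(p, q) for q in posts] for p in posts]
best_k, medoids, labels = choose_k(dist, max_k=4)
print(best_k, medoids, labels)
```

Here each resulting cluster plays the role of one detected event, and its medoid is an actual post that can serve as the event's representative. Medoids suit short posts because the method needs only pairwise distances, not a meaningful average of term vectors.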