中文聚會活動來源探索暨上下文感知式精細資訊擷取之研究;Chinese Meetup Event Extraction via Event Source Page Discovery and Context-Aware Information Extraction


Please use this permanent URL to cite or link to this document: https://ir.lib.ncu.edu.tw/handle/987654321/98332

Title:	中文聚會活動來源探索暨上下文感知式精細資訊擷取之研究;Chinese Meetup Event Extraction via Event Source Page Discovery and Context-Aware Information Extraction
Author:	林圓皓;Lin, Yuan-Hao
Contributor:	Department of Computer Science and Information Engineering
Keywords:	Event source page discovery;Automatic pagination recognition;Wrapper induction;Boilerplate removal;Event detection;Meetup event extraction
Date:	2025-07-26
Upload time:	2025-10-17 12:38:27 (UTC+8)
Publisher:	National Central University
Abstract:	Automatically extracting meetup event information from the web can make event discovery considerably more convenient. Existing approaches either rely on open APIs provided by Event-Based Social Networks (EBSNs) to capture meetup event data for specific regions and topics, or crawl the web at large and filter the results for meetup events; both have inherent limitations. In this study, we propose a novel five-stage framework for automatically extracting meetup event information from event organizers' websites and academic department websites. The five stages are event source page discovery, automatic pagination recognition, boilerplate removal, event detection, and meetup event extraction.
Starting from Facebook event pages, we collected potential event organizer websites and separately collected academic department websites. In total, we established 19,013 event source record set APIs and crawled them on a 24-hour schedule between July 13, 2023, and June 17, 2025, cumulatively retrieving links to 913,853 posting pages. The boilerplate removal module extracted 404,497 messages from these pages, and the event detection module identified 99,833 of them as event messages. On these event organizer websites, event pages accounted for 11% of the total, considerably higher than the 1% event page rate that Wang et al. of the Google research team found on general websites, which indicates the cost-effectiveness of our method. Finally, the meetup event extraction module extracted 73,913 meetup events.
We address three problems: (1) automatically establishing and periodically crawling event sources, (2) automatically extracting event details from posting pages, and (3) searching and analyzing events. First, because web information changes rapidly and is widely dispersed, we propose an event source page discovery strategy and an automatic pagination recognition model that locate update pages and consolidate them into event sources that can be crawled over the long term. These sources are then crawled regularly for the latest posting pages using wrapper induction and a scheduling mechanism. Second, to handle heterogeneous and noisy web posts, we combine boilerplate removal, event detection, and meetup event extraction into a fine-grained information extraction pipeline. The pipeline extracts key fields such as event title, venue, and start and end dates, and packages the results as structured data that serves as the basis for subsequent search and analysis. Finally, to improve the usability and practical value of the data, we build an event search service with a varied and intuitive visual interface. We also introduce classification models for event type and age group, whose semantic tags support multi-criteria retrieval and trend analysis and help users and businesses discover and interpret event information.
Experimental results show that the proposed end-to-end framework can effectively construct a structured event database in a large-scale Chinese web environment.
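The counts reported in the abstract form a funnel, and the ratios between stages can be recomputed directly. The sketch below is only illustrative arithmetic on those four figures. The abstract does not state which denominator underlies the 11% event page rate, so treating it as event messages over posting page links is an assumption, and the names `FUNNEL` and `funnel_rates` are hypothetical.

```python
# Counts as reported in the abstract (2023-07-13 to 2025-06-17).
FUNNEL = [
    ("posting page links", 913_853),
    ("messages after boilerplate removal", 404_497),
    ("event messages after event detection", 99_833),
    ("extracted meetup events", 73_913),
]


def funnel_rates(funnel):
    """Return rows of (label, count, % of previous stage, % of first stage)."""
    first = funnel[0][1]
    prev = first
    rows = []
    for label, count in funnel:
        rows.append((label, count, 100 * count / prev, 100 * count / first))
        prev = count
    return rows


if __name__ == "__main__":
    for label, count, of_prev, of_first in funnel_rates(FUNNEL):
        print(f"{label:<38} {count:>9,d} {of_prev:6.1f}% {of_first:6.1f}%")
```

Under that assumption, the share of event messages among posting page links rounds to the 11% quoted in the abstract.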
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	60

All items in NCUIR are protected by original copyright.
