應用強化式學習探勘活動來源網站;Event Source Page Discovery via Reinforcement Learning

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86632

Title:	應用強化式學習探勘活動來源網站;Event Source Page Discovery via Reinforcement Learning
Author:	廖于晴;Liao, Yu-Ching
Contributor:	Department of Computer Science and Information Engineering
Keywords:	Reinforcement Learning;Web Mining;Event Source Page Discovery;Event Source Page Classification
Date:	2021-08-04
Upload time:	2021-12-07 13:02:32 (UTC+8)
Publisher:	National Central University
Abstract:	As transportation has become more convenient, travel has become routine, and the way people travel has gradually changed: rather than simple sightseeing, travelers want to experience local customs and culture more deeply, and taking part in distinctive local events is one way to do so. However, searching the web for local events is a burden for people unfamiliar with an area, because government and private organizations alike usually post event information on their own websites, scattered across the WWW. We therefore want an intelligent crawler system that can automatically and efficiently discover and collect "event source pages". This thesis describes how we train an intelligent crawler model to discover a website's event source pages starting from the site's start page. Because the number of event source pages differs from site to site, the number of steps the crawler takes on each site is variable, and we also describe how to set a threshold that lets the model decide whether to stop exploring a site. The model is trained with reinforcement learning combined with multitask learning. Because we have only limited labeled data, we adopt a two-stage training framework: the first stage pre-trains the model on a small amount of labeled data, and the second stage fine-tunes it using unlabeled data and our "event source page classifier". With the proposed method, our crawler model achieves 74% precision on real-world data.;With the convenience of transportation, traveling is no longer only about sightseeing or taking professional photos but more about joining local events to experience local culture. Most event organizers, such as governments, enterprises and other organizations, post event information somewhere on their own websites. The problem of efficiently finding the page where event announcements are listed for any given website is called event source discovery. In this paper, we present a deep reinforcement learning model for training our event source discovery agent. We train our crawler in two stages, pre-training and fine-tuning. In the pre-training phase, the model is trained with limited labeled data, and each episode has a fixed number of time steps. In the fine-tuning phase, the agent is trained using unlabeled data and a reward system based on an event source page classifier. The agent learns whether to continue or stop exploring through an adaptive threshold, so the number of steps in each episode varies during fine-tuning. The proposed agent achieves 74% Return-On-Investment (i.e. precision) with a unit cost of 1.3 (the number of clicks for each event source page) on a real-world data set.
Appears in collections:	[Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations
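
The crawl-and-stop behavior and the two reported metrics can be illustrated with a minimal Python sketch. It is not the thesis implementation: `crawl_site`, `EpisodeResult`, the `score` function (a stand-in for the learned policy), the `is_event_source` function (a stand-in for the event source page classifier), and the fixed `stop_threshold` (the abstract describes an adaptive one) are all hypothetical. The metric formulas are one reading of the abstract's parenthetical definitions: precision as the share of clicks that land on an event source page, and unit cost as clicks per event source page found.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EpisodeResult:
    clicked: List[str]  # pages the agent visited, in click order
    hits: int           # visited pages judged to be event source pages

    @property
    def precision(self) -> float:
        # Assumed reading of "Return-On-Investment": hits per click.
        return self.hits / len(self.clicked) if self.clicked else 0.0

    @property
    def unit_cost(self) -> float:
        # Assumed reading of "unit cost": clicks per event source page found.
        return len(self.clicked) / self.hits if self.hits else float("inf")


def crawl_site(
    start: str,
    links: Dict[str, List[str]],
    score: Callable[[str], float],           # hypothetical stand-in for the policy
    is_event_source: Callable[[str], bool],  # hypothetical stand-in for the classifier
    stop_threshold: float,                   # fixed here; adaptive in the thesis
    max_steps: int = 20,
) -> EpisodeResult:
    visited = {start}
    frontier = set(links.get(start, [])) - visited
    clicked: List[str] = []
    hits = 0
    while frontier and len(clicked) < max_steps:
        # Pick the highest-scoring candidate; sorting makes ties deterministic.
        best = max(sorted(frontier), key=score)
        if score(best) < stop_threshold:
            break  # nothing promising is left, so stop exploring this site
        frontier.discard(best)
        visited.add(best)
        clicked.append(best)
        hits += int(is_event_source(best))
        frontier |= set(links.get(best, [])) - visited
    return EpisodeResult(clicked, hits)


# Toy site: every URL, score, and label below is made up for illustration.
toy_links = {
    "/": ["/news", "/events", "/about"],
    "/events": ["/events/2021", "/contact"],
}
toy_scores = {"/news": 0.4, "/events": 0.9, "/about": 0.1,
              "/events/2021": 0.8, "/contact": 0.05}
toy_event_pages = {"/events", "/events/2021"}

result = crawl_site("/", toy_links, toy_scores.get,
                    lambda url: url in toy_event_pages, stop_threshold=0.5)
print(result.clicked, result.precision, result.unit_cost)
```

With the threshold at 0.5 the toy agent stops after two clicks. Lowering it makes the agent keep clicking lower-scoring links, which trades precision for coverage and raises the unit cost.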

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	52

All items in NCUIR are protected by the original copyright.
