A Study on Extracting Restaurant Food Categories from the PTT Website

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/74641

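Title:	A Study on Extracting Restaurant Food Categories from the PTT Website
Author:	Chung, Chih-Yu (鍾智宇)
Contributor:	In-service Master's Program, Department of Computer Science and Information Engineering
Keywords:	Machine Learning; Named Entity Recognition; Tri-Training
Date:	2017-07-24
Upload time:	2017-10-27 14:34:37 (UTC+8)
Publisher:	National Central University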
Abstract:	With the rapid development of information technology and the Internet, and with mobile devices becoming ubiquitous, obtaining everyday information online has become mainstream. Effectively extracting useful information from large, rich, and diverse data is a major challenge, however, and information extraction has therefore become a popular research topic. Its main task is to turn unstructured data into structured data through steps such as sorting and screening, so that useful information can be extracted effectively.
This study applies machine learning methods from information extraction to the Food board of PTT, the largest bulletin board system (BBS) in Taiwan, to develop a method that automatically extracts restaurant-related information from articles and determines the restaurant type, so that users can obtain restaurant information more quickly and conveniently. The work is divided into three parts. The first part extracts restaurant-related information: a PTT crawler collects articles from the PTT Food board and stores them in a database for formatting, the data are analyzed manually to get an overview, and keyword search is then used to scan each article and extract the title, restaurant name, telephone number, address, and URL. The second part extracts the restaurant type. The preprocessing analysis shows that 72.5% of restaurant types are implied in the article title, so titles are used as the extraction source. The titles are segmented with the CKIP system and, with reference to the segmentation results, 10,000 randomly selected titles are manually labeled with the restaurant types they contain. The labeled data are then used to train an extraction model with WIDM_NER_TOOL, a tool developed by the WIDM laboratory that bundles the CRF++ package for conditional random fields (CRF), together with the BIESO tagging scheme. The third part covers the experiments: title data are fed into the trained model, with labeled data used for supervised learning and unlabeled data used for semi-supervised learning, to evaluate system performance. The experimental results indicate that this method performs well for restaurant type extraction.
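As a rough illustration of the BIESO tagging scheme mentioned in the abstract, the minimal Python sketch below converts labeled entity spans over a segmented title into per-token tags (B = begin, I = inside, E = end, S = single-token entity, O = outside). The function name `bieso_tags`, the span representation, and the example title are assumptions made for illustration only; the record does not describe the actual input format of WIDM_NER_TOOL, nor whether tagging is done per word or per character.

```python
from typing import List, Tuple


def bieso_tags(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Return one B/I/E/S/O tag per token.

    `spans` holds (start, end) token indices of labeled entities,
    with `end` exclusive. Spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)           # default: outside any entity
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"            # single-token entity
        else:
            tags[start] = "B"            # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = "I"            # interior tokens
            tags[end - 1] = "E"          # last token of the entity
    return tags


# Hypothetical segmented title; tokens 2..3 are labeled as the restaurant type.
tokens = ["[Review]", "Taipei", "Japanese", "ramen", "shop"]
for token, tag in zip(tokens, bieso_tags(tokens, [(2, 4)])):
    print(f"{token}\t{tag}")
```

Token/tag pairs written one per line like this resemble the column-style training files commonly used with CRF toolkits, though the exact feature columns used in the thesis are not stated here.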
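Appears in collections:	[In-service Master's Program, Department of Computer Science and Information Engineering] Master's and Doctoral Theses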

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	575

All items in NCUIR are protected by original copyright.
