Thesis/Dissertation 995302023 Full Metadata Record

DC Field Value Language
DC.contributor Department of Computer Science and Information Engineering, In-service Master Program zh_TW
DC.creator 鍾智宇 zh_TW
DC.creator Chih-Yu Chung en_US
dc.date.accessioned 2017-7-24T07:39:07Z
dc.date.available 2017-7-24T07:39:07Z
dc.date.issued 2017
dc.identifier.uri http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=995302023
dc.contributor.department Department of Computer Science and Information Engineering, In-service Master Program zh_TW
DC.description 國立中央大學 zh_TW
DC.description National Central University en_US
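dc.description.abstract With the rapid development of information technology and the Internet, and with mobile devices becoming increasingly widespread, obtaining everyday information from the web has become mainstream. However, effectively extracting useful information from large volumes of rich and diverse data has become a major challenge. Information Extraction has therefore gradually become a popular research topic; it consolidates unstructured data into structured data through steps such as sorting and filtering, so that useful information can then be extracted effectively. This study uses Machine Learning methods from information extraction to develop, for the "Food" board of "PTT", the largest bulletin board system (BBS, Bulletin Board System) in Taiwan, a method that automatically extracts restaurant-related information from articles and determines the restaurant type, making restaurant information faster and more convenient to obtain. The work is divided into three parts. The first part is restaurant information extraction: a PTT Crawler retrieves article data from the PTT Food board and stores it in a database for formatting, the data are analyzed manually to obtain an overview, and the articles are then scanned by keyword search to extract the article title, restaurant name, telephone number, address, and URL. The second part is restaurant type extraction: the analysis performed during preprocessing shows that 72.5% of restaurant types are implied in the article title, so article titles are used as the source for restaurant type extraction. After word segmentation with the CKIP system, 10,000 titles are randomly selected and, with reference to the segmentation results, the restaurant types implied in them are labeled manually. The labeled data are then used to train a model with WIDM_NER_TOOL, a tool developed by the WIDM laboratory that integrates conditional random fields (CRF, Conditional Random Field), together with the BIESO tagging scheme. In the final part, title data are fed into the trained model in supervised and semi-supervised learning experiments, and the results show that this method performs well for restaurant type extraction. zh_TW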
dc.description.abstract With the rapid development of Internet information technology and the popularity of mobile devices, obtaining information from web pages has become a trend, but extracting useful information from rich and diverse data has become a major challenge. Information extraction has gradually become a popular research topic; its main purpose is to integrate unstructured information into structured data through steps such as sorting and screening, so that useful information can be extracted effectively. In this study, we aim to develop a system that automatically extracts restaurant types from the FOOD board of PTT, the largest BBS site in Taiwan, using machine learning methods from information extraction, so that users can access restaurant information more quickly and conveniently. This paper is divided into three parts. The first part is pre-processing: we extract articles from the PTT FOOD board with the PTT Crawler and format the data; based on the extracted articles, we statistically analyze keywords and use them to extract the title, restaurant name, telephone number, address, and URL. The second part is restaurant type extraction: the pre-processing analysis shows that 72.5% of restaurant types are implied in the title, so we segment the extracted titles with the CKIP system and refer to the results for manual labeling. We use WIDM_NER_TOOL, which bundles the CRF++ package, with BIESO tags to train an extraction model on the labeled data; the trained model is then applied to input data to extract the restaurant type. The last part is the experiments: we use the labeled data for supervised learning and unlabeled data for semi-supervised learning to evaluate system performance. The experimental results show that this method performs well in restaurant type extraction. en_US
DC.subject Machine Learning zh_TW
DC.subject Named Entity Recognition zh_TW
DC.subject Tri-Training zh_TW
DC.subject Machine Learning en_US
DC.subject Named Entity Recognition en_US
DC.subject Tri-Training en_US
DC.title PTT網站餐廳美食類別擷取之研究 (A Study on Restaurant and Food Type Extraction from the PTT Website) zh_TW
dc.language.iso zh-TW zh-TW
DC.type Thesis/Dissertation zh_TW
DC.type thesis en_US
DC.publisher National Central University en_US
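
Note on the BIESO tagging scheme named in the abstract: the following is a minimal sketch of how a word-segmented title might be labeled before CRF training. It is not taken from the thesis or from WIDM_NER_TOOL; the function name `bieso_tags`, the span format, and the example title are all hypothetical and serve only to illustrate the Begin / Inside / End / Single / Outside labels.

```python
from typing import List, Tuple


def bieso_tags(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Assign B/I/E/S/O labels to tokens.

    `spans` lists entity spans as (start, end) token indices, end exclusive.
    This span format is an assumption made for the sketch.
    """
    tags = ["O"] * len(tokens)          # everything is Outside by default
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"           # single-token entity
        else:
            tags[start] = "B"           # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = "I"           # interior tokens
            tags[end - 1] = "E"         # last token of the entity
    return tags


# Hypothetical segmented title: "公館 / 好吃 / 的 / 義大利 / 麵"
# (Gongguan / tasty / [particle] / Italian / noodles); the restaurant
# type "義大利 麵" covers tokens 3..4.
tokens = ["公館", "好吃", "的", "義大利", "麵"]
print(bieso_tags(tokens, [(3, 5)]))  # ['O', 'O', 'O', 'B', 'E']
```

Each token and its label would then form one training row for a sequence labeler such as a CRF; the exact feature columns used in the thesis are not stated in this record.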
