Thesis 995302023: Detailed Record




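Name 鍾智宇 (Chih-Yu Chung)   Department Department of Computer Science and Information Engineering, In-service Master Program
Thesis Title A Study on Restaurant Category Extraction from the PTT Website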
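File access
  1. This electronic thesis is authorized for immediate open access.
  2. The open-access electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
  3. Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.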

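Abstract (translated from Chinese) With the rapid development of information technology and the Internet, together with the growing prevalence of mobile devices, obtaining everyday information from the web has become mainstream. However, effectively extracting useful information from large volumes of rich and diverse data is a major challenge. Information extraction has therefore become an active research topic. Its main task is to consolidate unstructured data into structured data through steps such as sorting and filtering, so that useful information can be extracted effectively. This study applies machine learning methods from information extraction to the "Food" board of PTT, the largest bulletin board system (BBS) in Taiwan, to develop an automated method that extracts restaurant-related information from posts and determines the restaurant category, making restaurant information faster and more convenient to obtain.
This thesis is organized into three parts. The first part is the extraction of restaurant-related information: a PTT Crawler collects posts from the PTT Food board and stores them in a database for formatting, the data are analyzed manually to get an overview, and the posts are then scanned by keyword search to extract the post title, restaurant name, telephone number, address, and URL. The second part is restaurant category extraction. The analysis done during preprocessing showed that 72.5% of restaurant categories are implied in the post title, so titles are used as the extraction source. After word segmentation with the CKIP system, 10,000 titles were randomly selected and, with reference to the segmentation results, the restaurant categories implied in them were labeled manually. The labeled data were then used to train a model with WIDM_NER_TOOL, a tool developed by the WIDM lab that integrates conditional random fields (CRF), together with the BIESO tagging scheme. In the last part, title data are fed to the trained model in supervised and semi-supervised learning experiments, and the results show that this method achieves good performance on restaurant category extraction.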
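The keyword scan described in the first part can be pictured with a minimal sketch. The field labels in `FIELD_KEYWORDS`, the function name `extract_fields`, and the sample post are illustrative assumptions; the abstract does not list the keywords the thesis actually used.

```python
import re

# Hypothetical field labels; the thesis does not list its actual keywords.
FIELD_KEYWORDS = {
    "name": ["餐廳名稱", "店名"],
    "phone": ["電話"],
    "address": ["地址"],
    "url": ["網址", "官網"],
}


def extract_fields(post_text):
    """Scan each line for a keyword followed by a colon.

    Keeps the first value found for each field and ignores later matches.
    """
    fields = {}
    for line in post_text.splitlines():
        for field, keywords in FIELD_KEYWORDS.items():
            if field in fields:
                continue
            for kw in keywords:
                # Accept both full-width and ASCII colons after the keyword.
                m = re.search(re.escape(kw) + r"\s*[::]\s*(.+)", line)
                if m:
                    fields[field] = m.group(1).strip()
                    break
    return fields


# Made-up sample post, not taken from PTT.
sample = (
    "餐廳名稱:好吃拉麵\n"
    "電話: 02-1234-5678\n"
    "地址:台北市中正區某某路1號\n"
    "網址:http://example.com\n"
)
result = extract_fields(sample)
print(result)
```

A line-by-line scan like this only works when posts follow a labeled template; free-form posts would need a different approach.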
Abstract (English) With the rapid development of information technology and the Internet and the growing popularity of mobile devices, obtaining information from web pages has become a trend, but extracting useful information from rich and diverse data remains a major challenge. Information extraction has therefore become a popular research topic. Its main purpose is to integrate unstructured data into structured data through steps such as sorting and screening, so that useful information can be extracted effectively. In this study, we apply machine learning methods from information extraction to develop a system that automatically extracts restaurant types from the Food board of PTT, the largest BBS site in Taiwan, so that users can access restaurant information more quickly and conveniently.
This thesis is divided into three parts. The first part is preprocessing: we collect articles from the PTT Food board with the PTT Crawler and format the data, then scan the collected articles by keyword to extract the title, restaurant name, telephone number, address, and URL. The second part is restaurant type extraction. The preprocessing analysis shows that 72.5% of restaurant types are implied in the title, so we segment the extracted titles with the CKIP system and refer to the segmentation results for manual labeling. We then use WIDM_NER_TOOL, which bundles the CRF++ package, together with BIESO tags to train an extraction model on the labeled data; the trained model is used to extract the restaurant type from input titles. The last part is the experiments: we use the labeled data for supervised learning and unlabeled data for semi-supervised learning to evaluate system performance. The experimental results show that this method performs well on restaurant type extraction.
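To make the BIESO labeling step concrete, here is a small sketch, assuming BIESO has its usual reading (Begin, Inside, End, Single, Other). The function name `bieso_tags`, the segmented title, and the chosen category span are hypothetical; the one-token-per-line, label-in-last-column layout follows the general column format used by CRF++-style toolkits and is not confirmed to be the exact input format of WIDM_NER_TOOL.

```python
def bieso_tags(tokens, span_start, span_end):
    """Tag tokens[span_start:span_end] as one entity; all other tokens get O."""
    tags = ["O"] * len(tokens)
    length = span_end - span_start
    if length == 1:
        # A one-token entity is marked Single.
        tags[span_start] = "S"
    elif length > 1:
        tags[span_start] = "B"
        for i in range(span_start + 1, span_end - 1):
            tags[i] = "I"
        tags[span_end - 1] = "E"
    return tags


# Hypothetical segmented title, as a word segmenter such as CKIP might return it.
title_tokens = ["[食記]", "台北", "日式", "拉麵", "好吃"]

# Assume an annotator marked tokens 2..3 ("日式 拉麵") as the restaurant category.
tags = bieso_tags(title_tokens, 2, 4)

# One token per line, label in the last column.
rows = "\n".join(f"{tok}\t{tag}" for tok, tag in zip(title_tokens, tags))
print(rows)
```

Compared with plain BIO tags, the extra E and S labels let the sequence model learn where a category ends, which can help with short spans inside short titles.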
Keywords ★ Machine Learning
★ Named Entity Recognition
★ Tri-Training
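Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Motivation
1-2 Research Background and Limitations
1-3 Chapter Overview
2. Related Work
2-1 Chinese Organization Named Entity Recognition
2-2 Supervised Learning
2-3 Semi-Supervised Learning
3. Design and Implementation
3-1 Related Information Extraction
3-2 Restaurant Category Extraction
3-2-1 Extraction Source
3-2-2 CKIP Word Segmentation and Manual Data Labeling
3-2-3 Feature Extraction
3-2-4 Training and Testing Process
4. Experimental Results and Analysis
4-1 Evaluation Method
4-2 Experiments and Analysis
4-2-1 Feature Mining
4-2-2 Supervised Experiment
4-2-3 Semi-Supervised Experiment
5. Conclusion and Future Work
References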
Advisor 張嘉惠 (Chia-Hui Chang)   Approval Date 2017-7-24