會議公告網站資訊擷取之研究; Conference Information Extraction: Segmentation Base Approach

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/8697

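Title:	會議公告網站資訊擷取之研究; Conference Information Extraction: Segmentation Base Approach
Author:	胡姝涵; Shu-Han Hu
Contributor:	資訊工程學系碩士在職專班 (In-service Master's Program, Department of Computer Science and Information Engineering)
Keywords:	資訊擷取; 機器學習; Information Extraction; Machine Learning
Date:	2006-07-07
Uploaded:	2009-09-22 11:33:09 (UTC+8)
Publisher:	國立中央大學圖書館 (National Central University Library)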
Abstract:	As information technology advances, the speed and convenience of the Internet have led web pages to gradually replace traditional paper-based presentation of data. The richness and variety of web page presentation, however, make extracting useful information effectively a major challenge. Information extraction techniques take unstructured data and, by organizing, filtering, and integrating it, turn it into structured data from which useful information can be extracted. The most direct way to build an extraction system is to hand-write extraction rules tailored to each website. Because a site's format may change at any time, and sites built by different authors differ in format, a separate extraction program must be written or revised for each case, which is not cost-effective. Extracting information automatically across different website formats is therefore the main goal in designing an extraction program. Automated extraction relies on machine learning: the computer learns knowledge and extraction rules from past experience so that it can extract the correct information on its own.

This thesis focuses on announcement pages for international conferences, drawn from DB World and international conference web pages, and extracts the conference information posted there: conference name, location, date, and paper acceptance date. Conference announcements are mostly plain text, written by many different posters in short, colloquial language with no common structure, which makes integrating and extracting the information difficult. The thesis applies machine learning so that the system learns to extract conference information from announcements by different posters automatically, and the resulting system performs well at extracting the specified fields across websites.
Appears in collections:	[資訊工程學系碩士在職專班] 博碩士論文 (theses and dissertations)

All items in NCUIR are protected by the original copyright.
