    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/8717


    Title: 以網頁識別及清理改善資料擷取的研究;Web page Classification and Cleaning for Information Extraction
    Authors: 劉仁宇;Jen-Yu Liu
    Contributors: 資訊工程學系碩士在職專班 (In-service Master's Program, Department of Computer Science and Information Engineering)
    Keywords: Information extraction; machine learning
    Date: 2006-07-07
    Issue Date: 2009-09-22 11:33:29 (UTC+8)
    Publisher: 國立中央大學圖書館 (National Central University Library)
    Abstract: As Internet use becomes widespread and the volume of available information keeps growing, the main problem users face is not how much information exists, but whether the extracted data actually matches what they need. Web information extraction commonly runs into two difficulties that reduce the accuracy of the results. First, irrelevant data appears outside the target areas. Second, noise is mixed in with the desired content inside the target areas. In addition, the target content itself may not be identified completely, because the words lack strict grammar and clear separators between them. This thesis therefore applies web page cleaning techniques to improve the accuracy of information extraction. We use a Support Vector Machine (SVM) classifier together with page cleaning techniques to produce the input pages for extraction. For the extraction itself, we use the SoftMealy extractor, which generates extraction rules with a rule induction algorithm. Based on this idea, we propose CBIE (Cleaning Based Information Extraction). In our experiments, we took conference websites listed on DBWorld whose accepted-paper announcement dates had been confirmed, identified the pages containing the accepted papers, and then extracted paper titles and authors from the cleaned pages. The results show a considerable improvement: extraction performance is higher on the cleaned pages than on the original web pages, which supports the feasibility of page cleaning.
    Appears in Collections: [資訊工程學系碩士在職專班] Theses and Dissertations

