Multipage schema induction based on single-page data extraction and schema matching

Name

張智鈞 (Chih-Chun Chang)

Department

Department of Computer Science and Information Engineering

Thesis Title

基於單網頁資料提取及綱要匹配之多網頁資料提取
(Multipage schema induction based on single-page data extraction and schema matching)

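Related Theses

★ A study on recognizing schedule-invitation emails and extracting irregular time expressions
★ Design of the NCUFree campus wireless network platform and development of application services
★ Design and implementation of a semi-structured data extraction system for the Internet
★ Mining and applications of non-simple browsing paths
★ Improvements to association rule mining on incremental data
★ Applying the chi-square test of independence to associative classification
★ Design and study of a Chinese information extraction system
★ Visualization of non-numerical data and clustering that combines subjective and objective criteria
★ A study of associated word groups in document summarization
★ Cleaning web pages: page segmentation and data-region extraction
★ Design and study of a question answering system using sentence classification and ranking
★ Efficient mining of compact frequent sequential event patterns in temporal databases
★ Axis arrangement in star coordinates for cluster visualization
★ Automatic generation of web crawlers from browsing history
★ A study of template and data analysis for dynamic web pages
★ A study on automating the integration of data from homogeneous web pages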

Files

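The author has authorized this electronic thesis for immediate open access.
Once the full text is open, users are licensed only to search, read, and print it for personal, non-profit academic research.
Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast the text without permission.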

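Abstract (translated from Chinese)

The World Wide Web (WWW) is the dominant medium for disseminating information today, and many applications rely on web data extraction to support information integration services. Many unsupervised extraction methods have been proposed, but each family has a limitation. Single-page methods that extract multiple records (such as MDR) can only process the record sets (RecordSet) within one page and cannot take the overall structure into account. Multi-page alignment methods (such as DCADE) can separate template from data by comparing several pages, but their identification of record sets is often not robust, so they frequently fail to complete the extraction task on complex websites. This study combines the strengths of both approaches: MDR first extracts record sets from each individual page, and the per-page results then go through three subtasks: record set matching (Recordset Matching), column alignment (Column Alignment), and non-record-set matching (NonRecordset Matching). For record sets, we use BERT sentence representations to compute a representation of each record in a record set and combine them with schema matching to match record sets across pages; KNN and SVM classifiers then carry out column alignment. For non-record-set content, we rely on the multi-page attribute alignment capability of the DCADE multi-page extraction method. Finally, the two results are merged to achieve multi-page data extraction.
In addition to the ExAlg and WEIR datasets, we provide the Announcement List Website (ALW) dataset, a collection of website news and announcement pages used to test automatic extraction of news and announcement lists. Experimental results show that the proposed method, DEVOSM (Data Extraction via On-the-fly Schema Matching), improves the record-set extraction rate by 55.6%, 60%, and 33.7% on the ExAlg, WEIR, and ALW datasets respectively, demonstrating its effectiveness.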
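
The record set matching step described in the abstract can be illustrated with a small, self-contained sketch. It is not the thesis implementation: a bag-of-tokens count vector stands in for the BERT sentence representation, a greedy one-to-one assignment stands in for whatever matcher the thesis uses, and the names `embed`, `cosine`, `match_record_sets`, the `threshold` value, and the toy pages are all hypothetical.

```python
import math
from collections import Counter


def embed(record_set):
    """Stand-in for a sentence embedding: token counts over all records.

    Assumption: the thesis uses BERT sentence representations here; a count
    vector only keeps this sketch dependency-free.
    """
    counts = Counter()
    for record in record_set:
        counts.update(record.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match_record_sets(page_a, page_b, threshold=0.3):
    """Greedily pair record sets of two pages by descending similarity.

    Each record set is used at most once; pairs scoring below the
    (hypothetical) threshold are left unmatched.
    """
    vec_a = {name: embed(rs) for name, rs in page_a.items()}
    vec_b = {name: embed(rs) for name, rs in page_b.items()}
    scored = sorted(
        ((cosine(vec_a[i], vec_b[j]), i, j) for i in vec_a for j in vec_b),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), {}
    for score, i, j in scored:
        if score < threshold:
            break
        if i in used_a or j in used_b:
            continue
        matches[i] = j
        used_a.add(i)
        used_b.add(j)
    return matches


# Toy record sets, as a single-page extractor might return them per page.
page_a = {
    "a_news": ["2022-09-01 library closed", "2022-08-15 new course announced"],
    "a_links": ["home about contact", "home login help"],
}
page_b = {
    "b_menu": ["home about sitemap"],
    "b_news": ["2022-07-30 course registration announced", "2022-07-01 library hours"],
}

matches = match_record_sets(page_a, page_b)
print(matches)
```

Replacing `embed` with a real sentence encoder and the greedy loop with a maximum-weight matching would keep the same overall shape: a similarity matrix between record sets, followed by a one-to-one assignment.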

Abstract (English)

The World Wide Web (WWW) is the mainstream medium of modern information dissemination, and many application services rely on web data extraction technology to support information integration services. Although many unsupervised data extraction methods have been proposed, single-page methods that extract multiple records (such as MDR) can only process the record sets (RecordSet) in the current page and cannot consider the overall structure. Multi-page alignment methods (such as DCADE) can tell templates from data by comparing multiple pages, but their identification of record sets is often not robust enough, and they often cannot complete the extraction task on complex websites. This study combines the advantages of the two approaches: MDR is used to extract record sets from individual web pages, and the record sets extracted from multiple pages are then processed in three subtasks: Recordset Matching, Column Alignment, and NonRecordset Matching. For record sets, we use BERT sentence representations to compute a representation of each record in a record set and then apply schema matching to achieve record set matching; KNN and SVM classifiers are used to complete the Column Alignment task. For the non-record part, we use the multi-page attribute alignment capability of the DCADE multi-page data extraction method. Finally, we combine the two results to achieve multi-page data extraction.
In addition to the ExAlg and WEIR datasets, we also provide a dataset of website news and announcement pages (Announcement List Website, ALW), which is used to test automatic extraction of news and announcement lists. Experimental results show that our proposed method, DEVOSM (Data Extraction via On-the-fly Schema Matching), improves the record set extraction rate by 55.6%, 60%, and 33.7% on the ExAlg, WEIR, and ALW datasets respectively, showing the effectiveness of the proposed method.
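
As a rough illustration of column alignment with a KNN classifier, the sketch below labels an unaligned column by letting each of its cells vote for the nearest reference cells. The abstract does not specify the thesis's features or training setup, so the three character-level features, the function names (`features`, `knn_predict`, `align_column`), and the toy columns are assumptions made only for this example.

```python
import math
from collections import Counter


def features(text):
    """Hypothetical cell features: digit ratio, letter ratio, capped length."""
    n = max(len(text), 1)
    digits = sum(c.isdigit() for c in text) / n
    letters = sum(c.isalpha() for c in text) / n
    return (digits, letters, min(len(text), 50) / 50)


def knn_predict(train, x, k=3):
    """Return the majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def align_column(reference_columns, new_column, k=3):
    """Assign new_column to the reference column that most of its cells vote for."""
    train = [
        (features(value), name)
        for name, values in reference_columns.items()
        for value in values
    ]
    votes = Counter(knn_predict(train, features(v), k) for v in new_column)
    return votes.most_common(1)[0][0]


# Columns of an already matched record set on one page (toy data).
reference = {
    "date": ["2022-09-01", "2022-08-15", "2022-07-30"],
    "title": [
        "Library closed on Friday",
        "New course announced",
        "Campus network maintenance",
    ],
}

# Unlabeled columns from the matching record set on another page.
print(align_column(reference, ["2021-12-24", "2022-01-03"]))
print(align_column(reference, ["Exam schedule released", "Holiday notice"]))
```

An SVM could be swapped in for `knn_predict` without changing `align_column`; only the per-cell classifier differs.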

Keywords (Chinese)

★ 網頁數據擷取 (web data extraction)
★ 綱要匹配 (schema matching)

Keywords (English)

★ Web data extraction
★ Schema matching
★ ETL

Table of Contents

Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Related Work
2.1 Web data extraction
2.1.1 Single-page record set extraction
2.1.2 Multi-page record set extraction
2.1.3 ETL system
2.2 Schema matching
2.2.1 Second-line matcher (2LM)
3. Method
3.1 Task definition
3.2 Record set alignment (Record set matching)
3.3 Column alignment (column matching)
3.4 Non-record-set alignment (non set matching)
3.5 Multi-page data extraction method
4. Experiments
4.1 Datasets
4.2 Record set matching experiments
4.3 Column alignment experiments
4.4 Extended method
5. Conclusion

References

[1] Arvind Arasu and Hector Garcia-Molina. Extracting structured data from web pages. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 337–348, 2003.
[2] David Aumueller, Hong-Hai Do, Sabine Massmann, and Erhard Rahm. Schema and ontology matching with COMA++. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 906–908, 2005.
[3] Lidong Bing, Wai Lam, and Tak-Lam Wong. Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 567–576, 2013.
[4] Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, and Paolo Papotti. Extraction and integration of partially overlapping web sources. Proceedings of the VLDB Endowment, 6(10):805–816, 2013.
[5] David Buttler, Ling Liu, and Calton Pu. A fully automated object extraction system for the world wide web. In Proceedings 21st International Conference on Distributed Computing Systems, pages 361–370. IEEE, 2001.
[6] Andrew Carlson, Justin Betteridge, Richard C Wang, Estevam R Hruschka Jr, and Tom M Mitchell. Coupled semi-supervised learning for information extraction. In Proceedings of the third ACM international conference on Web search and data mining, pages 101–110, 2010.
[7] CH Chang. Information extraction based on pattern discovery. In Proc. of 10th World Wide Web Conference, 2001.
[8] Chia-Hui Chang, Tian-Sheng Chen, Ming-Chuan Chen, and Jhung-Li Ding. Efficient page-level data extraction via schema induction and verification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 478–490. Springer, 2016.
[9] Yu-An Chou. Web data ETL system with unsupervised extraction. 2018.
[10] Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo, et al. RoadRunner: Towards automatic data extraction from large web sites. In VLDB, volume 1, pages 109–118, 2001.
[11] Valter Crescenzi, Paolo Merialdo, and Disheng Qiu. ALFRED: crowd assisted data extraction. In Proceedings of the 22nd international conference on World Wide Web, pages 297–300, 2013.
[12] Hong-Hai Do and Erhard Rahm. COMA: a system for flexible combination of schema matching approaches. In VLDB'02: Proceedings of the 28th International Conference on Very Large Databases, pages 610–621. Elsevier, 2002.
[13] Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, and Robert Baumgartner. Web data extraction, applications and techniques: A survey. Knowledge-Based Systems, 70:301–323, 2014.
[14] Avigdor Gal. Uncertain schema matching. Synthesis Lectures on Data Management, 3(1):1–97, 2011.
[15] Zvi Galil, Silvio Micali, and Harold Gabow. An O(EV log V) algorithm for finding a maximal weighted matching in general graphs. SIAM Journal on Computing, 15(1):120–130, 1986.
[16] Patricia Jiménez and Rafael Corchuelo. On learning web information extraction rules with TANGO. Information Systems, 62:74–103, 2016.
[17] Mohammed Kayed and Chia-Hui Chang. FiVaTech: Page-level web data extraction from template pages. IEEE Transactions on Knowledge and Data Engineering, 22(2):249–263, 2009.
[18] Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.
[19] Bing Liu, Robert Grossman, and Yanhong Zhai. Mining data records in web pages. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 601–606, 2003.
[20] Anan Marie and Avigdor Gal. Managing uncertainty in schema matcher ensembles. In International Conference on Scalable Uncertainty Management, pages 60–73. Springer, 2007.
[21] Jan Portisch, Michael Hladik, and Heiko Paulheim. Background knowledge in schema matching: A survey.
[22] Jianfeng Qu, Dantong Ouyang, Wen Hua, Yuxin Ye, and Xiaofang Zhou. Discovering correlations between sparse features in distant supervision for relation extraction. In Proceedings of the twelfth ACM international conference on web search and data mining, pages 726–734, 2019.
[23] Erhard Rahm and Philip A Bernstein. On matching schemas automatically. VLDB Journal, 10(4):334–350, 2001.
[24] Tanvi Sahay, Ankita Mehta, and Shruti Jadon. Schema matching using machine learning. In 2020 7th International Conference on Signal Processing and Integrated Networks (SPIN), pages 359–366. IEEE, 2020.
[25] Sunita Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3):261–377, 2008.
[26] Roee Shraga, Avigdor Gal, and Haggai Roitman. ADnEV: Cross-domain schema matching using deep similarity matrix adjustment and evaluation. Proceedings of the VLDB Endowment, 13(9):1401–1415, 2020.
[27] Hassan A Sleiman and Rafael Corchuelo. TEX: An efficient and effective unsupervised web information extractor. Knowledge-Based Systems, 39:109–123, 2013.
[28] Oviliani Yenty Yuliana and Chia-Hui Chang. A novel alignment algorithm for effective web data extraction from singleton-item pages. Applied Intelligence, 48(11):4355–4370, 2018.
[29] Oviliani Yenty Yuliana and Chia-Hui Chang. DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Applied Intelligence, 50(2):271–295, 2020.

Advisor

張嘉惠 (Chia-Hui Chang)

Approval Date

2022-9-1