Automatic Extraction of Blog Post from Diverse Blog Pages

Name

Jhih-ming Chen (陳志銘)

Department

Department of Computer Science and Information Engineering

Thesis Title

Automatic Extraction of Blog Post from Diverse Blog Pages
(基於多元化部落格網頁之自動化擷取部落格主要文章)

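Related Theses

★ A Study on Recognizing Schedule-Invitation Emails and Extracting Irregular Time Expressions
★ Design of the NCUFree Campus Wireless Network Platform and Development of Application Services
★ Design and Implementation of a Semi-structured Web Data Extraction System
★ Mining and Applications of Non-simple Browsing Paths
★ Improvements to Association Rule Mining for Incremental Data
★ Applying the Chi-square Test of Independence to Associative Classification
★ Design and Study of a Chinese Information Extraction System
★ Visualization of Non-numeric Data and Clustering That Combines Subjective and Objective Criteria
★ A Study of Associated Word Groups in Document Summarization
★ Web Page Cleaning: Page Segmentation and Data Region Extraction
★ Design and Study of a Question Answering System Using Sentence Classification and Ranking
★ Efficient Mining of Compact Frequent Consecutive Event Patterns in Temporal Databases
★ Axis Arrangement of Star Coordinates Applied to Cluster Visualization
★ A Study on Automatically Generating Web Crawlers from Browsing History
★ A Study on Template and Data Analysis of Dynamic Web Pages
★ A Study on Automating Data Integration across Homogeneous Web Pages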

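The author has granted immediate open access to this electronic thesis.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.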

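Abstract (Chinese)

In recent years, blog-centered research such as opinion retrieval and sentiment analysis has flourished, which makes extracting the main post of a blog page an indispensable step. In this thesis, we investigate how to extract the main post accurately and automatically from a wide variety of blog pages. Much previous work focuses on extracting the main article from news pages and is not notably effective when applied to blog pages: blog pages come in widely varying styles and their posts mix several content formats, which makes post extraction more complicated. To address this problem, we combine and adapt the approaches of MSS [24] and CETR [34] and propose two methods for blog post extraction. The first, PTR Scoring, is an unsupervised algorithm that combines Post-to-Tag Ratio with Maximum Scoring Subsequence. The second, CRF Scoring, uses the Conditional Random Fields probabilistic model and applies Maximum Scoring Subsequence to improve extraction accuracy. Experimental results show that CRF Scoring reaches an F-measure of 91.9%, the highest among the extraction methods in this thesis. The proposed methods can be applied to small-screen devices such as PDAs and mobile phones, can improve the performance of blog search engines, and can serve as a reference for subsequent related research.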

Abstract (English)

With the rapid development of the blogosphere, blog post extraction has become an essential task for research on blogs. However, very little attention has been given specifically to blog post extraction. In this paper, we address the problem of extracting blog posts from diverse blog pages, aiming to locate each blog post automatically and precisely. Most previous research has focused on extracting the main content of news pages, but the problem becomes more complex for blog pages: a single post may use several content formats at once, and miscellaneous information on the page can reduce extraction accuracy. Our research combines MSS [24] and CETR [34] to develop algorithms suited to blog pages. The first method we propose is PTR Scoring, which combines Post-to-Tag Ratio with maximum scoring subsequence. The second method is CRF Scoring, which trains Conditional Random Field models and uses maximum scoring subsequence to improve extraction accuracy. The experimental results show that CRF Scoring achieves the best F-measure, 91.9%, among the methods compared.
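The abstract names two building blocks, a per-line tag ratio and a maximum scoring subsequence, without giving their definitions. The following is a minimal sketch of how such a pipeline might look, not the thesis's implementation. It assumes a CETR-style text-to-tag ratio (non-tag characters divided by tag count) as a stand-in for Post-to-Tag Ratio, whose exact definition is not stated in this record, and it omits the smoothing step listed in the table of contents. The function names and the `threshold` parameter are hypothetical.

```python
import re

# Matches any HTML tag such as <p>, </div>, <a href="...">.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line):
    """CETR-style ratio: non-tag characters per tag (tag count floored at 1).

    Assumption: used here in place of the thesis's Post-to-Tag Ratio.
    """
    tag_count = len(TAG_RE.findall(line))
    text = TAG_RE.sub("", line).strip()
    return len(text) / max(tag_count, 1)


def max_scoring_subsequence(scores):
    """Return (start, end, total) of the contiguous run with the largest sum.

    `end` is exclusive. If no score is positive, the empty run (0, 0, 0.0)
    is returned.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    run_sum, run_start = 0.0, 0
    for i, score in enumerate(scores):
        if run_sum <= 0:
            # A non-positive prefix can never help: start a new run here.
            run_sum, run_start = score, i
        else:
            run_sum += score
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end, best_sum


def extract_main_block(html_lines, threshold):
    """Score each line as (ratio - threshold) and keep the best contiguous run.

    `threshold` is a hypothetical tuning parameter: lines whose ratio falls
    below it count against the run, lines above it count in favor.
    """
    scores = [text_to_tag_ratio(line) - threshold for line in html_lines]
    start, end, _ = max_scoring_subsequence(scores)
    return html_lines[start:end]


if __name__ == "__main__":
    page = [
        '<div id="nav"><a href="/">Home</a><a href="/about">About</a></div>',
        "<p>Today I visited the night market and tried three kinds of dumplings.</p>",
        "<p>The second stall was the best; the filling was juicy and well seasoned.</p>",
        "<br>",
        "<p>I will go back next week and write a longer review of each stall.</p>",
        '<div class="footer"><a href="/rss">RSS</a><span>2011</span></div>',
    ]
    for line in extract_main_block(page, threshold=10.0):
        print(line)
```

With this toy page the selected run covers the three paragraphs and the `<br>` between them: the low-scoring `<br>` line is absorbed because the text-heavy lines on both sides outweigh it, while the tag-heavy navigation and footer lines are left out.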

Keywords (Chinese)

★ maximum scoring subsequence
★ sequence labeling
★ information retrieval
★ blog

Keywords (English)

★ blog post extraction
★ sequence labeling
★ maximum subsequence

Table of Contents

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Related Work
  2.1 Content Extraction
  2.2 Application
3. Our Proposed Method
  3.1 Unsupervised Blog Post Extraction with PTR Scoring
    3.1.1. Post-to-Tag Ratio
    3.1.2. Smoothing Function
    3.1.3. Maximum Scoring Subsequence
  3.2 Supervised Blog Post Extraction with CRF Scoring
    3.2.1. Feature Extraction
    3.2.2. Conditional Random Field
    3.2.3. Applying Maximum Scoring Subsequence
4. Experiments
  4.1 Experimental Setup
  4.2 Performance Study on Unsupervised Blog Post Extraction
  4.3 Performance Study on Supervised Blog Post Extraction
  4.4 Discussion
5. Conclusion & Future Work
References

References

[1] L. Bing, Y. Wang, Y. Zhang and H. Wang. “Primary Content Extraction with Mountain Model”, CIT, IEEE, 2008, pp. 479–484.
[2] D. Cai, S. Yu, J. R. Wen and W. Y. Ma. “VIPS: a Vision-based Page Segmentation Algorithm”, Microsoft Technical Report, MSR-TR-2003-79, 2003.
[3] D. Cao, X. Liao and S. Bai. “Blog Post and Comment Extraction Using Information Quantity of Web Format”, AIRS, ACM, 2008, pp. 298–309.
[4] S. Debnath, P. Mitra, and C. L. Giles. “Automatic extraction of informative blocks from webpages”, SAC, ACM, 2005, pp. 1722–1726.
[5] S. Debnath, P. Mitra, and C. L. Giles. “Identifying content blocks from web documents”, ISMIS, 2005, pp. 285–293.
[6] E. Elgersma and M. de Rijke. “Learning to Recognize Blogs: A Preliminary Exploration”, ECAL Workshop, 2006.
[7] A. Finn, N. Kushmerick, and B. Smyth. “Fact or fiction: Content classification for digital libraries”, DELOS Workshop, 2001.
[8] J. Gibson, B. Wellner and S. Lubar. “Adaptive Web-page Content Identification”, WIDM, ACM, 2007, pp. 105–112.
[9] T. Gottron. “Evaluating content extraction on html documents”, ITA, 2007, pp. 123–132.
[10] T. Gottron. “Combining content extraction heuristics: the combine system”, iiWAS, ACM, 2008, pp. 591–595.
[11] T. Gottron. “Content code blurring: A new approach to content extraction”, DEXA, IEEE, 2008, pp. 29–33.
[12] Y. Guo, H. Tang, L. Song, Y. Wang and G. Ding. “ECON: An Approach to Extract Content from Web News Page”, APWEB, IEEE, 2010, pp. 314–320.
[13] S. Gupta, G. E. Kaiser, P. Grimm, M. F. Chiang, and J. Starren. “Automating content extraction of html documents”, WWW, ACM, 2005, pp. 179–224.
[14] S. Gupta, G. E. Kaiser, D. Neistadt, and P. Grimm. “Dom-based content extraction of html documents”, WWW, ACM, 2003, pp. 207–214.
[15] S. Gupta, G. E. Kaiser, and S. J. Stolfo. “Extracting context to improve accuracy for html content extraction”, WWW, ACM, 2005, pp. 1114–1115.
[16] W. Han, D. Buttler, and C. Pu. “Wrapping web data into xml”, SIGMOD, ACM, 2001, pp. 33–38.
[17] P. Kolari, A. Java, T. Finin, T. Oates and A. Joshi. “Detecting Spam Blogs: A Machine Learning Approach”, AAAI, ACM, 2006, pp. 1351–1356.
[18] J. Lafferty, A. McCallum, and F. Pereira. “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data”, ICML, ACM, 2001, pp. 282–289.
[19] J. Liu, L. Birnbaum and B. Pardo. “Categorizing Blogger’s Interests Based on Short Snippets of Blog Posts”, CIKM, ACM, 2008, pp. 1525–1526.
[20] C. Mantratzis, M. A. Orgun, and S. Cassidy. “Separating XHTML content from navigation clutter using DOM-structure block analysis”, Hypertext, ACM, 2005, pp. 145–147.
[21] M. Marek, P. Pecina and M. Spousta. “Web Page Cleaning with Conditional Random Fields”, WWW, vol. 5, 2007, pp. 1–8.
[22] G. Mishne and M. de Rijke. “Deriving Wishlists from Blogs”, WWW, ACM, 2006, pp. 925–926.
[23] I. Ounis, M. de Rijke, C. Macdonald, G. Mishne, and I. Soboroff. “Overview of the TREC-2006 Blog Track”, TREC, 2006.
[24] J. Pasternack and D. Roth. “Extracting article text from the web with maximum subsequence segmentation”, WWW, ACM, 2009, pp. 971–980.
[25] D. Pinto, M. Branstein, R. Coleman, W. B. Croft, M. King, W. Li, and X. Wei. “Quasm: a system for question answering using semi-structured data”, JCDL, ACM, 2002, pp. 46–55.
[26] M. F. Porter. “An algorithm for suffix stripping”, Program, vol. 14, no. 3, 1980, pp. 130–137.
[27] A. F. R. Rahman, H. Alam and R. Hartono. “Content Extraction from HTML Documents”, WDA, 2001, pp. 7–10.
[28] W. L. Ruzzo and M. Tompa. “A Linear Time Algorithm for Finding All Maximal Scoring Subsequences”, AAAI Press, ACM, 1999, pp. 234–241.
[29] L. Song, X. Cheng, Y. Guo, B. Wu and Y. Wang. “Blog Post Extraction Using Title Finding”, Chinese Academy of Sciences, 2009.
[30] R. Song, H. Liu, J. R. Wen, and W. Y. Ma. “Learning Important Models for Web Page Blocks based on Layout and Content Analysis”, SIGKDD, ACM, 2004, pp. 14–23.
[31] H. M. Wallach. “Efficient Training of Conditional Random Fields”, CLUK Research Colloquium, University of Edinburgh, 2002.
[32] H. M. Wallach. “Conditional Random Fields: An Introduction”, Technical Report MS-CIS-04-21, Univ. of Pennsylvania, 2004.
[33] T. Weninger and W. H. Hsu. “Text Extraction from the Web via Text-to-Tag Ratio”, iiWAS, ACM, 2008, pp. 23–28.
[34] T. Weninger, W. H. Hsu and J. Han. “CETR – Content Extraction via Tag Ratios”, WWW, ACM, 2010, pp. 971–980.
[35] L. Yang, C. Li and M. Gu. “Extracting Content from Web Pages Using the Sliding Window”, CSA, IEEE, 2009, pp. 1–6.
[36] P. H. Yang and C. H. Chang. “Automatic Labeling for Blog Post Extraction”, NCS, Taiwan, 2009.

Advisor

Chia-hui Chang (張嘉惠)

Approval Date

2011-07-22