Focused Crawling for Information Gathering Using Hidden Markov Model

Name

Tsung-Kun Shih (施宗昆)

Department

Department of Computer Science and Information Engineering (資訊工程學系)

Thesis title

Focused Crawling for Information Gathering Using Hidden Markov Model
(使用隱藏式馬可夫模型之特定網頁資訊抓取蒐集)

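This electronic thesis is authorized for immediate open access.
Full texts that have reached their open-access date are licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.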

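Abstract (Chinese)

Searching for information is the main activity on the Web today. Although current search engines are already quite usable, they still have shortcomings that need to be addressed. Many information needs are hard to satisfy with keyword-based queries alone. In this thesis, we therefore build a Hidden Markov Model to predict the most likely path of web pages and thereby gather specific information. The experimental results also show that our system mitigates some of the shortcomings that search engines face.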

Abstract (English)

Information search is the key activity for many users on the Web. Although search engines are now very useful and powerful, they still have many drawbacks. Moreover, many information needs are hard to express using keyword-based queries. In this thesis, we address composite information needs by building a Hidden Markov Model (HMM) that predicts the most likely path to the target information. We use the concept of focused crawling to trace through a Web site for specific information. The experiments show good results for admission information and accepted papers.
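
The idea of predicting the most likely path with an HMM can be illustrated with Viterbi decoding. The following is a minimal sketch, not the thesis's implementation: the hidden states (`home`, `listing`, `target`), the observed link keywords, and every probability below are hypothetical values chosen only for illustration, since this record does not describe the actual model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log_probability, path) of the most likely hidden-state sequence."""

    def log(p):
        # Zero probabilities map to -inf so they never win a max().
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log prob of the best path ending in state s at step t,
    #               predecessor state on that path)
    first = observations[0]
    best = [{s: (log(start_p[s]) + log(emit_p[s].get(first, 0.0)), None)
             for s in states}]

    for obs in observations[1:]:
        row = {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            score, prev = max(
                (best[-1][p][0] + log(trans_p[p].get(s, 0.0)), p)
                for p in states
            )
            row[s] = (score + log(emit_p[s].get(obs, 0.0)), prev)
        best.append(row)

    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    score = best[-1][last][0]
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return score, path


# Hypothetical page-type states and link-keyword observations.
states = ["home", "listing", "target"]
start_p = {"home": 0.8, "listing": 0.15, "target": 0.05}
trans_p = {
    "home": {"home": 0.1, "listing": 0.7, "target": 0.2},
    "listing": {"home": 0.1, "listing": 0.3, "target": 0.6},
    "target": {"home": 0.2, "listing": 0.2, "target": 0.6},
}
emit_p = {
    "home": {"index": 0.7, "admission": 0.2, "requirements": 0.1},
    "listing": {"index": 0.2, "admission": 0.6, "requirements": 0.2},
    "target": {"index": 0.05, "admission": 0.25, "requirements": 0.7},
}

score, path = viterbi(["index", "admission", "requirements"],
                      states, start_p, trans_p, emit_p)
print(path, math.exp(score))
```

In a focused crawler, a decoded path like this could be used to rank which outgoing link to follow next. How the thesis actually defines its states and observations is not stated in this record.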

Keywords (Chinese)

★ Markov chain (馬可夫鏈)
★ Information gathering (資訊蒐集)

Keywords (English)

★ HMM
★ Information Gathering

Table of contents

1. INTRODUCTIONS
2. RELATED WORK
2.1 GENERAL TOPIC
2.2 FOCUSED TOPIC
2.3 DEEP WEB
3. SYSTEM OVERVIEW
3.1 Hidden Markov Model Construction
3.1.1 Collecting User Browsing Sequence
3.1.2 Concept Graph Construction
3.1.3 The Construction of Hidden Markov Model
3.2 EXECUTION
4. EXPERIMENTS
5. CONCLUSIONS
6. REFERENCE

Advisor

Chia-Hui Chang (張嘉惠)

Approval date

2007-10-11
