Abstract (English) |
With the progress of information technology, web pages are rapidly replacing traditional paper documents. The variety and abundance of content in web pages make extracting useful information far more difficult than before. Information extraction technology makes it possible to extract such information from unstructured data through a series of processes such as arrangement, distillation, and consolidation. Because the underlying structure of web pages can change and designers' personal styles vary, the most straightforward, though not necessarily cost-effective, approach is to build an extraction system manually according to the characteristics of each individual web site. Automated extraction is therefore the most desirable goal.
This thesis focuses on extracting conference information, such as conference names, locations, dates, and paper acceptance dates, from DB World and international conference web pages. Because bulletin-type conference web pages are text-rich and are written and published informally by different individuals without any structural standardization, integrating them and extracting information from them is difficult. The system, which is built on machine learning techniques, is reliable and has been validated to perform well in extracting specific fields from pages across different web sites. |
References |
[1] Dayne Freitag. Information Extraction from HTML: Application of a General Machine Learning Approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 517–523, 1998.
[2] Dayne Freitag. Machine Learning for Information Extraction in Informal Domains. Ph.D. thesis, Carnegie Mellon University, 1998.
[3] M.E. Califf and R.J. Mooney. Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the 16th National Conference on AI, pages 328-334, 1999.
[4] M.E. Califf and R.J. Mooney. Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction. Journal of Machine Learning Research, 4:177-210, 2003.
[5] M.E. Califf. Relational Learning Techniques for Natural Language Information Extraction. Ph.D. thesis, The University of Texas at Austin, 1998. Technical Report AI98-269.
[6] I. Muslea, S. Minton, and C. Knoblock. A Hierarchical Approach to Wrapper Induction. In Proceedings of the 3rd International Conference on Autonomous Agents (Agents-99), pp. 190-197, Seattle, Washington, 1999.
[7] Chun-Nan Hsu. Initial Results on Wrapping Semi-structured Web Pages with Finite-State Transducers and Contextual Rules. In Proceedings of AAAI-98 Workshop on AI and Information Integration, Technical Report WS-98-01. 1998.
[8] Chun-Nan Hsu and Chien-Chi Chang. Finite-State Transducers for Semi-structured Text Mining. In Proceedings of IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pp. 38-49, Stockholm, Sweden, 1999.
[9] C.H. Chang and S.C. Lui. IEPAD: Information Extraction Based on Pattern Discovery. In Proceedings of the 10th International Conference on World Wide Web, pp. 681-688, 2001.
[10] J. Wang and F.H. Lochovsky. Data Extraction and Label Assignment for Web Databases. In Proceedings of the Twelfth International Conference on World Wide Web, pages 187-196, 2003.
[11] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’03), pages 24-27, 2003.
[12] Muggleton, S., and Feng, C. Efficient Induction of Logic Programs. In Muggleton, S., ed., Inductive Logic Programming. New York: Academic Press, 281-297, 1992.
[13] Zelle, J. M., and Mooney, R. J. Combining Top-Down and Bottom-Up Methods in Inductive Logic Programming. In Proceedings of the Eleventh International Conference on Machine Learning, 343-351, 1994.
[14] Muggleton, S. Inverse Entailment and Progol. New Generation Computing Journal, 13:245-286, 1995.
[15] Developing Language Processing Components with GATE Version 3 (a User Guide), http://gate.ac.uk/sale/tao, The University of Sheffield, 2001-2005.
[16] GATE – An Application Developer’s Guide, http://www.dcs.shef.ac.uk/~valyt, Department of Computer Science, University of Sheffield, UK, 19 July 2004.
[17] Tom Kenter and Diana Maynard. Using GATE as an Annotation Tool. 28th January 2005.
[18] Tom M. Mitchell, Carnegie Mellon University. Machine Learning.
[19] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
[20] Richard J. Roiger and Michael W. Geatz. Data Mining: A Tutorial-Based Primer.
[21] Weka The University of Waikato http://www.cs.waikato.ac.nz/ml/weka/
[22] Coenen, F. LUCS-KDD implementations of the FOIL, PRM and CPAR algorithms, http://www.cxc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/, Department of Computer Science, The University of Liverpool, UK, 2004.
[23] C.J.C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, pp. 121-167, 1998.
[24] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. Department of Computer Science and Information Engineering, NTU.
[25] LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/ |