Recognition of Invitation E-mails and Extraction of Irregular Time Expressions for Intelligent E-mail Systems

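Name

Chung-Han Wu (吳忠翰)

Department

In-service Master Program, Department of Computer Science and Information Engineering

Thesis Title

Recognition of Invitation E-mails and Extraction of Irregular Time Expressions for Intelligent E-mail Systems
(Original Chinese title: 行程邀約郵件的辨識與不規則時間擷取之研究)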

Files

Full text in the repository system: never open to the public

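Abstract (translated from Chinese)

E-mail is one of the most important communication tools today; whether at work or in everyday life, people receive many e-mails every day. E-mails often carry important information: meeting or schedule invitations, for example, contain the time of an event. Unless a person identifies this information and manually marks the event on a calendar, it can easily be buried among a large volume of mail, and the event may be missed. People therefore need an automated solution. E-mail content, however, is unstructured text, so it is hard to tell whether a message is an invitation, and the times in it are mostly expressed colloquially, which makes them hard to recognize and extract.
This study therefore aims to build a system that recognizes invitation e-mails and then extracts the time expressions in them as the basis for later reminders. The system has two parts. The first part extracts features from e-mails and uses a support vector machine (SVM) classifier to train an e-mail classification model that recognizes invitation e-mails. The second part uses conditional random fields (CRF) to train a model that labels time expressions and so extracts the time keywords from these e-mails. Finally, the system automatically adds the extracted events to Google Task through the Google Task API. This mechanism reduces the burden of manual screening on the user and lowers the chance of missing an event. Experimental results show that the proposed method achieves an F-measure of 94.8 for invitation e-mail recognition and 95.7 for time expression extraction.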
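As an illustration of the second stage, the sketch below shows how per-token labels produced by a sequence labeler such as a CRF could be turned into time-expression strings. The B-TIME/I-TIME/O tag scheme, the function name `extract_time_spans`, and the sample tokens are assumptions made for illustration; the abstract does not state the label set the thesis actually uses.

```python
from typing import List


def extract_time_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Join consecutive B-TIME/I-TIME tokens into time-expression strings.

    The B/I/O tag names are a hypothetical labeling scheme, not one taken
    from the thesis.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-TIME":
            # A new expression starts; flush any expression still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-TIME" and current:
            # Continue the expression that is currently open.
            current.append(token)
        else:
            # "O", or a stray I-TIME with nothing open: close the open expression.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Made-up example sentence and labels, for illustration only.
tokens = ["Let's", "meet", "next", "Friday", "at", "3", "pm", "in", "room", "A"]
tags = ["O", "O", "B-TIME", "I-TIME", "I-TIME", "I-TIME", "I-TIME", "O", "O", "O"]
print(extract_time_spans(tokens, tags))
```

In a full system, the extracted strings would still need to be normalized into concrete dates before they could be added to a task list; that step is not sketched here.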

Abstract (English)

E-mail is one of the most important communication tools today, and many people receive a large number of e-mails at work and in daily life. Invitation e-mails often contain important information that needs to be tracked for some time. Such messages are easily forgotten if people do not handle them immediately and mark them on their calendars right away. To address this issue, people need an automatic solution that can recognize invitation e-mails and the time expressions in them for later reminders. The challenge is extracting information from unstructured free text.
This research proposes a system that recognizes invitation e-mails and extracts time expressions via machine learning. The system is composed of two parts. The first part uses a Support Vector Machine classifier to build a model that predicts the class of a new e-mail. The second part uses Conditional Random Fields to build a model that extracts time expressions from an e-mail. Finally, the extracted time expressions are added to the Google Task service through the Google Task API. This mechanism reduces the effort of reading e-mails and the chance of missing events. The proposed methods achieve F-scores of 94.8 and 95.7 for recognizing invitation e-mails and time expressions, respectively.
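The F-score reported above is conventionally the harmonic mean of precision and recall. A minimal sketch of that computation follows; the function name `f_score` and the counts in the usage line are made-up illustration values, not results from the thesis, and the abstract does not say which averaging the reported figures use.

```python
def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the balanced F-score (F1) from raw counts, or 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
print(round(f_score(true_positives=90, false_positives=10, false_negatives=5), 3))
```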

Keywords (translated from Chinese)

★ Extraction of irregular time expressions
★ Recognition of invitation e-mails

Keywords (English)

★ Recognition of Invitation E-mails
★ Extraction of Irregular Time Expressions

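Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Motivation
1-2 Background
1-3 Chapter Overview
2. Related Work
2-1 E-mail Classification
2-2 Information Extraction
2-2-1 Hand-crafted Rule-based Methods
2-2-2 Machine Learning-based Methods
3. Methodology
3-1 Invitation E-mail Recognition
3-1-1 Pre-processing for E-mail Classification
3-1-2 Feature Extraction for E-mail Classification
3-1-3 Learning Module for E-mail Classification
3-2 Time Expression Extraction
3-2-1 Pre-processing for Sequence Labeling
3-2-2 Feature Extraction for Sequence Labeling
3-2-3 Learning Module for Sequence Labeling
3-2-4 Time Expression Extraction
4. Experiments
4-1 Data Sources
4-2 Invitation E-mail Recognition Experiment
4-3 Time Expression Extraction Experiment
5. System Architecture Design and Implementation
5-1 Development Architectures
5-2 Comparison of Development Architectures
5-3 System Integration Architecture
5-4 System Operation Screens
6. Conclusion and Future Work
References
Appendix 1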

Advisor

Chia-hui Chang (張嘉惠)

Approval Date

2012-8-16
