Name 周家宇 (CHIA-YU CHOU)
Department Computer Science and Information Engineering (資訊工程學系)
Thesis title Incremental Enterprise Email Classification Based on Cosine and Fuzzy Similarity Approaches (基於餘弦和模糊相似度方法之漸進式企業電子郵件分類)
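- This electronic thesis is authorized for immediate open access.
- The open-access electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes.
- Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid breaking the law.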
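Abstract (translated from Chinese) The spread and convenience of the Internet have sharply increased email usage. Many enterprises treat email as an important channel for communicating with customers and among internal employees, so managing the enterprise email system has become correspondingly important. However, it is unavoidable that many employees use the enterprise mail system to send private messages. As a consequence, private email not only consumes the mail system's bandwidth and lowers system performance, but may also delay important business email or prevent it from being sent, causing commercial losses for the company. Moreover, with rising awareness of privacy rights, the goal of this study is to classify private and business email without monitoring message content, so as to improve the company's business benefit.
To achieve this goal, this study uses only the header data of an email rather than its content. Although this may lower classification accuracy, it protects employees' privacy. Using the extracted header data, enterprise email is classified with cosine and fuzzy similarity methods. More importantly, the incremental system proposed in this study effectively avoids processing the very large volume of accumulated mail data, and it also takes into account changes over time in the company's personnel and customer base.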
Abstract (English) Email usage has increased as the Internet has become more common. Many enterprises regard email as an essential channel for communicating with customers and employees, so managing the email system has become even more important for an enterprise. However, it is unavoidable that many employees send private email through the enterprise email system. This harms the email system because bandwidth is consumed for personal purposes. Worse, it may delay important business email or prevent it from being sent, which may hurt the enterprise's business interests. Moreover, the public has become more concerned about privacy. The goal of this study is to classify enterprise email as either business or personal, without monitoring email content, in order to improve business interests.
To achieve this goal, only the email header is used, not the email content. Although this may lower classification accuracy, it protects employees' privacy rights. The extracted header data is used to classify enterprise email with cosine similarity and fuzzy similarity approaches. More importantly, the incremental system proposed in this study effectively avoids handling the huge volume of accumulated email, and it also accounts for changes in an enterprise's internal staff and customers over time.
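The abstract names the ingredients (header-only features, cosine similarity, fuzzy similarity, incremental updating) but this record does not give the formulas. The following is therefore only a minimal sketch under stated assumptions, not the thesis's actual method: the function names, the `to:`/`subj:` token prefixes, the choice of a fuzzy Jaccard measure as the "fuzzy similarity", and the rule of returning `None` when nothing matches are all hypothetical. Recipient and subject fields are merged into one vector here for brevity.

```python
from collections import Counter
import math


def header_tokens(recipients, subject):
    """Build a bag of tokens from header fields only (no body text is read)."""
    tokens = [f"to:{addr.lower()}" for addr in recipients]
    tokens += [f"subj:{word.lower()}" for word in subject.split()]
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuzzy_similarity(a, b):
    """Assumed measure: fuzzy Jaccard, sum of min over sum of max memberships."""
    def memberships(counts):
        # Scale counts into [0, 1] so they can act as fuzzy membership grades.
        peak = max(counts.values(), default=0)
        return {t: v / peak for t, v in counts.items()} if peak else {}

    ma, mb = memberships(a), memberships(b)
    terms = ma.keys() | mb.keys()
    top = sum(min(ma.get(t, 0.0), mb.get(t, 0.0)) for t in terms)
    bottom = sum(max(ma.get(t, 0.0), mb.get(t, 0.0)) for t in terms)
    return top / bottom if bottom else 0.0


def classify(email, profiles, similarity=cosine_similarity):
    """Return the label of the most similar group profile, or None if no overlap."""
    scores = {label: similarity(email, profile) for label, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0.0 else None


def update_profile(profiles, label, email):
    """Incremental step: fold one labeled email into its group profile in place,
    so earlier mail never has to be reprocessed."""
    profiles[label].update(email)


# Hypothetical example data.
profiles = {
    "business": header_tokens(["client@corp.example"], "quarterly invoice"),
    "personal": header_tokens(["friend@mail.example"], "weekend dinner"),
}
incoming = header_tokens(["client@corp.example"], "invoice reminder")
label = classify(incoming, profiles)
```

With this toy data the incoming message shares a recipient and one subject word with the business profile and nothing with the personal one, so it is labeled `business`. Keeping each group as a single running profile is one simple way to avoid revisiting accumulated mail; a real system would also need a policy for aging out stale recipients, which this sketch omits.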
Keywords (Chinese) ★ 模糊相似度 (fuzzy similarity)
★ 餘弦相似度 (cosine similarity)
★ 電子郵件分類 (email classification)
Keywords (English) ★ fuzzy similarity
★ E-mail classification
★ cosine similarity
Table of contents Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation and Objectives
1-3 Thesis Organization
2. Related Work
2-1 Data Mining
2-2 Related Research on Classification Techniques
2-2-1. Bayesian Classification
2-2-2. Decision Tree
2-2-3. Cosine Similarity
2-2-4. Fuzzy Theory
2-2-5. Support Vector Machines (SVM)
2-3 Related Email Classification
2-4 Evaluation Methods
3. System Design and Methods
3-1 System Architecture
3-2 Problem Definition
3-3 System Classification Method
3-3-1. Email Recipient Similarity
3-3-2. Email Subject Similarity
3-3-3. Email Classification
3-4 Incremental Strategy of the System
3-4-1. Email Recipient Part
3-4-2. Email Subject Part
3-4-3. Unclassifiable Emails and Emails with Inconsistent Recipient and Subject Judgments
4. Experimental Procedure and Methods
4-1 Data Collection and Preprocessing
4-1-1. Data Source
4-1-2. Extraction of Email Header Data
4-1-3. Building the Business and Personal Email Groups
4-2 Experimental Settings and Description
4-2-1. System Classification
4-2-2. Incremental Strategy of the System
4-3 Performance Evaluation Methods
4-3-1. Recall Rate
4-3-2. Precision Rate
4-3-3. Accuracy
4-3-4. False Positive Rate
4-3-5. F-score
5. Experimental Results and Analysis
5-1 Analysis of Unclassifiable Emails
5-2 Accuracy Analysis
5-3 Recall Rate Analysis
5-4 Precision Rate Analysis
5-5 False Positive Rate Analysis
5-6 F-score Analysis
5-7 Execution Time Analysis
6. Conclusion
7. References
Advisor 蔡孟峰 (Meng-Feng Tsai)
Approval date 2012-8-7