Thesis/Dissertation 974203033 Detailed Record




Author 余東霖 (Tung-lin Yu)   Graduate department Information Management
Thesis title Two-phase Classification Approach for Identifying News Category
(以兩階段分類方法識別新聞類別)
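Related theses
★ A Study of Business Intelligence in the Retail Industry
★ Building an Anomaly Detection System for Landline Telephone Calls
★ Applying Data Mining Techniques to Analyze School Grades and Scholastic Ability Test Results: A Case Study of a Vocational High School Food and Beverage Management Program
★ Using Data Mining Techniques to Improve Wealth Management Performance: A Case Study of a Bank
★ Evaluation and Analysis of Wafer Fabrication Yield Models: A Case Study of a Domestic DRAM Manufacturer
★ A Study of Applying Business Intelligence Analysis to Student Grades
★ Building a Prediction Model of Academic Achievement for Upper-Grade Elementary School Students Using Data Mining Techniques
★ A Study of Building a Motorcycle Loan Risk Assessment Model Using Data Mining Techniques: A Case Study of Company A
★ Applying Performance Indicator Evaluation to Improve R&D Design Quality Assurance
★ Improving Hiring Quality with Machine Learning Based on Text Résumés and Personality Traits
★ A General Framework Based on a Relational Genetic Algorithm for Solving Set Partitioning Problems with Constraint Handling
★ Mining Generalized Knowledge from Relational Databases
★ Decision Tree Construction Considering Delays in Acquiring Attribute Values
★ A Method for Finding Preference Graphs from Sequence Data: An Application to the Group Ranking Problem
★ Finding Consensus Groups with a Partitional Clustering Algorithm to Solve Group Decision Problems
★ A Novel Approach to Ordered Consensus Groups Applied to Group Decision Problems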
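Access rights
  1. The author has agreed to make this electronic thesis openly available immediately.
  2. Once the open-access date has been reached, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
  3. Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the thesis without authorization, so as to avoid breaking the law.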

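Abstract (translated from Chinese) Many past studies have addressed the problem of determining news categories, but they focused only on the technical side, that is, on how to develop more efficient or more accurate algorithms on top of existing algorithmic frameworks. They overlooked classifying news from a human point of view, namely imitating the process that news workers actually follow when classifying news. This research therefore develops an algorithm that imitates the process experts follow when classifying news. After interviewing news workers, we found that the experts' classification process can be roughly divided into two steps. First, they skim the news article, looking for keywords that are representative or that help them classify it. Second, if the keywords found cannot help them classify the article, or are not sufficiently representative of a news category, they examine the whole content of the article in detail.
  Imitating and following the expert knowledge and classification process we observed, this research divides the news classification algorithm into two steps. In the training phase, we first use "associative classification rules" to find the representative keywords of each category, and then use "clustering" to generate subcategories under each category. In the testing phase, we first use the associative classification rules to look for matching classification rules; if the confidence of the rules is insufficient, we further compare the similarity between the news and the subcategories to find the most suitable news category. Experiments show that the proposed expert-oriented method achieves better and more stable classification accuracy than traditional technique-oriented methods.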
Abstract (English) The news classification problem is concerned with how to assign the correct category to unclassified news. Although a large number of past studies have addressed this problem, a common weakness of these studies is that their classification algorithms were usually designed from a technical perspective, and they seldom considered how experts actually classify news in a practical classification process. In this research, we first observe how media workers classify news in their daily operations, and we find that their classification process mainly consists of the following operations. (1) If some important keywords or phrases are present in the news, they directly assign the news to certain categories. (2) Otherwise, they must check the whole content of the news in detail to determine which category it should belong to. (3) Since a news category may contain several independent but related subcategories, the news is usually classified by assigning it to the most appropriate subcategory, which in turn determines its category.
  By imitating the above working process, we propose a news classification algorithm. In the learning phase, we use associative classification rules to find representative keywords in each category. In addition, we generate a number of subcategories by clustering the news under each category. In the classification phase, we assign unclassified news to the most appropriate category by using associative classification rules if the rules' confidence is high enough. Otherwise, we determine the category by measuring the similarity between the unclassified news and the subcategories. The experimental comparison shows that our approach has better and more stable classification performance than traditional algorithms.
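The two-phase flow described in the abstract can be illustrated with a rough Python sketch. This is not the thesis's implementation: the record gives no algorithmic details, so everything below is an assumption made for illustration. Rules are restricted to single keywords, the clustering is a simple single-pass "leader" scheme, similarity is cosine similarity over raw term counts, and the function names, thresholds, and toy data are all hypothetical.

```python
from collections import Counter, defaultdict
import math

def mine_keyword_rules(docs, min_support=2, min_confidence=0.8):
    """Learning phase, step 1 (assumed form): single-keyword rules 'word -> category'.
    Confidence = share of the documents containing the word that have that category."""
    word_count, pair_count = Counter(), Counter()
    for tokens, category in docs:
        for word in set(tokens):
            word_count[word] += 1
            pair_count[(word, category)] += 1
    rules = {}
    for (word, category), n in pair_count.items():
        confidence = n / word_count[word]
        if n >= min_support and confidence >= min_confidence:
            if word not in rules or confidence > rules[word][1]:
                rules[word] = (category, confidence)
    return rules

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_subcategories(docs, threshold=0.3):
    """Learning phase, step 2 (assumed stand-in): single-pass leader clustering
    inside each category; each centroid is the summed term counts of its members."""
    centroids = defaultdict(list)
    for tokens, category in docs:
        vec = Counter(tokens)
        best = max(centroids[category], key=lambda c: cosine(vec, c), default=None)
        if best is not None and cosine(vec, best) >= threshold:
            best.update(vec)                 # join the closest subcategory
        else:
            centroids[category].append(vec)  # start a new subcategory
    return centroids

def classify(tokens, rules, centroids):
    """Classification phase: use a sufficiently confident rule if one matches
    (low-confidence rules were already discarded during mining); otherwise
    fall back to the category of the most similar subcategory."""
    matches = [rules[w] for w in set(tokens) if w in rules]
    if matches:
        return max(matches, key=lambda r: r[1])[0], "rule"
    vec = Counter(tokens)
    best = max(centroids, key=lambda cat: max(cosine(vec, c) for c in centroids[cat]))
    return best, "similarity"

# Toy training data, invented purely for illustration.
train = [
    ("election vote parliament minister".split(), "politics"),
    ("election campaign vote poll".split(), "politics"),
    ("minister budget tax parliament".split(), "politics"),
    ("match goal league striker".split(), "sports"),
    ("league match coach injury".split(), "sports"),
    ("tennis serve match court".split(), "sports"),
]
rules = mine_keyword_rules(train)
centroids = build_subcategories(train)
print(classify("vote count delayed".split(), rules, centroids))     # ('politics', 'rule')
print(classify("coach praises striker".split(), rules, centroids))  # ('sports', 'similarity')
```

In this toy run the first headline contains a keyword with a retained rule, so it is decided in the first phase; the second contains no such keyword and is decided by its similarity to a sports subcategory. A fuller version would presumably mine multi-keyword rules and use a proper clustering algorithm, but the record does not say which ones the thesis uses.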
Keywords (Chinese) ★ Clustering (分群)
★ Associative Classification Rule (分類關聯規則)
★ Text Mining (文字探勘)
★ News Classification (新聞分類)
Keywords (English) ★ Text Mining
★ News Classification
★ Clustering
★ Associative Classification Rule
Thesis contents
Abstract
Abstract (Chinese)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 The idea of our approach
  1.4 Objective
Chapter 2 Literature Review
  2.1 News (document) classification
  2.2 Text preprocessing
  2.3 Algorithm selection
  2.4 Our work
Chapter 3 Algorithm
  3.1 Sketch of the proposed approach
  3.2 Symbol definition
  3.3 Batch process
  3.4 Online process
Chapter 4 Performance Evaluation
  4.1 Data collections
  4.2 Measurements
  4.3 Control variables optimization
  4.4 Experimental comparison
Chapter 5 Conclusion
  5.1 Contribution
  5.2 Future work
References
Advisor 陳彥良 (Yen-liang Chen)   Approval date 2010-7-14

For questions about this thesis, please contact the Extension Services Section of the National Central University Library, TEL: (03)422-7151 ext. 57407, or by e-mail.