An Empirical Study of Multi-Label Text Classification: Word Embedding vs. Traditional Techniques

Author

饒以恩 (YI-EN Rau)

Department

In-service Master's Program, Department of Information Management

Thesis title

多重標籤文本分類之實證研究 : word embedding 與傳統技術之比較
(An empirical study of multi-label text classification: word2vector vs traditional techniques)

Files

Full text: never open to the public

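Abstract (translated from the Chinese)

The growth of the Internet has driven rapid advances in social media. Freedom of speech on social media platforms can be abused, for example through online harassment or malicious comments. Recent progress in machine learning has also transformed many fields, including computer vision, speech recognition, and language processing. This study uses machine-learning text classification to filter out malicious comments effectively. The dataset comes from the Kaggle competition Toxic Comment Classification Challenge; its source is Wikipedia comments that human raters have labeled as malicious and toxic. The author applies machine learning (ML) with different vector representations to analyze, compare, and predict on the data.

The vector representations used in this study are TF-IDF and Word2Vec, and the texts are classified with the k-nearest neighbors algorithm, support vector machines, artificial neural networks, and deep learning. Because the dataset carries six labels (toxic, severe_toxic, obscene, threat, insult, identity_hate), each label is analyzed and compared under each pairing of vector representation and classifier.

The experimental results show that, for identifying malicious comments, TF-IDF with the SVM classifier is the best combination in this thesis in terms of precision, while Word2Vec with the LSTM classifier is the best combination in terms of recall.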

Abstract (English)

The development of the Internet has led to the rapid advancement of social media. Because social media platforms combine free speech with anonymity, they are prone to abuse such as cyber harassment and toxic comments. Machine learning has changed many fields, for example computer vision, speech recognition, and language processing. I use machine-learning text classification to filter out toxic comments effectively. The dataset comes from the Kaggle competition Toxic Comment Classification Challenge, whose source is Wikipedia comments. These comments have been flagged as malicious and toxic by human evaluators. I combine machine learning (ML) methods with different document representations for data analysis and prediction.

In this study, the texts are represented with TF-IDF and Word2Vec for comparison, and KNN, SVM, ANN, and deep learning are used as text classifiers. The dataset contains six labels: toxic, severe_toxic, obscene, threat, insult, and identity_hate, so each of the six labels is paired with the different document representations and text classifiers for comparative analysis.
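To make the TF-IDF representation concrete, the following is a minimal sketch using only the Python standard library. The toy comments, the whitespace tokenizer, and the particular weighting (term count divided by document length, times log(N / df)) are assumptions chosen for illustration; the abstract does not say which TF-IDF variant or tooling the thesis used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy comments, not drawn from the Kaggle dataset.
comments = ["you are great", "you are awful", "thanks for the edit"]
vectors = tfidf(comments)
```

In this toy corpus, "great" appears in only one comment and so receives a higher weight than "you", which appears in two. Each of the six labels could then be treated as its own binary classification problem over such vectors, which is one common way to handle multi-label data.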

The results show that, in terms of precision, TF-IDF combined with the SVM classifier gives the best predictive performance. In terms of recall, Word2Vec combined with the LSTM classifier gives the best predictive performance.
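As a reminder of how the two reported metrics are defined, here is a small sketch that computes precision and recall separately for each of the six labels. The label names come from the abstract; the toy label matrices, the helper names, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration and do not reproduce the thesis's evaluation code or results.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def precision_recall(y_true, y_pred):
    """Compute (precision, recall) for one binary label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def per_label_scores(true_rows, pred_rows):
    """Score each label column separately; rows are 0/1 lists in LABELS order."""
    scores = {}
    for i, label in enumerate(LABELS):
        y_true = [row[i] for row in true_rows]
        y_pred = [row[i] for row in pred_rows]
        scores[label] = precision_recall(y_true, y_pred)
    return scores


# Hypothetical ground truth and predictions for four comments.
true_rows = [
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
]
pred_rows = [
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
scores = per_label_scores(true_rows, pred_rows)
```

Precision penalizes false alarms and recall penalizes missed toxic comments, which is why a study like this one can find a different best combination for each metric.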

Keywords (translated from the Chinese)

★ Text classification
★ Word vectors
★ Machine learning
★ Word2Vec
★ Toxic comments

Keywords (English)

★ text classification
★ Document representations
★ machine learning
★ toxic comments

Table of contents

Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
1. Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
2. Literature Review
2.1 Document Representation
2.1.1 Bag-of-Words Model (BoW Model)
2.1.2 Term Frequency-Inverse Document Frequency (TF-IDF)
2.1.3 Word Embedding
2.2 Classifiers
2.2.1 SVM (Support Vector Machine)
2.2.2 KNN (K-Nearest Neighbor Classification)
2.2.3 ANN (Artificial Neural Network)
2.2.4 LSTM (Long Short-Term Memory)
2.2.5 Performance Evaluation
2.2.6 Related Work on Text Classification
3. Experimental Method
3.1 Dataset
3.2 Method and Workflow
3.2.1 Data Preprocessing
3.2.2 Word Representation Generation
3.2.3 Classifiers
3.3 Experiment: Best Combination of Vector Representation and Classifier
4. Results and Analysis
4.1 Overall Analysis
4.2 Per-Label Analysis
5. Conclusion
5.1 Conclusion
5.2 Contributions
5.3 Future Work
References

References

Basheer, I. A., and Hajmeer, M. (2000). “Artificial neural networks: Fundamentals, computing, design, and application.” Journal of Microbiological Methods, 43(1), pp. 3–31.
Cortes, C. and Vapnik, V. (1995). “Support-Vector Networks.” Machine Learning, 20(3), pp. 273–297.
Drucker, H., Wu, D., and Vapnik, V.N. (1999). “Support Vector Machines for Spam Categorization.” IEEE Transactions on Neural Networks, 10(5), pp. 1048–1054.
Enríquez, F., Troyano, J.A., and López-Solaz, T. (2016). “An approach to the Use of Word Embeddings in an Opinion Classification Task.” Expert Systems with Applications, 66(12), pp. 1–6.
Fürnkranz, J. (1998). “A Study Using N-Gram Features for Text Categorization.” Austrian Research Institute for Artificial Intelligence, 3(1998), pp. 1–10.
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. (2015). “LSTM: A Search Space Odyssey.” CoRR, abs/1503.04069.
Guggilla, C., Miller, T., and Gurevych, I. (2016). “CNN- and LSTM-based claim classification in online user comments.” In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers (COLING 2016), pp. 2740–2751.
Hinton, G. E. (1986). “Learning distributed representations of concepts.” In Proceedings of the eighth annual conference of the cognitive science society, pp. 1–12.
Hochreiter, S., and Schmidhuber, J. (1997). “Long short-term memory.” Neural Computation, 9(8), pp. 1735–1780.
Ikonomakis, M., Kotsiantis, S., and Tampakas, V. (2005). “Text Classification Using Machine Learning Techniques.” WSEAS Transactions on Computers, 4(8), pp. 966–974.
Joachims, T. (1998). “Text Categorization with Support Vector Machines: Learning with Many Relevant Features.” In Proceedings of the European Conference on Machine Learning (ECML), pp. 137–142.
Lilleberg, J., Zhu, Y., and Zhang, Y. (2015). “Support Vector Machines and Word2vec for Text Classification with Semantic Features.” In 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), pp. 136–140.
Medlock, B. (2003). “A Language Model Approach to Spam Filtering.” http://www.benmedlock.co.uk/medlock-03.pdf [accessed on Apr. 1, 2008], 7 pages.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). “Efficient Estimation of Word Representations in Vector Space.” CoRR, abs/1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013). “Distributed Representations of Words and Phrases and Their Compositionality.” In NIPS, pp. 3111–3119.
Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011). “Strategies for Training Large Scale Neural Network Language Models.” In Proceedings of Automatic Speech Recognition and Understanding (ASRU), pp. 196–201.
Masand, B., Linoff, G., and Waltz, D. (1992). “Classifying news stories using memory-based reasoning.” In Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval (Kobenhavn, DK, 1992), pp. 59–65.
Olah, C. (2015). “Understanding LSTM Networks.” colah's blog, 27 August. Available at https://colah.github.io/posts/2015-08-Understanding-LSTMs. [Accessed 25 Apr. 2019].
Pennington, J., Socher, R., and Manning, C.D. (2014). “GloVe: Global vectors for word representation.” In Proceedings of the Empirical Methods in Natural Language Processing, pp. 1532–1543.
Pradhan, L., Taneja, N.A., Dixit, C., and Suhag, M. (2017). “Comparison of Text Classifiers on News Articles.” Int. Res. J. Eng. Technol., 4(3), pp. 2513–2517.
Salton, G., and Buckley, C. (1988). “Term weighting approaches in automatic text retrieval.” Information Processing and Management, 24(5), pp. 513–523.
Sebastiani, F. (2002). “Machine learning in automated text categorization.” ACM Computing Surveys, 34(1), pp. 1–47.
Sak, H., Senior, A., and Beaufays, F. (2014). “Long short-term memory recurrent neural network architectures for large scale acoustic modeling.” In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH).
Shen, D., Sun, J., Yang, Q. and Chen, Z. (2006). “Text Classification Improved Through Multigram Models.” In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pp. 672–681.
Su, Z., Xu, H., Zhang, D., and Xu, Y. (2014). “Chinese sentiment classification using a neural network tool - Word2vec.” In 2014 International Conference on Multisensor Fusion and Information Integration for Intelligent Systems (MFI), pp. 1–6.
Sundermeyer, M., Schlüter, R., and Ney, H. (2010). “LSTM neural networks for language modeling.” In INTERSPEECH.
Spärck Jones, K. (1972). “A statistical interpretation of term specificity and its application in retrieval.” Journal of Documentation, 28(1), pp. 11–21.
van Aken, B., Risch, J., Krestel, R., and Löser, A. (2018). “Challenges for toxic comment classification: An in-depth error analysis.” In Proceedings of the Workshop on Abusive Language Online (ALW@EMNLP), pp. 33–42.
Weinberger, K.Q., Blitzer, J., and Saul, L.K. (2006). “Distance metric learning for large margin nearest neighbor classification.” In Advances in Neural Information Processing Systems (NIPS).
Zhang, D., Xu, H., Su, Z., and Xu, Y. (2015). “Chinese Comments Sentiment Classification Based on Word2vec and SVMperf.” Expert Systems with Applications, 42(4), pp. 1857–1863.
Zhu, Z., Zhang, W., Li, G-Z., He, C., and Zhang, L. (2016). “A study of damp-heat syndrome classification using Word2vec and TF-IDF.” In Proceedings of 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 15–18.

Advisor

柯士文 (Shih-Wen Ke)

Approval date

2019-8-20
