Automatic Classification of Product Names Using Keyword Expansion with word2vec

Name

Hsiu-Ying Shih

Department

Department of Information Management

Thesis title

Automatic Classification of Product Names Using Keyword Expansion with word2vec

Files

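This electronic thesis is authorized for immediate open access.
Full texts that have reached their open-access date are licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.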

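Abstract (Chinese)

As the Internet and information technology have become widespread, people obtain information online in less and less time, while the volume of information has grown sharply. This excess has gradually produced the problem of information explosion. The digital documents of enterprises and institutions also keep accumulating rapidly, to the point that they are difficult to manage and use effectively. Text classification emerged in response: automated techniques assist manual classification to cope with the surge in classification demand. Traditionally, document classification was done manually. In recent years, deep learning has been widely discussed and applied across many kinds of research, and many studies report that deep learning techniques can improve results or performance.
This study uses the names of products that consumers purchased in physical stores and applies an emerging deep learning technique, the word2vec word embedding model, to automatic document classification. Drawing on the model's ability to learn semantic relationships between words on its own, products are automatically assigned to the correct category. Four experiments examine how word2vec models trained under different factors affect classification performance, and the results confirm that expanding keywords with word2vec improves classification performance.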

Abstract (English)

With the spread of the Internet and information technology, people obtain more information online, and the amount of information has increased dramatically. This excess has gradually created the problem of information explosion. The digital documents of enterprises and organizations are also accumulating rapidly, in volumes too large to manage and use effectively. Text classification emerged in response, using automated techniques to cope with the surge in classification needs. Traditionally, text classification was done manually. In recent years, deep learning has been widely discussed and applied in a variety of studies, and much of the literature shows that deep learning techniques can improve results or performance.
This study uses the names of products purchased by consumers in physical stores and applies the word2vec word embedding model, a deep learning technique, to automatic document classification. Because the model learns semantic relationships on its own, products can be automatically classified into the correct category. A number of experiments explore how word2vec word embedding models trained under different factors affect classification effectiveness. Finally, the study confirms that using word2vec to expand keywords improves classification performance.
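
The keyword-expansion idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. The toy vectors stand in for a trained word2vec model, the English tokens stand in for segmented Chinese product names, and the function names `expand_keywords` and `classify`, the parameters `topn` and `min_sim`, and the category names are all hypothetical choices made for illustration.

```python
import math

# Toy word vectors standing in for a trained word2vec model (made-up values).
TOY_VECTORS = {
    "milk": [0.9, 0.1, 0.0],
    "yogurt": [0.8, 0.2, 0.1],
    "cheese": [0.7, 0.3, 0.0],
    "shampoo": [0.0, 0.9, 0.2],
    "conditioner": [0.1, 0.8, 0.3],
    "battery": [0.1, 0.0, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, vectors, topn=2, min_sim=0.9):
    """Add to each seed its `topn` nearest words whose similarity is >= `min_sim`."""
    expanded = set(seeds)
    for seed in seeds:
        if seed not in vectors:
            continue  # a seed without a vector cannot be expanded
        scored = sorted(
            ((cosine(vectors[seed], vec), word)
             for word, vec in vectors.items() if word != seed),
            reverse=True,
        )
        for similarity, word in scored[:topn]:
            if similarity >= min_sim:
                expanded.add(word)
    return expanded


def classify(tokens, category_keywords):
    """Return the category whose keyword set matches the most tokens, or None."""
    best_category, best_hits = None, 0
    for category, keywords in category_keywords.items():
        hits = sum(1 for token in tokens if token in keywords)
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category


# Hypothetical seed keywords per category, and one tokenized product name.
seeds = {"dairy": ["milk"], "personal_care": ["shampoo"]}
expanded = {cat: expand_keywords(words, TOY_VECTORS) for cat, words in seeds.items()}
product = ["organic", "yogurt"]

before = classify(product, {cat: set(words) for cat, words in seeds.items()})
after = classify(product, expanded)
print(before, after)
```

With only the seed keywords, the product name matches no category. After expansion, "yogurt" is among the dairy keywords, so the product is assigned to that category. In practice the vectors would come from a trained word2vec model rather than a hand-written table.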

Keywords (Chinese)

★ Text Classification
★ Deep Learning
★ Word Embedding
★ Automatic Classification

Keywords (English)

★ word2vec
★ Text Classification
★ Word Embedding
★ Deep Learning
★ Automatic Classification

Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
2. Related Work
2-1 Feature Term Extraction
2-2 Feature Term Weighting
2-2-1 Term Frequency (TF)
2-2-2 Inverse Document Frequency (IDF)
2-2-3 TF-IDF
2-3 Word Embeddings
2-4 The word2vec Word Embedding Model
2-5 Text Classification
3. Research Method
3-1 Research Framework
3-2 Data Preprocessing
3-3 Keyword Generation
3-3-1 Training the word2vec Model
3-3-2 Computing Weights of External Reference Keywords
3-3-3 Feature Term Selection
3-3-4 Keyword Expansion
3-4 Text Classification
3-4-1 Matching Product Names against Keywords
3-4-2 Manually Extracted Keywords
3-5 Evaluation
4. Experiments and Results
4-1 Dataset
4-2 Experimental Design
4-2-1 Experiment 1: Effect of the Dimensionality of the word2vec Word Embedding Model
4-2-2 Experiment 2: Effect of the word2vec Word Embedding Model Architecture
4-2-3 Experiment 3: Effect of the Training Corpus of the word2vec Word Embedding Model
4-2-4 Experiment 4: Effect of Expanding Keywords with the word2vec Word Embedding Model
5. Conclusions and Future Research Directions
5-1 Conclusions
5-2 Research Limitations
5-3 Future Research Directions
References

Advisor

林熙禎(Shi-Jen Lin)

Approval date

2018-7-27
