劉鴻儀(Hung-Yi Liu)
Thesis title 利用文字探勘技術分析評論特徵因子對於體驗品評論有益性之影響:以IMDb 為例
(Using text mining technology to analyze the influence characteristic factors on the helpfulness of reviews on experience goods: An example of IMDb)
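Abstract (Chinese) Online reviews have become an important reference for consumers' purchasing decisions. They give consumers a wealth of reference information but also bring the problem of data overload. Consumer online reviews fall into two kinds: reviews of experience goods and reviews of search goods. In general, reviews of search goods focus more on product functions, whereas each person experiences an experience good differently, so it is usually hard to judge whether a review of one is helpful for a purchase decision.
The subject of this study is movie reviews. Movie reviews and movie information were collected from the IMDb website with a web crawler to form the dataset. After preprocessing, the dataset contains 116,593 records, organized into three movie genres (Action, Adventure, and Drama) and nine feature categories, which are divided into textual and non-textual types according to whether text mining techniques are used. For regression prediction, the supervised machine learning models Random Forest, XGBoost, and AdaBoost are used, and five experiments are designed; within the experiments, combinations of feature categories are chosen by forward stepwise selection.
The experimental results show that Random Forest gives the best results among the prediction methods. With non-textual feature categories, the results indicate that readers believe reviews written by reviewers they trust and refer to that reviewer's vote on the movie; in addition, a longer time since the review was posted and a shorter interval between the movie's release and the review's posting both help predict review helpfulness. With textual feature categories, the BERT text vector is the most important single textual feature category. In combinations, the Drama genre draws on more textual feature categories than the Action/Adventure genres, because Drama films carry more plot, so readers read the review itself in more detail rather than relying only on its sentiment or keywords. With non-textual and textual feature categories together, the textual feature categories do not necessarily improve the accuracy of the movie review helpfulness model. Finally, the experimental data show that stepwise regression can effectively find feature-category combinations and improve the accuracy of review helpfulness prediction.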
Abstract (English) Online reviews have become an important reference for consumers in their purchasing decisions. They provide a wealth of information but also confront consumers with data overload. Online reviews can be divided into two kinds: reviews of experience goods and reviews of search goods. Generally, reviews of search goods focus more on product features, whereas reviews of experience goods are more subjective and emotional because each person's experience is different, so it is difficult to judge whether such a review is useful in making a purchase decision.
The research object of this study is movie reviews. NLP techniques are used to analyze how review feature categories influence the helpfulness of movie reviews, which helps consumers find helpful reviews of experience goods among a large number of reviews. Movie reviews and movie information were collected from the IMDb website with a web crawler to form the dataset of this study. After preprocessing, the dataset contains 116,593 records. The data are organized into three movie genres (Action, Adventure, and Drama) and nine feature categories, and the feature categories are split into textual and non-textual types according to whether text mining techniques are used. For regression prediction, we adopt the supervised machine learning models Random Forest, XGBoost, and AdaBoost and design five experiments. Forward stepwise selection is used to test combinations of feature categories and select the best combination.
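The forward selection over feature categories described above can be sketched as follows. This is a minimal illustration, not the procedure implemented in the thesis: the category names, the toy error function, and its numbers are all hypothetical. In practice the error function would train one of the regressors (for example Random Forest) on the chosen categories and return a validation error.

```python
from typing import Callable, Iterable, List, Tuple


def forward_select(
    categories: Iterable[str],
    error_fn: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Greedy forward selection over feature categories (lower error is better)."""
    remaining = list(categories)
    selected: List[str] = []
    best_error = float("inf")
    while remaining:
        # Try adding each remaining category to the current combination.
        trials = [(error_fn(selected + [c]), c) for c in remaining]
        trial_error, trial_cat = min(trials)
        if trial_error >= best_error:
            break  # no candidate lowers the error, so stop
        selected.append(trial_cat)
        remaining.remove(trial_cat)
        best_error = trial_error
    return selected, best_error


# Hypothetical stand-in for "train a regressor on these categories and
# return its validation error"; the numbers are made up for illustration.
TOY_GAINS = {"reviewer": 0.30, "time": 0.20, "bert": 0.05, "sentiment": 0.01}


def toy_error(subset: List[str]) -> float:
    # Each category lowers the error by its gain but costs a small
    # fixed penalty, so weak categories are eventually rejected.
    return 1.0 - sum(TOY_GAINS[c] for c in subset) + 0.02 * len(subset)


if __name__ == "__main__":
    chosen, err = forward_select(TOY_GAINS, toy_error)
    print(chosen, round(err, 2))
```

The search stops as soon as no remaining category lowers the error, so the selected combination can be smaller than the full set of categories.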
The experiments show that the Random Forest method achieves the best results. For the non-textual feature categories, readers trust reviews from reviewers they trust and refer to that reviewer's vote for the movie; in addition, a longer time since the review was posted and a shorter interval between the release of the film and the posting of the review both help predict review helpfulness. For the textual feature categories, the BERT text vector is the most important single textual feature category. In terms of combinations, the Drama genre draws on more textual feature categories than the Action/Adventure genres, because readers read the review itself in more detail rather than referring only to its sentiment or keywords. When non-textual and textual feature categories are used together, the textual feature categories do not necessarily improve the accuracy of the review helpfulness model across the film genres. Finally, the results show that stepwise regression can effectively improve the accuracy of helpfulness prediction.
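The two time-related signals mentioned above (how long ago a review was posted, and how soon after the film's release it appeared) could be computed along the following lines. This is a sketch under assumptions: the function name, the use of a crawl date as the reference point for review age, and the unit of days are illustrative choices that the abstract does not specify.

```python
from datetime import date
from typing import Tuple


def time_features(
    release_date: date, review_date: date, crawl_date: date
) -> Tuple[int, int]:
    """Return (review_age_days, release_gap_days).

    review_age_days: days from the review's posting to the crawl date
        (assumed reference point).
    release_gap_days: days from the film's release to the review's posting.
    """
    review_age_days = (crawl_date - review_date).days
    release_gap_days = (review_date - release_date).days
    return review_age_days, release_gap_days


if __name__ == "__main__":
    age, gap = time_features(date(2020, 1, 1), date(2020, 1, 11), date(2021, 1, 11))
    print(age, gap)
```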
Keywords (Chinese) ★ 情感分析
★ 逐步迴歸向前選擇法
★ 評論有益性
Keywords (English) ★ Sentiment Analysis
★ Stepwise regression forward selection
★ Review Helpfulness
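Table of contents
Chapter 1 Introduction
1.1 Research background
1.2 Research motivation
1.3 Research objectives
Chapter 2 Literature review
2.1 Review helpfulness
2.2 Related studies on movie review helpfulness
Chapter 3 Research method
3.1 Dataset source
3.2 Data preprocessing
3.3 Research variables
3.4 Experimental design
3.5 Data validation and evaluation metrics
Chapter 4 Empirical results and analysis
4.1 Experimental results
4.2 Summary of experiments
Chapter 5 Conclusions and suggestions
5.1 Conclusions and contributions
5.2 Research limitations
5.3 Future research directions and suggestions
References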

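Figure 1: Overall research process framework
Figure 2: Example of target fields crawled from an IMDb review page
Figure 3: Experiment 1, comparison of prediction methods by average evaluation metrics (textual / non-textual / textual + non-textual)
Figure 4: Example steps of the forward stepwise selection used in this study
Figure 5: Experiment 2, comparison of single and combined non-textual categories
Figure 6: Experiment 3, comparison of single and combined textual categories
Figure 7: Experiment 4, comparison of single and combined non-textual + textual categories
Figure 8: Experiment 5, all feature variables + Feature Selection (CFS)
Figure 9: Weka Feature Selection (CFS) parameter settings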

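Table 1: Summary of literature on review helpfulness feature variables
Table 2: Fields of the Crawler Table in the database
Table 3: Comparison of prediction methods by average evaluation metrics (textual / non-textual / textual + non-textual)
Table 4: Comparison of single non-textual categories
Table 5: Comparison of non-textual category combinations
Table 6: Comparison of single textual categories
Table 7: Comparison of textual category combinations
Table 8: Comparison of single non-textual + textual categories
Table 9: Comparison of non-textual + textual category combinations
Table 10: Results of all feature variables + Feature Selection (CFS) by movie genre
Table 11: Summary comparison of Feature Selection (CFS) results and Experiment 4 results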
Advisor 胡雅涵 Approval date 2023-4-21
