Abstract: With the widespread use of the internet and social media, as well as the impact of the Covid-19 pandemic, wine consumers increasingly rely on online reviews when making purchasing decisions. This study compares the effectiveness of different text feature extraction methods for wine review text classification, aiming to contribute to the development and application of wine review classification techniques and to improve the efficiency of consumers' choices when purchasing wine. We first crawled 1,500 reviews from the VIVINO wine review website and had experts label their aroma and taste categories. After data preprocessing, we used three feature extraction methods, TF-IDF, Doc2vec, and BERT word embeddings, to generate word vectors. These were paired with five classification models, namely Naive Bayes, Logistic Regression, Random Forest, Support Vector Machine, and XGBoost, to examine the performance and applicability of different feature representations and classifiers in text classification. The results show that, for all five target variables of this red wine dataset, the best-performing combination was the TF-IDF representation paired with the XGBoost classifier, which achieved prediction accuracies above 0.8. Moreover, applying the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance yielded a slight improvement in the models' results, especially in accuracy and precision. However, when the original sample is very large, SMOTE may not be worth using, since the additional time spent rebalancing the data brings only a marginal gain in performance.
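
The pipeline summarized above (TF-IDF features, an XGBoost classifier, and optional SMOTE rebalancing) can be outlined in a short Python sketch. This is a minimal illustration under stated assumptions, not the thesis's actual implementation: the file name, column names ("review", "aroma_label"), and hyperparameters are hypothetical, and the VIVINO data, expert labels, and preprocessing steps are not reproduced here.

    # Minimal sketch of the TF-IDF + XGBoost pipeline described in the abstract,
    # with optional SMOTE rebalancing of the training split.
    # Assumptions: a CSV of labeled reviews with columns "review" and "aroma_label"
    # (both hypothetical names).
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import accuracy_score, precision_score
    from imblearn.over_sampling import SMOTE
    from xgboost import XGBClassifier

    df = pd.read_csv("vivino_reviews_labeled.csv")      # hypothetical file name

    # TF-IDF feature extraction from the review texts
    vectorizer = TfidfVectorizer(max_features=5000)
    X = vectorizer.fit_transform(df["review"])
    y = LabelEncoder().fit_transform(df["aroma_label"])  # encode class labels as integers

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Optional: oversample minority classes in the training split with SMOTE
    X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

    # XGBoost classifier trained on the TF-IDF features
    clf = XGBClassifier(n_estimators=300, random_state=42)
    clf.fit(X_train, y_train)

    pred = clf.predict(X_test)
    print("Accuracy :", accuracy_score(y_test, pred))
    print("Precision:", precision_score(y_test, pred, average="macro", zero_division=0))

To compare results with and without rebalancing, as the abstract discusses, the SMOTE line can simply be commented out and the same train/test split reused.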