dc.description.abstract | With the widespread use of the internet and social media, and under the impact of the COVID-19 pandemic, wine consumers increasingly rely on online reviews to make purchasing decisions. This study compares the effectiveness of different text feature extraction methods for wine review text classification, with the aim of advancing wine review classification techniques and helping consumers choose wine more efficiently. We first crawled 1,500 reviews from the VIVINO wine review website and asked experts to label aroma and taste categories. After data preprocessing, we used TF-IDF, doc2vec, and BERT word embeddings to generate vector representations of the text. We then paired these representations with five classification models (Naive Bayes, Logistic Regression, Random Forest, Support Vector Machine, and XGBoost) to explore the performance and applicability of different feature representations and classifiers in text classification. The results showed that the most suitable combination for the five target variables of this wine dataset was the TF-IDF text transformer paired with the XGBoost classification model, which achieved a prediction accuracy of more than 0.8. Moreover, using the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance slightly improved the model's results, especially in terms of accuracy and precision. However, when the original sample size is very large, SMOTE may not be worth using, because it requires additional processing time and yields only a slight improvement in performance. | en_US |