dc.description.abstract | In today's world, anyone can comment publicly on what they have read, including newspapers, magazines, and books. Online reviews are widely considered trustworthy. Users can post reviews in several forms, such as star ratings, text, images, and videos. Most users also browse the reviews on websites before purchasing goods or experiences. However, the sheer volume of information on the Internet creates a constant state of information overload; data mining techniques can be employed to address this problem.
This thesis studies the helpfulness of online hotel reviews. During data preprocessing, we found that real-world review datasets commonly contain missing attribute values. In the literature, no study has focused on examining the performance of different types of techniques for handling incomplete online review datasets.
The experiment consists of two studies. In the first study, the dataset is collected from TripAdvisor, where some reviewer-related information, such as reviewer level, age, and sex, is missing. Three types of techniques are compared: case deletion; imputation methods, including mean/mode, KNN, and SVM; and handling the incomplete dataset directly with C5.0, without imputation. In the second study, missing values are simulated in the dataset at missing rates from 10% to 50%. The experimental results of the two studies show that the C5.0 decision tree algorithm is the better choice for dealing with missing values in online review datasets. | en_US |