dc.description.abstract | With the widespread use of social media platforms, the use of hashtags has also become more prevalent. The primary purpose of hashtags is to emphasize and identify the attributes of a post and allow users to search for related information by using the keyword, enabling them to access the desired content. However, in recent years, malicious actors have started exploiting the relevance of hashtag searches by posting unrelated or malicious content in order to spread such messages among users, a phenomenon known as "hashtag hijacking." The proliferation of hashtag hijacking has significantly affected users' experience, as they often waste a considerable amount of time discerning whether a post contains the relevant information they are seeking. Moreover, encountering malicious posts can lead to irreversible consequences. Recognizing the seriousness of hashtag hijacking, this study focuses on the popular topic of e-cigarettes and analyzes hashtag hijacking using the keyword "#vaping" on the Twitter social platform. The research employs manual annotation of samples and utilizes supervised learning techniques to build models, leveraging term frequency-inverse document frequency (TF-IDF) to establish text features. Five classifiers (decision trees, support vector machines, random forests, logistic regression, and gradient boosting) are used for modeling and prediction to determine whether the target posts have been subjected to hashtag hijacking. Experimental evaluations are conducted by applying TF-IDF to weight the text scores, removing keywords at different proportions, and selecting the most effective model. The results show that after feature selection and extracting the top 1500 text features as variables, some algorithms perform better than they do on the unfiltered feature set. In particular, gradient boosting achieves an AUC value of 0.807 when using only the top 1500 text features.
This study demonstrates that an effective hashtag hijacking classification model can be constructed using fewer text features and fewer computational resources. | en_US |
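The feature-selection step described in the abstract (weight terms with TF-IDF, then keep only the highest-scoring terms as model inputs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the example posts are invented, the whitespace tokenizer, the smoothed IDF variant, and ranking by summed TF-IDF weight are all assumptions, and `k = 5` stands in for the 1500 features reported above. Classifier training and AUC evaluation are omitted.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption; the abstract does not name a variant).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        rows.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return rows

def top_k_terms(rows, k):
    """Rank terms by summed TF-IDF weight across documents; keep the top k."""
    totals = {}
    for row in rows:
        for term, weight in row.items():
            totals[term] = totals.get(term, 0.0) + weight
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def project(rows, kept_terms):
    """Build a dense feature matrix restricted to the kept terms."""
    return [[row.get(term, 0.0) for term in kept_terms] for row in rows]

# Hypothetical posts, invented for illustration only.
posts = [
    "#vaping new mango pods review",
    "#vaping win a free phone click here",
    "#vaping quit smoking with vaping",
    "#vaping free followers click here now",
]
rows = tfidf_rows(posts)
kept = top_k_terms(rows, k=5)      # the study reports k = 1500
features = project(rows, kept)     # rows of this matrix would feed a classifier
```

Each row of `features` is a fixed-length vector over the retained terms, which is the form a supervised classifier such as gradient boosting would take as input together with the manually annotated labels.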