Analyzing Hashtag Hijacking with Text Mining Techniques - A Case Study on Twitter

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/93122

Title:	Analyzing Hashtag Hijacking with Text Mining Techniques - A Case Study on Twitter
Author:	Wu, Ming-Ju
Contributor:	Department of Information Management, In-service Master's Program
Keywords:	hashtag hijacking; vaping; text features; term frequency-inverse document frequency (TF-IDF); Twitter
Date:	2023-06-29
Uploaded:	2024-09-19 16:43:24 (UTC+8)
Publisher:	National Central University
Abstract:	As social media platforms have spread, so has the use of hashtags. A hashtag marks the topic of a post and lets users search by that keyword to find related content. In recent years, however, malicious actors have exploited hashtag search by attaching popular hashtags to unrelated or malicious posts, so that these messages spread among users on the strength of the hashtag's popularity. This practice is known as "hashtag hijacking." Its proliferation has significantly degraded the user experience: users waste considerable time judging whether a post actually contains the information they are looking for, and encountering the more malicious posts can lead to irreversible consequences. Recognizing the seriousness of the problem, this study takes the widely discussed topic of e-cigarettes as its subject and analyzes hashtag hijacking on Twitter using the hashtag #vaping. Samples are labeled manually, and supervised learning models are built on text features derived from term frequency-inverse document frequency (TF-IDF). Five classifiers (decision tree, support vector machine, random forest, logistic regression, and gradient boosting) are used for modeling and prediction to determine whether a target post has been hijacked. In the experiments, terms are ranked by their TF-IDF weights and different proportions of keywords are removed in order to find the best-performing model. The results show that, after feature filtering, when only the top 1,500 text features are used as variables, some algorithms perform better than they do on the unfiltered data. In particular, gradient boosting achieves an AUC of 0.807 with only the top 1,500 text features. The study demonstrates that an effective hashtag hijacking classifier can be built with fewer text features and less computation.
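The TF-IDF weighting and top-k feature selection described in the abstract can be illustrated roughly as follows. This is a minimal sketch, not the thesis's implementation: the abstract does not state which TF-IDF variant was used or how terms were ranked, so the code assumes length-normalized term frequency, idf = log(N / df), and ranking by summed weight across documents. The function names (`tokenize`, `tfidf_vectors`, `top_k_terms`, `restrict`) and the three example tweets are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real tweets need more cleaning.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        vec = {}
        if tokens:  # skip empty documents to avoid dividing by zero
            counts = Counter(tokens)
            for term, count in counts.items():
                tf = count / len(tokens)
                idf = math.log(n_docs / doc_freq[term])
                vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def top_k_terms(vectors, k):
    """Rank terms by summed TF-IDF weight (an assumed criterion) and keep k."""
    totals = {}
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] = totals.get(term, 0.0) + weight
    # Sort by descending weight, breaking ties alphabetically for determinism.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return ranked[:k]


def restrict(vec, vocabulary):
    """Project one document vector onto the selected vocabulary, in order."""
    return [vec.get(term, 0.0) for term in vocabulary]


# Hypothetical tweets, invented purely for illustration.
docs = [
    "#vaping new vape juice flavors review",
    "#vaping health risks of vaping for teens",
    "#vaping win a free phone click here",
]
vectors = tfidf_vectors(docs)
vocabulary = top_k_terms(vectors, k=3)
features = [restrict(vec, vocabulary) for vec in vectors]
```

Because `#vaping` occurs in every example tweet, its idf is log(1) = 0 and it contributes nothing to any vector, which is why the surrounding words, not the hashtag itself, have to carry the signal. The resulting fixed-length `features` rows are the kind of input a classifier such as gradient boosting would be trained on; the thesis reports using the top 1,500 terms rather than 3.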
Appears in collections:	[In-service Master's Program, Department of Information Management] Theses and Dissertations
