Analyzing hashtag hijacking with Text Mining Techniques - A Case Study on Twitter

DC field	Value	Language
DC.contributor	資訊管理學系在職專班	zh_TW
DC.creator	吳明儒	zh_TW
DC.creator	Ming-Ju Wu	en_US
dc.date.accessioned	2023-06-29T07:39:07Z
dc.date.available	2023-06-29T07:39:07Z
dc.date.issued	2023
dc.identifier.uri	http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=110453015
dc.contributor.department	資訊管理學系在職專班	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	隨著社交媒體平台的普及，標籤的使用也逐漸擴散，標籤的主要目的是為了強調和辨識貼文本身的屬性，並且讓使用者可以通過該關鍵字進行搜索相關聯的文章資訊，以獲取所需的資料。然而，近年來，有心人士開始利用標籤搜索的關聯性，在貼文中張貼不相關或具有惡意的訊息，試圖利用這些關鍵字的熱度讓惡意訊息在使用者間散播，這種行為被稱為「標籤劫持」。標籤劫持的氾濫已經大幅度地影響到使用者們的使用體驗，除了經常會浪費大量的時間在閱讀與判別貼文是否真的帶有自己所想要了解的相關資訊外，更甚者遇到較惡意的貼文時更會造成不可逆的結果。因意識到標籤劫持的嚴重性，本研究選擇時下討論熱度高的電子菸作為研究對象，在twitter社交平台中以#vaping作為研究關鍵字來進行標籤劫持的相關分析，並使用人工方式對樣本進行標註，再以監督式學習法建立模型，透過詞頻-逆文檔頻率建立文字特徵，再利用五種分類器（決策樹、支持向量機、隨機森林、邏輯式回歸、梯度提升樹）進行建模和預測，以判斷目標貼文是否遭到標籤劫持。於實驗中利用TF-IDF對文字加權分數作排序並去除不同比例之關鍵字進行實驗評估，以找出效益最好的模型，實驗結果發現在篩選特徵後再進而擷取前1500個文字特徵作為變數下，有部分演算法表現得比未篩選的資料來得要好，其中梯度提升樹在僅取1500個文字特徵時AUC值達0.807。本研究證明在使用較少量的文字特徵以及運算資源下能夠有效建構標籤劫持的分類模型。	zh_TW
dc.description.abstract	As social media platforms have spread, so has the use of hashtags. A hashtag highlights and identifies the topic of a post and lets users search for related posts by keyword. In recent years, however, malicious actors have begun exploiting hashtag search by attaching popular hashtags to unrelated or malicious posts so that those messages spread among users, a practice known as "hashtag hijacking." Widespread hashtag hijacking has significantly degraded the user experience: users waste considerable time judging whether a post actually contains the information they are looking for, and encountering a malicious post can lead to irreversible consequences. Recognizing the seriousness of the problem, this study takes the widely discussed topic of e-cigarettes as its subject and analyzes hashtag hijacking on Twitter using the hashtag "#vaping". Samples are annotated manually, and supervised learning is used to build models, with term frequency-inverse document frequency (TF-IDF) providing the text features. Five classifiers (decision tree, support vector machine, random forest, logistic regression, and gradient boosting) are trained to predict whether a target post has been subjected to hashtag hijacking. In the experiments, terms are ranked by their TF-IDF weights and different proportions of keywords are removed in order to find the most effective model. The results show that, after feature filtering and keeping only the top 1,500 text features as variables, some algorithms outperform the same algorithms trained on unfiltered data. In particular, gradient boosting achieves an AUC of 0.807 when using only the top 1,500 text features. This study demonstrates that an effective hashtag hijacking classification model can be built with fewer text features and fewer computational resources.	en_US
DC.subject	標籤劫持	zh_TW
DC.subject	文字特徵	zh_TW
DC.subject	詞頻-逆文檔頻率	zh_TW
DC.subject	twitter	en_US
DC.subject	hashtag hijacking	en_US
DC.subject	vaping	en_US
DC.subject	text features	en_US
DC.subject	term frequency-inverse document frequency (TF-IDF)	en_US
DC.title	以文字探勘技術分析標籤劫持—以twitter為例	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Analyzing hashtag hijacking with Text Mining Techniques - A Case Study on Twitter	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US
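
The feature-selection step described in the abstract (weight terms with TF-IDF, rank them, keep only the top-ranked terms as classifier inputs) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption for illustration: the whitespace tokenizer, the idf variant log(N/df) + 1, the ranking rule (summing each term's TF-IDF weight over all posts), and the toy posts are not taken from the thesis, whose actual preprocessing and ranking rule are not stated in this record.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split. A real pipeline would
    # likely also strip URLs, punctuation, and stop words.
    return text.lower().split()


def tfidf_rows(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumed idf variant: log(N / df) + 1, so terms present in every
    # document still get a non-zero weight.
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        rows.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return rows


def top_k_terms(rows, k):
    """Rank terms by summed tf-idf weight (an assumed rule) and keep k."""
    totals = Counter()
    for row in rows:
        for term, weight in row.items():
            totals[term] += weight
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def to_vectors(rows, vocab):
    """Project each document onto the selected vocabulary."""
    return [[row.get(term, 0.0) for term in vocab] for row in rows]


# Hypothetical toy posts; the thesis keeps the top 1500 terms, here k=3.
posts = [
    "#vaping new mod review",
    "#vaping win free phone click",
    "#vaping flavor review",
]
rows = tfidf_rows(posts)
vocab = top_k_terms(rows, k=3)
features = to_vectors(rows, vocab)
```

The resulting `features` matrix (one row per post, one column per retained term) is the kind of input a supervised classifier such as gradient boosting would then be trained on, with AUC computed on held-out labeled posts.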

Thesis 110453015: full metadata record