Master's and Doctoral Theses: Detailed Record for Thesis 110453015




Name 吳明儒 (Ming-Ju Wu)   Department In-service Master's Program, Department of Information Management
Thesis title 以文字探勘技術分析標籤劫持—以twitter為例
(Analyzing hashtag hijacking with Text Mining Techniques - A Case Study on Twitter)
Full text Available for viewing in the system from 2028-7-1 onward
Abstract As social media platforms have become widespread, so has the use of hashtags. A hashtag highlights and identifies the topic of a post and lets users search by that keyword for related posts. In recent years, however, malicious actors have begun exploiting hashtag search by attaching popular hashtags to unrelated or malicious posts, so that those messages ride on the hashtags' popularity and spread among users. This practice is known as "hashtag hijacking." Its proliferation has significantly degraded the user experience: users often waste considerable time reading posts and judging whether they actually contain the information sought, and more malicious posts can lead to irreversible consequences.
Recognizing the seriousness of the problem, this study takes e-cigarettes, a currently much-discussed topic, as its subject and analyzes hashtag hijacking on Twitter using the hashtag #vaping. Samples are annotated manually, and supervised learning is used to build models: text features are constructed with term frequency-inverse document frequency (TF-IDF), and five classifiers (decision tree, support vector machine, random forest, logistic regression, and gradient boosting) are trained to predict whether a target post has been hijacked. In the experiments, terms are ranked by TF-IDF weight and different proportions of keywords are removed in order to find the most effective model. The results show that, after feature filtering, when only the top 1,500 text features are used as variables, some algorithms perform better than with the unfiltered data. In particular, gradient boosting reaches an AUC of 0.807 with only 1,500 text features. This study demonstrates that an effective hashtag hijacking classification model can be built with a relatively small number of text features and limited computational resources.
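The feature-selection idea in the abstract (weight terms with TF-IDF, rank them, keep only the top k as model inputs) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the example posts, the whitespace tokenizer, the smoothed IDF formula, the choice to rank terms by summed TF-IDF weight, and the function names `tfidf_weights`, `top_k_terms`, and `to_vectors` are all hypothetical. The abstract does not say how per-document weights were aggregated for ranking.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return (idf, rows); rows[i][term] is the TF-IDF weight of term in doc i."""
    tokenized = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        rows.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return idf, rows


def top_k_terms(rows, k):
    """Rank terms by summed TF-IDF weight over all documents; keep the top k."""
    totals = Counter()
    for row in rows:
        for term, weight in row.items():
            totals[term] += weight
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def to_vectors(rows, terms):
    """Turn each document into a fixed-length vector over the selected terms."""
    return [[row.get(t, 0.0) for t in terms] for row in rows]


# Hypothetical example posts; the thesis used manually labelled #vaping tweets.
posts = [
    "#vaping new mango flavor review",
    "#vaping win free followers click link",
    "#vaping flavor ban news",
]
idf, rows = tfidf_weights(posts)
selected = top_k_terms(rows, 5)      # the thesis kept the top 1500 features
features = to_vectors(rows, selected)
print(selected)
print(features[0])
```

The resulting vectors would then be passed to a classifier such as gradient boosting and scored with AUC. Those steps are left out here because the abstract gives no model settings.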
Keywords ★ twitter
★ hashtag hijacking
★ vaping
★ text features
★ term frequency-inverse document frequency (TF-IDF)
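Table of contents Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
Chapter 2 Literature Review
2.1 E-cigarettes
2.2 Text Mining and Natural Language Processing of Text Data
2.3 Applying Machine Learning Methods to Post Analysis
2.4 Text Analysis of Hashtag Hijacking on twitter
Chapter 3 Research Method
3.1 Dataset Source
3.2 Text Preprocessing
3.3 Analysis Methods and Toolkits
3.4 Experimental Design and Evaluation Metrics
Chapter 4 Experimental Results
4.1 Experiment 1
4.2 Experiment 2
Chapter 5 Conclusion
5.1 Significance and Contributions of the Research
5.2 Research Limitations
5.3 Future Research and Recommendations
References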
Advisors 胡雅涵, 周恩頤   Approval date 2023-6-29