Application of Deep Learning for Job Fraud Prediction: A Study on Structured Features, Unstructured Features, and Feature Fusion

Please use this permanent URL to cite or link to this document: https://ir.lib.ncu.edu.tw/handle/987654321/98330

Title:	Application of Deep Learning for Job Fraud Prediction: A Study on Structured Features, Unstructured Features, and Feature Fusion (original title: 運用深度學習進行求職詐騙預測：結構化特徵、非結構化特徵與特徵融合之研究)
Author:	Chen, Wei-Chen (陳瑋眞)
Contributor:	Department of Information Management
Keywords:	Machine Learning; Deep Learning; Natural Language Processing (NLP); LSTM; Bi-LSTM
Date:	2025-07-22
Upload time:	2025-10-17 12:38:24 (UTC+8)
Publisher:	National Central University
Abstract:	With the rapid development of the internet and technology, online job hunting has become the main channel for finding work. This has also given rise to a large number of fraudulent job advertisements. Since the pandemic, remote interviews and remote work have become more common, which has made the problem worse. Such scams often use keywords like “high salary,” “no experience required,” or “fresh graduates welcome” to attract applicants. They threaten job seekers' finances and personal data and also damage the image of legitimate companies. Although the government and job platforms continue to run anti-fraud campaigns, scam tactics keep changing, and traditional prevention alone is no longer sufficient. According to statistics from the Ministry of the Interior, job fraud cases continue to grow at a high rate, which underscores the severity of the problem.
Previous studies mostly focused on promoting anti-fraud awareness and rarely analyzed real data in depth. The EMSCAD (Employment Scam Aegean Dataset) dataset, published on Kaggle with nearly 18,000 real job postings, gave researchers a basis for empirical work. Some studies have already used this dataset, but most still rely on traditional machine learning methods. This study applies deep learning together with text mining techniques to identify and intercept online job fraud.
A three-phase experiment compares the effects of text vector dimensionality, feature type (structured, unstructured, and fused), and data balancing on model performance. The models compared include conventional classifiers (such as SVM, Random Forest, and Logistic Regression) and deep learning models (LSTM and Bi-LSTM), each paired with different text representations (BERT, TF-IDF, and Word2Vec).
The results show that TF-IDF (5000 dimensions) and Word2Vec (200 dimensions) are the best-performing text feature settings overall. Models using fused features generally outperform those using only structured or only unstructured features. TF-IDF combined with structured features, paired with Random Forest, SVM, or LSTM, reached an AUC of up to 0.98 and an F1-score of up to 0.89. With a small dataset balanced at a 1:1 ratio, the F1-score improved markedly, indicating better detection of the minority (fraud) class. Even on the original imbalanced data, some models remained stable, with AUC mostly above 0.93. This indicates that fused features with a suitable model can still discriminate well when the class distribution is uneven, and it provides empirical support for building a practical job fraud detection mechanism.
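The feature fusion and 1:1 balancing steps described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the thesis's implementation: the toy postings, the two structured flags, the helper names (`tfidf_matrix`, `fuse`, `undersample_1_to_1`), and the choice of random undersampling to reach the 1:1 ratio are all assumptions made for illustration, and `max_features=10` stands in for the 5000-dimension setting.

```python
import math
import random
from collections import Counter

# Toy postings invented for illustration only (not taken from EMSCAD).
# Each item: (description text, two hypothetical binary flags, label); label 1 = fraudulent.
POSTINGS = [
    ("high salary no experience work from home", [0, 1], 1),
    ("earn money fast no experience needed", [0, 0], 1),
    ("software engineer python backend team", [1, 0], 0),
    ("accountant with three years experience", [1, 0], 0),
    ("marketing manager for retail brand", [1, 1], 0),
    ("nurse for city hospital night shift", [1, 0], 0),
]


def tfidf_matrix(texts, max_features):
    """Return (vocabulary, rows) of plain TF-IDF weights, capped at max_features terms."""
    tokenized = [text.split() for text in texts]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep the terms with the highest document frequency; ties are broken alphabetically.
    vocab = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))[:max_features]
    n_docs = len(texts)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([
            (counts[t] / len(tokens)) * math.log(n_docs / doc_freq[t]) if tokens else 0.0
            for t in vocab
        ])
    return vocab, rows


def fuse(text_rows, structured_rows):
    """Concatenate each TF-IDF row with the matching structured feature row."""
    return [t + [float(v) for v in s] for t, s in zip(text_rows, structured_rows)]


def undersample_1_to_1(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have the same size (assumed strategy)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in keep], [labels[i] for i in keep]


texts = [p[0] for p in POSTINGS]
structured = [p[1] for p in POSTINGS]
labels = [p[2] for p in POSTINGS]

vocab, text_rows = tfidf_matrix(texts, max_features=10)
fused_rows = fuse(text_rows, structured)
balanced_rows, balanced_labels = undersample_1_to_1(fused_rows, labels)

print(len(fused_rows[0]))        # number of TF-IDF columns plus structured columns
print(Counter(balanced_labels))  # class counts after balancing
```

The fused rows can then be passed to any classifier. Keeping the text and structured columns in one matrix is what allows a single model to draw on both feature types.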
Appears in collections:	[Graduate Institute of Information Management] Theses and Dissertations

Files in this item:

File	Description	Size	Format	Views
index.html		0 KB	HTML	44

All items in NCUIR are protected by the original copyright.
