dc.description.abstract | Occupational fraud harms companies, stakeholders, employees, and the wider economy, causing both financial and reputational losses. Existing research on fraud prediction relies primarily on financial indicators or reports. In contrast, this study uses real-time textual data from sources external and internal to the enterprise, focusing on community comments and material information. The objective is to evaluate the feasibility and predictive performance of various text representation methods and classification models.
This study collected data from 20 companies that experienced fraud incidents between 2012 and April 2022, identified from the Securities and Futures Investors Protection Center and the Taiwan Economic Journal (TEJ) database. For comparison, 20 non-fraudulent companies of similar asset scale were included. The data comprised three datasets: PTT stock reviews, material information headlines, and their combination, each covering the 18 months prior to news exposure. For each of the three datasets, fraud was predicted using Term Frequency-Inverse Document Frequency (TF-IDF) with machine learning models, Word2Vec with deep learning models, and Chinese pre-trained language models (BERT and RoBERTa). Hyperparameter optimization was performed to enhance model performance, and predictive performance was compared across datasets, text representation methods, and classification models.
The fine-tuned Chinese RoBERTa model achieved the best predictive performance, with an Area Under the Curve (AUC) of 0.91 for material information headlines and 0.82 for both the PTT stock reviews and the combined dataset, demonstrating effective fraud prediction across all three datasets. This study helps auditors identify potential fraud risks from both internal and external perspectives and supports more efficient allocation of audit resources. | en_US |