Using Text Mining Techniques to Predict Financial Fraud: Taking PTT Stock and Material Information as an Example

Name

蘇郁雅 (Yu-Ya Su)

Department

Department of Information Management (In-Service Master's Program)

Thesis Title

應用文字探勘技術預測企業財務舞弊：以 PTT 股票版及重大訊息為例
(Using Text Mining Techniques to Predict Financial Fraud: Taking PTT Stock and Material Information as an example)

Files

Full text viewable in the repository system (open after 2028-7-1)

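Abstract (translated from the Chinese)

Occupational fraud not only causes financial and reputational losses for companies but also harms stakeholders such as investors and employees, as well as the wider economy. Existing fraud prediction research relies mainly on financial indicators or financial reports. This study instead examines timely textual data from outside and inside the company, using social media comments and material information announcements as data sources, and runs experiments that combine different text representation methods and classification models to assess feasibility and predictive performance.
The study selects twenty companies involved in fraud events between 2012 and April 2022, drawn from compensation claims filed with the Securities and Futures Investors Protection Center and from the Taiwan Economic Journal (TEJ) database, and matches them with twenty non-fraud companies of similar asset size as controls. PTT Stock board comments and material information subject lines were collected from eighteen months before the news exposure up to the day before it. Three datasets are used: PTT Stock board comments, material information subject lines, and the combination of the two. Three families of text representation and classification model are applied to build fraud detection models: Term Frequency-Inverse Document Frequency (TF-IDF) with machine learning classifiers, Word2Vec word vectors with deep learning models, and the Chinese pre-trained language models BERT and RoBERTa. Hyperparameter optimization is used to improve model performance, and prediction results are compared across datasets, text representation methods, and classification models.
The experiments show that the fine-tuned Chinese RoBERTa language model achieves the best predictive performance. On the material information subject-line dataset, the AUC (Area Under Curve) reaches 0.91; on the PTT Stock board comment dataset and on the combined dataset, the AUC reaches 0.82 in both cases, indicating that all three datasets can effectively predict fraud. The study offers internal and external auditors a way to gauge fraud risk from publicly circulating news and announcements, and can serve as a reference for allocating audit resources.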
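For readers unfamiliar with the TF-IDF representation named in the abstract, the following is a minimal sketch of one common weighting variant (term count divided by document length, times the natural log of N over document frequency). It is not the thesis's implementation: the function name `tfidf` and the toy English tokens are assumptions made for illustration, and the record does not say which TF-IDF variant or word segmenter the author used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = ln(N / df).
    Smoothed or normalized variants are equally common.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-segmented toy posts (not taken from the thesis data).
docs = [
    ["revenue", "restated", "auditor", "resigned"],
    ["revenue", "growth", "dividend"],
    ["auditor", "resigned", "chairman", "investigated"],
]
vectors = tfidf(docs)
```

Terms that occur in fewer documents receive larger weights, so in the toy data "restated" outweighs "revenue" within the first post. In practice the resulting sparse vectors would be passed to a classifier.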

Abstract (English)

Occupational fraud harms companies, stakeholders, employees, and the economy, causing both financial and reputational losses. Existing research on fraud prediction relies primarily on financial indicators or financial reports. In contrast, this study uses timely textual data from sources external and internal to the enterprise, namely community comments and material information announcements. The objective is to evaluate the feasibility and predictive performance of various text representation methods and classification models.
This study collected data on 20 companies that experienced fraud incidents between 2012 and April 2022, identified from the Securities and Futures Investors Protection Center and the Taiwan Economic Journal (TEJ) database. For comparison, 20 non-fraudulent companies of similar asset size were included. The data comprised PTT stock reviews, material information headlines, and their combination, covering the 18 months prior to news exposure. On each of these three datasets, fraud was predicted using Term Frequency-Inverse Document Frequency (TF-IDF) with machine learning models, Word2Vec with deep learning models, and Chinese pre-trained language models (BERT and RoBERTa). Hyperparameter optimization was performed to enhance model performance, and predictive performance was compared across datasets, text representation methods, and classification models.
The fine-tuned Chinese RoBERTa model achieved the best predictive performance, with an Area Under Curve (AUC) of 0.91 for material information headlines and 0.82 for both the PTT stock reviews and the combined dataset, demonstrating effective fraud prediction across all three datasets. This study gives auditors a way to identify potential fraud risks from both internal and external perspectives and provides a reference for allocating audit resources.
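The AUC figures above can be read as the probability that a randomly chosen fraud sample receives a higher model score than a randomly chosen non-fraud sample. The sketch below computes AUC directly from that pairwise definition. It is an illustration only: the function name `roc_auc`, the labels, and the scores are made up for this example and are not results from the thesis.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = fraud) and classifier scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples, which is fine for a forty-company study; library routines use a sort-based computation instead.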

Keywords

★ Fraud Prediction (舞弊預測)
★ Social Media Reviews (社群評論)
★ Material Information (重大訊息)
★ Text Representations (文字表示方法)

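Table of Contents

Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
2. Literature Review
2-1 Research on Fraud Prediction Using Data Mining Techniques
2-2 Textual Data Sources
2-2-1 Social Media Comments
2-2-2 Material Information Announcements
2-3 Text Representation Methods
2-3-1 TF-IDF
2-3-2 Word2Vec
2-3-3 BERT and RoBERTa
2-4 Text Classification Techniques
2-4-1 Machine Learning Algorithms
2-4-2 Deep Learning Algorithms
3. Research Method
3-1 Research Framework
3-2 Data Collection
3-2-1 Selection of Fraud Company Samples
3-2-2 Dataset Sources
3-2-3 Company Name Identification
3-2-4 Matching of Non-Fraud Companies
3-3 Data Preprocessing
3-3-1 Data Cleaning
3-3-2 Word Segmentation
3-4 Dataset Overview
3-5 Experimental Design
3-5-1 Experimental Environment
3-5-2 TF-IDF with Machine Learning Methods
3-5-3 Word2Vec with Deep Learning Methods
3-5-4 Configuration of the Pre-trained Language Models BERT/RoBERTa
3-6 Evaluation Metrics
4. Experimental Results and Analysis
4-1 Fraud Prediction Results on the PTT Stock Board Comment Dataset
4-2 Fraud Prediction Results on the Material Information Subject-Line Dataset
4-3 Results on the Combined PTT Comment and Material Information Subject-Line Dataset
4-4 Overall Discussion
5. Conclusions and Recommendations
5-1 Conclusions and Contributions
5-2 Limitations and Suggestions for Future Research
References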

Advisor

蔡志豐 (Chih-Fong Tsai)

Approval Date

2023-6-28
