Master's/Doctoral Thesis 111423036: Detailed Record




Name: Ya-Jen Tsai (蔡亞真)    Graduate Department: Information Management (資訊管理學系)
Thesis Title: A Novel Diffusion-Based Spelling Checking on Hybrid Chinese Characters
Related Theses
★ C2CL: Centroid-Concentrated Contrastive Learning on Fault Detection and Classification in Industry 4.0
Full text: available for browsing in the system after 2029-07-01.
Abstract (Chinese): To facilitate communication with people in different regions, letter writing became widespread, and as times changed, modes of communication grew more diverse and convenient. From traditional letters, text messages, and emails to the evolution of messaging apps, the rigor of text usage has varied, with messages becoming more colloquial. Although this shift poses no major problem for interpersonal communication, the accuracy and recognition of text content have become critical issues in the era of the internet and deep-learning applications. In text-reliant applications such as chatbots and search engines, spelling errors can lead to erroneous judgments and prevent the intended outcomes, so spelling checking plays a vital role. Previous research has typically focused on either Simplified or Traditional Chinese, which causes misjudgments due to the differences between the two writing systems. Moreover, the spread of slang and abbreviations produces content that current models cannot interpret. To address these challenges, this study introduces a novel diffusion-based method, DiffuCSC, aimed at overcoming the limitations of existing research and providing better Chinese spelling checking and correction.
Abstract (English): To facilitate communication with people in different locations, the prevalence of letter writing began, and as times changed, the modes of communication became more diverse and convenient. From traditional letters, text messages, and emails to the evolution of messaging apps, the rigor of text usage has varied, with messages becoming more conversational in nature. While this shift has not posed a significant problem for interpersonal communication, the accuracy and recognition of text content have become critical issues in the internet era and deep learning applications. In text-reliant applications such as chatbots and search engines, incorrect spelling can lead to erroneous judgments, thwarting the intended outcomes. Hence, spelling checking holds paramount importance. Previous research often focused on either Simplified or Traditional Chinese, leading to misjudgments due to the differences between these scripts. Additionally, the proliferation of slang and abbreviations presents content that is undecipherable by current models. To address these challenges, this study introduces a novel approach for mixed Chinese Spelling Checking based on diffusion (DiffuCSC), aimed at overcoming the limitations of existing research and providing improved Chinese spelling checking and correction.
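The abstract describes DiffuCSC only at a high level. As a rough illustration of the iterative-refinement idea behind diffusion-style spelling correction (repeatedly "denoising" a corrupted sentence toward a cleaner one), the toy sketch below swaps characters drawn from a small confusion set whenever the swap raises a bigram language-model score. The corpus, confusion sets, and scoring scheme are all hypothetical stand-ins for illustration only; this is not the thesis's actual DiffuCSC model.

```python
from collections import Counter

# Tiny corpus of correct sentences (hypothetical training data).
corpus = ["我想吃飯", "他想回家", "我要回家"]

# Bigram counts serve as a stand-in "denoiser" scoring function.
bigrams = Counter()
for s in corpus:
    for a, b in zip(s, s[1:]):
        bigrams[(a, b)] += 1

# Confusion sets: visually/phonetically similar characters
# (hypothetical; real systems derive these from resources
# such as the SIGHAN bake-off confusion sets).
confusion = {"她": ["他"], "反": ["飯"], "佳": ["家"]}

def score(sent):
    """Total bigram support for a sentence under the toy model."""
    return sum(bigrams[(a, b)] for a, b in zip(sent, sent[1:]))

def denoise_step(sent):
    """One refinement step: apply the single best confusion-set swap."""
    best, best_score = sent, score(sent)
    for i, ch in enumerate(sent):
        for alt in confusion.get(ch, []):
            cand = sent[:i] + alt + sent[i + 1:]
            if score(cand) > best_score:
                best, best_score = cand, score(cand)
    return best

def correct(sent, steps=3):
    """Iterate denoising steps until the sentence stops changing."""
    for _ in range(steps):
        new = denoise_step(sent)
        if new == sent:
            break
        sent = new
    return sent

print(correct("我想吃反"))  # prints 我想吃飯
```

The stepwise loop mirrors the diffusion intuition that correction proceeds through several small denoising moves rather than one monolithic edit; a sentence with two errors (e.g. "她想回佳") is repaired over two successive steps.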
Keywords (Chinese) ★ Natural Language Processing (自然語言處理)
★ Chinese Spelling Checking (中文拼寫檢查)
★ Diffusion Model (擴散模型)
Keywords (English) ★ Natural Language Processing
★ Chinese Spelling Checking
★ Diffusion Model
Table of Contents
Abstract (Chinese) i
Abstract ii
Table of Contents iii
List of Figures iv
List of Tables v
I. Introduction 1
II. Related Work 7
2.1 Chinese Spelling Checking 7
2.2 Denoising diffusion probabilistic models 11
III. Proposed Method 13
3.1 Chinese Spelling Checking Model 14
3.2 Partial Diffusion Model 19
IV. Experiments and Evaluation 25
4.1 Evaluation Metric and Baseline Models 27
4.2 Character-level Performance Comparison 29
4.3 Sentence-level Performance Comparison 31
4.4 The Importance of Mixing Traditional and Simplified Chinese 35
4.5 Generative Capability of the Diffusion Model 36
4.6 Ablation Study 38
4.7 Parameter Setting 39
4.8 Case Study 43
V. Conclusion 45
Reference 46
References
[1] C. Chang, “A new approach for automatic Chinese spelling correction,” In Proceedings of Natural Language Processing Pacific Rim Symposium, Vol. 95, pp. 278–283, 1995.
[2] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pp. 4171–4186, 2019.
[3] S. Zhang, H. Huang, J. Liu, and H. Li, “Spelling Error Correction with Soft-Masked BERT,” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 882–890, 2020.
[4] X. Zhang, Y. Zheng, H. Yan, and X. Qiu, “Investigating Glyph Phonetic Information for Chinese Spell Checking: What Works and What’s Next,” arXiv preprint arXiv:2212.04068, 2022.
[5] F. Li, Y. Shan, J. Duan, X. Mao, and M. Huang, “WSpeller: Robust Word Segmentation for Enhancing Chinese Spelling Check.” In Findings of the Association for Computational Linguistics: EMNLP, pp. 1179–1188, 2022.
[6] Y. Hsieh, M. Bai, S. Huang, and K. Chen, “Correcting Chinese Spelling Errors with Word Lattice Decoding,” ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 14, no. 4, 2015.
[7] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. Yu, “A Comprehensive Survey on Graph Neural Networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 1, pp. 4–24, 2021.
[8] X. Cheng, W. Xu, K. Chen, S. Jiang, F. Wang, T. Wang, W. Chu, and Y. Qi, “SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check,” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 871–881, 2020.
[9] T. Ji, H. Yan, and X. Qiu, “SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check,” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3544–3551, 2021.
[10] X. Zhang, H. Yan, Y. Sun, and X. Qiu, “SDCL: Self-Distillation Contrastive Learning for Chinese Spell Checking,” arXiv preprint arXiv:2210.17168, 2022.
[11] D. Zhang, Y. Li, Q. Zhou, S. Ma, Y. Li, Y. Cao, and H. Zheng, “Contextual Similarity is More Valuable Than Character Similarity: An Empirical Study for Chinese Spell Checking,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2023.
[12] Q. Zhao, X. Shen, and J. Yao, “IME-Spell: Chinese Spelling Check based on Input Method,” In Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval (NLPIR '20), pp. 85–90, 2020.
[13] L. Huang, J. Li, W. Jiang, Z. Zhang, M. Chen, S. Wang, and J. Xiao, “PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check,” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5958–5967, 2021.
[14] J. Liang, W. Huang, F. Li, and Q. Shi, “DUKE: Distance Fusion and Knowledge Enhanced Framework for Chinese Spelling Check,” In Proceedings of Euro-Asia Conference on Frontiers of Computer Science and Information Technology, FCSIT, pp. 1–5, 2022. 
[15] H. Xu, Z. Li, Q. Zhou, C. Li, Z. Wang, Y. Cao, H. Huang, and X. Mao, “Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking,” In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 716–728, 2021.
[16] Y. Hong, X. Yu, N. He, N. Liu, and J. Liu, “FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm,” In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pp. 160–169, 2019.
[17] Y. Li, Q. Zhou, Y. Li, Z. Li, R. Liu, R. Sun, Z. Wang, C. Li, Y. Cao, and H. Zheng., “The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking,” In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 3202–3213, 2022.
[18] Y. Li, S. Ma, Q. Zhou, Z. Li, Y. Li, S. Huang, R. Liu, C. Li, Y. Cao, and H. Zheng, “Learning from the Dictionary: Heterogeneous Knowledge Guided Fine-tuning for Chinese Spell Checking.” In Findings of the Association for Computational Linguistics: EMNLP, pp. 238–249, 2022.
[19] J. Li, Q. Wang, Z. Mao, J. Guo, Y. Yang, and Y. Zhang, “Improving Chinese Spelling Check by Character Pronunciation Prediction: The Effects of Adaptivity and Granularity.” In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 4275–4286, 2022.
[20] S. Liu, T. Yang, T. Yue, F. Zhang, and D. Wang, “PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction,” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 2991–3000, 2021.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” Advances in Neural Information Processing Systems, pp. 5999–6009, 2017.
[22] C. Hinson, H. Huang, and H. Chen, “Heterogeneous Recycle Generation for Chinese Grammatical Error Correction,” In Proceedings of the 28th International Conference on Computational Linguistics, pp. 2191–2201, 2020.
[23] X. Wu, Y. Wu, “From Spelling to Grammar: A New Framework for Chinese Grammatical Error Correction.” In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 889–902, 2022.
[24] Z. Qiu and Y. Qu, “A Two-Stage Model for Chinese Grammatical Error Correction,” in IEEE Access, pp. 146772–146777, 2019.
[25] J. Ye, Y. Li, S. Ma, R. Xie, W. Wu, and H. Zheng, “Focus Is What You Need For Chinese Grammatical Error Correction,” arXiv preprint arXiv:2210.12692, 2022.
[26] F. Gu, Z. Wang, “Chinese grammatical error correction based on the BERT-BiLSTM-CRF model,” In Third International Conference on Machine Learning and Computer Application (ICMLCA 2022), vol. 12636, pp. 559–564, 2023.
[27] H. Xu, C. He, C. Zhang, Z. Tan, S. Hu, and B. Ge, “A Multi-channel Chinese Text Correction Method Based on Grammatical Error Diagnosis,” In 2022 8th International Conference on Big Data and Information Analytics (BigDIA), pp. 396–401, 2022.
[28] C. Li, J. Zhou, Z. Bao, H. Liu, G. Xu, and L. Li, “A Hybrid System for Chinese Grammatical Error Diagnosis and Correction,” In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 60–69, 2018.
[29] G. Rao, E. Yang, and B. Zhang, “Overview of NLPTEA-2020 Shared Task for Chinese Grammatical Error Diagnosis,” In Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 25–35, 2020.
[30] Y. Luo, Z. Bao, C. Li, and R. Wang, “Chinese Grammatical Error Diagnosis with Graph Convolution Network and Multi-task Learning.” In Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 44–48, 2020.
[31] Y. Cheng, M. Duan, “Chinese Grammatical Error Detection Based on BERT Model.” In Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 108–113, 2020.
[32] S. Wang, B. Wang, J. Gong, Z. Wang, X. Hu, X. Duan, Z. Shen, G. Yue, R. Fu, D. Wu, W. Che, S. Wang, G. Hu, and T. Liu, “Combining ResNet and Transformer for Chinese Grammatical Error Diagnosis.” In Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 36–43, 2020.
[33] J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” Advances in Neural Information Processing Systems (NeurIPS), 2020.
[34] A. Nichol and P. Dhariwal, “Improved Denoising Diffusion Probabilistic Models.” In Proceedings of the 38th International Conference on Machine Learning (PMLR), pp. 8162–8171, 2021.
[35] H. Sasaki, C. Willcocks, and T. Breckon, “UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models,” arXiv preprint arXiv:2104.05358, 2021.
[36] T. Amit, T. Shaharbany, E. Nachmani, and L. Wolf, “SegDiff: Image Segmentation with Diffusion Probabilistic Models,” arXiv preprint arXiv:2112.00390, 2021.
[37] L. Rout, A. Parulekar, C. Caramanis, and S. Shakkottai, “A Theoretical Justification for Image Inpainting using Denoising Diffusion Probabilistic Models,” arXiv preprint arXiv:2302.01217, 2023.
[38] R. Yang, P. Srivastava, and S. Mandt, “Diffusion Probabilistic Modeling for Video Generation,” arXiv preprint arXiv:2203.09481, 2022.
[39] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, “Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech.” In Proceedings of the 38th International Conference on Machine Learning (PMLR), pp. 8599–8608, 2021.
[40] Z. Chen, Y. Wu, Y. Leng, J. Chen, H. Liu, X. Tan, Y. Cui, K. Wang, L. He, S. Zhao, J. Bian, and D. Mandic, “ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech,” arXiv preprint arXiv:2212.14518, 2022.
[41] R. Huang, Z. Zhao, H. Liu, J. Liu, C. Cui, and Y. Ren, “ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech,” In Proceedings of the 30th ACM International Conference on Multimedia, pp. 2595–2605, 2022.
[42] S. Gong, M. Li, J. Feng, Z. Wu, and L. Kong, “DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models,” arXiv preprint arXiv:2210.08933, 2022.
[43] X. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. Hashimoto, “Diffusion-LM Improves Controllable Text Generation,” Advances in Neural Information Processing Systems (NeurIPS), 2022.
[44] Z. Lin, Y. Gong, Y. Shen, T. Wu, Z. Fan, C. Lin, N. Duan, and W. Chen, “Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise,” In Proceedings of the 40th International Conference on Machine Learning, 2023.
[45] Z. Gao, J. Guo, X. Tan, Y. Zhu, F. Zhang, J. Bian, and L. Xu, “Difformer: Empowering Diffusion Models on the Embedding Space for Text Generation,” arXiv preprint arXiv:2212.09412, 2022.
[46] S. Wu, C. Liu, and L. Lee, “Chinese Spelling Check Evaluation at SIGHAN Bake-off 2013.” In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pp. 35–42, 2013.
[47] L. Yu, L. Lee, Y. Tseng, and H. Chen, “Overview of SIGHAN 2014 Bake-off for Chinese Spelling Check,” In Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pp. 126–132, 2014. 
[48] Y. Tseng, L. Lee, L. Chang, and H. Chen, “Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check,” In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pp. 32–37, 2015.
[49] “Far EasTone Telecommunications (遠傳電信FETnet),” https://www.fetnet.net/
[50] D. Wang, Y. Song, J. Li, J. Han, and H. Zhang, “A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2517–2527, 2018.
Advisors: Yi-Cheng Chen and Jen-Ming Chen (陳以錚、陳振明)    Approval Date: 2024-07-17
