Abstract (English) |
In recent years, deep learning has emerged as a popular research direction in artificial intelligence. It uses multi-layer neural networks to learn features and patterns from large amounts of data and to produce highly accurate predictions and classifications. It has been applied successfully in domains including speech recognition, image recognition, and natural language processing, and has become a significant driving force in the advancement of artificial intelligence.
This paper applies deep learning to lip reading, analyzing the shape and motion of the lips during speech in order to recognize spoken words. The MIRACL-VC1 dataset serves as the sample dataset. Convolutional Neural Networks (CNNs) extract lip features, which are then used to train both Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) models, and the phrase accuracy of these models is compared. With appropriate data preprocessing, such as time series normalization, and parameter adjustment, the experimental results show that the ResNet152 model consistently exhibits superior performance. The highest accuracy is achieved when ResNet152 is combined with BiLSTM.
In summary, this paper applies deep learning to lip reading on the MIRACL-VC1 dataset, extracting lip features with a CNN and training LSTM and BiLSTM models to recognize spoken words. With suitable data preprocessing and parameter adjustment, the ResNet152 model consistently exhibits superior performance, particularly when combined with BiLSTM. |
References |
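[1]邱建晴 (2016). 以卷積神經網路分析部落格社群網站垃圾文章 [Analyzing spam posts on blog community websites with convolutional neural networks] [Master's thesis, National Taiwan University]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/6ff442
[2]洪文麟 (2016). 深度學習應用於以影像辨識為基礎的個人化推薦系統-以服飾樣式為例 [Applying deep learning to an image-recognition-based personalized recommendation system: The case of clothing styles] [Master's thesis, National Cheng Kung University]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/n7425w
[3]林予凡 (2022). 結合CNN-LSTM神經網路估測鋰離子電池之健康狀態與殘電量 [Estimating the state of health and remaining charge of lithium-ion batteries with a combined CNN-LSTM neural network] [Master's thesis, Tatung University]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/wsvt5t
[4]林育如 (2015). 數字唇語之辨識與應用 [Digit lip reading: Recognition and applications] [Master's thesis, National Dong Hwa University]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/57dtqf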
[5]Deshmukh, N., Ahire, A., Bhandari, S. H., Mali, A., & Warkari, K. (2021). Vision based Lip Reading System using Deep Learning. 2021 International Conference on Computing, Communication and Green Engineering (CCGE), 1–6. https://doi.org/10.1109/CCGE50943.2021.9776430
[6]Fung, I., & Mak, B. (2018). End-To-End Low-Resource Lip-Reading with Maxout CNN and LSTM. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2511–2515. https://doi.org/10.1109/ICASSP.2018.8462280
[7]Ghaleh, V. E. C., & Behrad, A. (2010). Lip contour extraction using RGB color space and fuzzy c-means clustering. 2010 IEEE 9th International Conference on Cybernetic Intelligent Systems, 1–4. https://doi.org/10.1109/UKRICIS.2010.5898135
[8]Huang, Y., Liang, J., Pan, B., & Fan, X. (2010). A new lip-automatic detection and location algorithm in lip-reading system. 2010 IEEE International Conference on Systems, Man and Cybernetics, 2402–2405. https://doi.org/10.1109/ICSMC.2010.5641954
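[9]沈育璋 (2023). 應用CNN與機器學習模式進行UAV水稻田影像判釋精度差異之研究 [A study of differences in interpretation accuracy of UAV rice paddy imagery between CNN and machine learning models] [Master's thesis, Feng Chia University]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/sdrc38
[10]陳柏安 (2022). 應用Mask R-CNN與SVM於無人機多光譜影像之青花菜成熟度分類 [Broccoli maturity classification from UAV multispectral imagery using Mask R-CNN and SVM] [Master's thesis, National Chung Hsing University]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/26abew
[11]何文翔 (2018). 以SVM分類器辨識人體舞姿之研究 [A study on recognizing human dance poses with an SVM classifier] [Master's thesis, National Taiwan Ocean University]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/n3zkb8
[12]鍾明軒 (2017). 基於HOG演算法及SVM分類器之行人偵測技術 [Pedestrian detection based on the HOG algorithm and an SVM classifier] [Master's thesis, Southern Taiwan University of Science and Technology]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/a3m28c
[13]林佳姿 (2020). 搭配類神經CNN、LSTM及DNN方法於高混合度之母音辨識 [Recognition of highly mixed vowels using CNN, LSTM, and DNN neural network methods] [Master's thesis, National Chung Hsing University]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/ed99mu
[14]蔡名彥 (2021). 基於深度學習之人臉膚質檢測 [Facial skin condition detection based on deep learning] [Master's thesis, Southern Taiwan University of Science and Technology]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/5xadrt
[15]孫崧瑋 (2023). 智慧化語意分割辨識農耕地景多樣性 [Intelligent semantic segmentation for identifying agricultural landscape diversity] [Master's thesis, National Yunlin University of Science and Technology]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/mue2tw
[16]郭豐瑋 (2022). 基於LSTM網路之迴轉式起重機運動預測 [Motion prediction of a rotary crane based on an LSTM network] [Master's thesis, National Yang Ming Chiao Tung University]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/z5bf5a
[17]鄧凱中 (2020). LSTM 法則應用於連續手勢辨識之研究──手勢辨識系統軟體與硬體於 FPGA 實作 [A study on applying LSTM to continuous gesture recognition: Software and hardware implementation of a gesture recognition system on FPGA] [Master's thesis, National Taiwan Normal University]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/rnhzyk
[18]邱景鴻 (2022). 基於BiLSTM模型的音樂類別分析 [Music genre analysis based on the BiLSTM model] [Master's thesis, Feng Chia University]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/k6438n
[19]李子昂 (2022). 基於CNN-BiLSTM-Attention網路模型預測貨櫃吞吐量 [Forecasting container throughput with a CNN-BiLSTM-Attention network model] [Master's thesis, National Kaohsiung University of Science and Technology]. National Digital Library of Theses and Dissertations in Taiwan. https://hdl.handle.net/11296/wkf37s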
[20]Bashier, I. H., Mosa, M., & Babikir, S. F. (2021). Sesame Seed Disease Detection Using Image Classification. 2020 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), 1–5. https://doi.org/10.1109/ICCCEEE49695.2021.9429640
[21]Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2017). Lip Reading Sentences in the Wild. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3444–3453. https://doi.org/10.1109/CVPR.2017.367
[22]CNN 基礎與概念 [CNN basics and concepts]. (2021, October 31). 知勢 - 提供AI新知與觀點的媒體. https://edge.aif.tw/about-cnn/
[23]CS231n Convolutional Neural Networks for Visual Recognition. (n.d.). Retrieved May 20, 2023, from https://cs231n.github.io/convolutional-networks/
[24]Deshpande, A. (n.d.). Adit Deshpande – Engineering at Forward | UCLA CS ’19. Retrieved May 20, 2023, from https://adeshpande3.github.io/
[25]He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition (arXiv:1512.03385). arXiv. http://arxiv.org/abs/1512.03385
[26]iThome. (n.d.). Day 09:CNN 經典模型應用 [Day 09: Applications of classic CNN models]. iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天. Retrieved May 20, 2023, from https://ithelp.ithome.com.tw/articles/10192162
[27]James, Y. (2021, June 26). [資料分析&機器學習] 第5.1講: 卷積神經網絡介紹(Convolutional Neural Network) [Data analysis & machine learning, lecture 5.1: Introduction to convolutional neural networks]. JamesLearningNote. https://medium.com/jameslearningnote/資料分析-機器學習-第5-1講-卷積神經網絡介紹-convolutional-neural-network-4f8249d65d4f
[28]KevinLuo. (2022, February 16). 好用的深度學習CNN預訓練模型框架總整理: 從AlexNet到EfficientNet(ML 隨筆) [A roundup of useful pretrained deep learning CNN model architectures: From AlexNet to EfficientNet (ML notes)]. Medium. https://kilong31442.medium.com/好用的深度學習cnn預訓練模型框架總整理-從alexnet到efficientnet-ml-隨筆-f2ccb7a65621
[29]Saha, S. (2022, November 16). A Comprehensive Guide to Convolutional Neural Networks—The ELI5 way. Medium. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
[30]Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition (arXiv:1409.1556). arXiv. http://arxiv.org/abs/1409.1556
[31]Sindhura, P., Preethi, S. J., & Niranjana, K. B. (2018). Convolutional Neural Networks for Predicting Words: A Lip-Reading System. 2018 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT), 929–933. https://doi.org/10.1109/ICEECCOT43722.2018.9001505 |