dc.description.abstract | In recent years, deep learning has emerged as a major research direction in artificial intelligence. Deep learning uses multi-layer neural networks to learn features and patterns from large amounts of data, enabling accurate prediction and classification. It has been applied successfully in many domains, including speech recognition, image recognition, and natural language processing, and has become a significant driving force in the advancement of artificial intelligence.
This paper applies deep learning to lip reading: it analyzes the shape and motion of the lips during speech in order to recognize the spoken words. The MIRACL-VC1 dataset is used as the sample dataset. Convolutional Neural Networks (CNNs), including ResNet152, are employed to extract lip features, which are then classified with both Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) models, and the phrase accuracy of these models is compared. With appropriate data preprocessing, such as time-series normalization, and parameter tuning, the experimental results show that the ResNet152 backbone consistently performs best, with the highest accuracy achieved when ResNet152 is combined with BiLSTM.
In summary, this paper explores deep learning for lip reading: lip features from the MIRACL-VC1 dataset are extracted with a CNN and classified with LSTM and BiLSTM models. With suitable data preprocessing and parameter tuning, the experiments consistently highlight the superior performance of the ResNet152 model, particularly when combined with BiLSTM. | en_US |
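To make the described pipeline concrete, below is a minimal, framework-free sketch of the CNN-features-then-BiLSTM architecture the abstract outlines. All shapes, layer sizes, and weights here are illustrative assumptions, not the thesis configuration: the per-frame "CNN features" are random stand-ins for a ResNet152 backbone's output, and the weights are randomly initialized rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(frames, W, U, b, hidden):
    """Forward-only LSTM over a (T, D) sequence; returns the final hidden state."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in frames:
        z = W @ x + U @ h + b                # stacked gate pre-activations, shape (4H,)
        i, f, o, g = np.split(z, 4)          # input, forget, output gates; cell candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

T, D, H, C = 12, 64, 32, 10                  # frames, feature dim, hidden size, classes (all illustrative)

# Stand-in for per-frame lip features that a CNN backbone would produce.
feats = rng.standard_normal((T, D))

def init(hidden, dim):
    """Randomly initialized LSTM parameters (stand-ins for trained weights)."""
    return (rng.standard_normal((4 * hidden, dim)) * 0.1,
            rng.standard_normal((4 * hidden, hidden)) * 0.1,
            np.zeros(4 * hidden))

fw, bw = init(H, D), init(H, D)

# Bidirectional: run the frame sequence forwards and backwards, then concatenate.
h_fw = lstm_pass(feats, *fw, H)
h_bw = lstm_pass(feats[::-1], *bw, H)
h = np.concatenate([h_fw, h_bw])             # shape (2H,)

# Linear classifier over word/phrase classes, softmax-normalized.
Wc = rng.standard_normal((C, 2 * H)) * 0.1
logits = Wc @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)                           # (10,)
```

Running the backward pass over the reversed sequence and concatenating both final hidden states is what distinguishes the BiLSTM from the plain LSTM variant compared in the abstract; a unidirectional model would use `h_fw` alone.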