Name Ching-Chuan Huang (黃靖筌)   Department Computer Science and Information Engineering
(CA-Wav2Lip: Coordinate Attention-based Speech to Lip Synthesis in the Wild)
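Abstract (Chinese) As demand for online media continues to grow, media creators urgently need their video content translated in order to reach more viewers around the world.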
Abstract (English) With the growing consumption of online visual content, there is an urgent need for video translation in order to reach a wider audience around the world.
However, directly translated and dubbed material does not give a natural audio-visual experience, because the translated speech and the speaker's lip movement are often out of sync.
To improve the viewing experience, a system that automatically generates accurately synchronized lip movement is needed.
To improve the accuracy and visual quality of speech-to-lip generation, this research proposes two techniques: embedding attention mechanisms in the convolution layers, and using SSIM as a loss function in the visual quality discriminator.
The proposed system and several existing systems are evaluated on three audio-visual datasets. The results show that the proposed methods outperform state-of-the-art speech-to-lip synthesis in both the accuracy and the visual quality of the generated lip synchronization.
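The second technique named in the abstract, SSIM as a loss term, can be illustrated with a minimal sketch. This record does not give the thesis's actual formulation, so everything below is an assumption: the function names `global_ssim` and `ssim_loss` are hypothetical, the constants `k1` and `k2` are the commonly used SSIM defaults, and the statistics are computed over the whole image as a single window, whereas practical SSIM implementations usually average over local windows.

```python
def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two grayscale images given as flat lists.

    Simplification (assumption): one window covering the whole image,
    not the sliding local windows used by typical SSIM implementations.
    """
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Small constants that stabilise the division when means or variances are near zero.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_loss(generated, target):
    """Loss that decreases as the generated frame becomes structurally closer to the target."""
    return 1.0 - global_ssim(generated, target)


if __name__ == "__main__":
    target = [0.1, 0.5, 0.9, 0.3]      # hypothetical 2x2 frame, pixel values in [0, 1]
    generated = [0.9, 0.5, 0.1, 0.7]   # hypothetical generator output
    print("loss vs. itself:", ssim_loss(target, target))
    print("loss vs. generated:", ssim_loss(generated, target))
```

Writing the loss as one minus SSIM means that a frame identical to the target gives a loss of approximately zero, so minimising it pushes generated frames toward the structure of the ground truth.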
Keywords (Chinese) ★ 注意力機制 (attention mechanism)
★ 唇形同步 (lip synchronization)
★ 臉部生成 (face synthesis)
Keywords (English) ★ attention mechanism
★ lip synchronization
★ face synthesis
Table of Contents
1 Introduction
2 Related Work
2.1 Text-driven Talking Face Generation
2.2 Audio-driven Talking Face Generation
2.3 Video-driven Talking Face Generation
3 Preliminary
3.1 Convolutional Neural Network
3.2 Convolutional Encoder-decoder
3.3 Residual Learning
3.4 Generative Adversarial Network (GAN)
3.5 SyncNet
3.5.1 Pseudo-siamese Network
3.6 Wav2Lip
3.7 Attention Mechanisms
3.8 Structural Similarity Index Measurement (SSIM)
4 Design
4.1 Motivation
4.2 Problem Statement
4.3 Research Challenges
4.4 Proposed System Architecture
4.4.1 Video Preprocess
4.4.2 Audio Preprocess
4.4.3 Lip Sync Discriminator
4.4.4 Generator
4.4.5 Visual Quality Discriminator
5 Performance
5.1 Datasets
5.2 Evaluation Metrics
5.3 Experimental Setup
5.4 Experimental Results and Analysis
5.4.1 Lip Sync Error
5.4.2 SSIM and MS-SSIM
5.4.3 Fine-tuning of the Weights of the Loss Functions
5.5 Ablation Studies
5.5.1 Performance
5.5.2 Visualization of Attention Feature Maps
6 Conclusion
Advisor Min-Te Sun (孫敏德)   Approval date 2022-7-25
