References
[1] Triantafyllos Afouras, Joon Son Chung, Andrew W. Senior, Oriol Vinyals, and Andrew Zisserman. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[2] Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. LRS3-TED: A large-scale dataset for visual speech recognition. ArXiv, abs/1809.00496, 2018.
[3] Hyeon-woo An and Nammee Moon. Design of recommendation system for tourist spot using sentiment analysis based on CNN-LSTM. Journal of Ambient Intelligence and Humanized Computing, pages 1–11, 2019.
[4] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:2481–2495, 2017.
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2015.
[6] John Brandon. New survey says we’re spending 7 hours per day consuming online media. https://www.forbes.com/sites/johnbbrandon/2020/11/17/new-survey-says-were-spending-7-hours-per-day-consuming-online-media/?sh=150a8f416b46. Accessed: 2022-03-22.
[7] Lele Chen, Ross Maddox, Zhiyao Duan, and Chenliang Xu. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[8] Lilin Cheng, Suzhe Wang, Zhimeng Zhang, Yu Ding, Yixing Zheng, Xin Yu, and
Changjie Fan. Write-a-speaker: Text-based emotional and rhythmic talking-head
generation. In AAAI, 2021.
[9] Joon Son Chung and Andrew Zisserman. Out of time: Automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV, 2016.
[10] Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. You said that? ArXiv,
2017.
[11] Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Lip reading sentences in the wild. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua
Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold,
Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words:
Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2021.
[13] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619, 2018.
[14] Ohad Fried, Ayush Tewari, Michael Zollhöfer, Adam Finkelstein, Eli Shechtman, Dan B. Goldman, Kyle Genova, Zeyu Jin, Christian Theobalt, and Maneesh Agrawala. Text-based editing of talking-head video. ACM Transactions on Graphics (TOG), 38:1–14, 2019.
[15] Matthias Funk. How many YouTube channels are there? https://www.tubics.com/blog/number-of-youtube-channels. Accessed: 2022-07-01.
[16] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets.
In NIPS, 2014.
[17] Md Rashidul Hasan, Mustafa Jamil, MGRMS Rahman, et al. Speaker identification using Mel frequency cepstral coefficients. Variations, 1(4):565–568, 2004.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[19] Qibin Hou, Daquan Zhou, and Jiashi Feng. Coordinate attention for efficient mobile network design. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13708–13717, 2021.
[20] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[21] Viktor Igeland. Generating facial animation with emotions in a neural text-to-speech pipeline. Master's thesis, Linköping University, 2019.
[22] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image
translation with conditional adversarial networks. CoRR, 2016.
[23] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, 2017.
[24] Amir Jamaludin, Joon Son Chung, and Andrew Zisserman. You said that?: Synthesising talking faces from audio. International Journal of Computer Vision, pages 1–13, 2019.
[25] Prajwal K R, Rudrabha Mukhopadhyay, Jerin Philip, Abhishek Jha, Vinay Namboodiri, and C. V. Jawahar. Towards automatic face-to-face translation. Proceedings of the 27th ACM International Conference on Multimedia, 2019.
[26] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural
network for modelling sentences. In ACL, 2014.
[27] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Proc. NeurIPS, 2021.
[28] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and
Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proc.
CVPR, 2020.
[29] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. Deep video portraits. ACM Transactions on Graphics (TOG), 37(4):163, 2018.
[30] Yoon Kim. Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014.
[31] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
CoRR, 2015.
[32] Rithesh Kumar, Jose M. R. Sotelo, Kundan Kumar, Alexandre de Brébisson, and Yoshua Bengio. ObamaNet: Photo-realistic lip-sync from text. ArXiv, 2018.
[33] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 105–114, 2017.
[34] Jinglin Liu, Zhiying Zhu, Yi Ren, Wencan Huang, Baoxing Huai, Nicholas Jing Yuan, and Zhou Zhao. Parallel and high-fidelity text-to-lip generation. In AAAI, 2021.
[35] Ze Lu, Xudong Jiang, and Alex Kot. Deep coupled resnet for low-resolution face
recognition. IEEE Signal Processing Letters, 25(4):526–530, 2018.
[36] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep
convolutional encoder-decoder networks with symmetric skip connections. Advances
in neural information processing systems, 29, 2016.
[37] Takashi Masuko, Takao Kobayashi, Masatsune Tamura, Jun Masubuchi, and Keiichi Tokuda. Text-to-visual speech synthesis based on parameter generation from HMM. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), 6:3745–3748, 1998.
[38] Xianfeng Ou, Pengcheng Yan, Yiming Zhang, Bing Tu, Guoyun Zhang, Jianhui Wu, and Wujing Li. Moving object detection method via ResNet-18 with encoder–decoder structure in complex scenes. IEEE Access, 7:108152–108160, 2019.
[39] Karol J. Piczak. Environmental sound classification with convolutional neural networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2015.
[40] Prajwal K R, Rudrabha Mukhopadhyay, Vinay Namboodiri, and C. V. Jawahar. A
lip sync expert is all you need for speech to lip generation in the wild. Proceedings
of the 28th ACM International Conference on Multimedia, 2020.
[41] Shinji Sako, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura. HMM-based text-to-audio-visual speech synthesis. In INTERSPEECH, 2000.
[42] Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Transactions on Graphics (TOG), 33:1–13, 2014.
[43] Jose M. R. Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron C. Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. In ICLR, 2017.
[44] Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics (TOG), 36:1–13, 2017.
[45] Justus Thies, Mohamed A. Elgharib, Ayush Tewari, Christian Theobalt, and
Matthias Nießner. Neural voice puppetry: Audio-driven facial reenactment. ArXiv,
2020.
[46] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2387–2395, 2016.
[47] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. ArXiv, 2019.
[48] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
ArXiv, abs/1706.03762, 2017.
[49] Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Realistic speech-driven facial animation with GANs. International Journal of Computer Vision, 128:1398–1413, 2019.
[50] Gang Wang, Peng Zhang, Lei Xie, Wei Huang, and Yufei Zha. Attention-based lip
audio-visual synthesis for talking face generation in the wild. ArXiv, abs/2203.03984,
2022.
[51] Lijuan Wang, Wei Han, Frank K. Soong, and Qiang Huo. Text driven 3D photo-realistic talking head. In INTERSPEECH, 2011.
[52] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. ECA-Net: Efficient channel attention for deep convolutional neural networks. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11531–11539, 2020.
[53] Song Wang, Li Sun, Wei Fan, Jun Sun, Satoshi Naoi, Koichi Shirahata, Takuya Fukagai, Yasumoto Tomita, Atsushi Ike, and Tetsutaro Hashimoto. An automated CNN recommendation system for image classification tasks. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 283–288. IEEE, 2017.
[54] Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-
head synthesis for video conferencing. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2021.
[55] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402, 2003.
[56] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment:
from error visibility to structural similarity. IEEE Transactions on Image Processing,
13(4):600–612, 2004.
[57] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In ECCV, 2018.
[58] Yong Xu, Qiuqiang Kong, Wenwu Wang, and Mark D. Plumbley. Large-scale
weakly supervised audio classification using gated convolutional neural network. In
2018 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 121–125, 2018.
[59] Jimei Yang, Brian Price, Scott Cohen, Honglak Lee, and Ming-Hsuan Yang. Object
contour detection with a fully convolutional encoder-decoder network. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 193–202,
2016.
[60] Xinwei Yao, Ohad Fried, Kayvon Fatahalian, and Maneesh Agrawala. Iterative text-based editing of talking-heads using neural retargeting. ACM Transactions on Graphics (TOG), 40:1–14, 2021.
[61] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z. Li. S3FD: Single shot scale-invariant face detector. 2017 IEEE International Conference on Computer Vision (ICCV), pages 192–201, 2017.
[62] Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. Talking face generation by adversarially disentangled audio-visual representation. ArXiv, 2019.
[63] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.