References
[1] Triantafyllos Afouras, Joon Son Chung, Andrew W. Senior, Oriol Vinyals, and Andrew Zisserman. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[2] Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. LRS3-TED: A large-scale dataset for visual speech recognition. ArXiv, abs/1809.00496, 2018.
[3] Hyeon-woo An and Nammee Moon. Design of recommendation system for tourist spot using sentiment analysis based on CNN-LSTM. Journal of Ambient Intelligence and Humanized Computing, pages 1–11, 2019.
[4] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:2481–2495, 2017.
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2015.
[6] John Brandon. New survey says we’re spending 7 hours per day consuming online media. https://www.forbes.com/sites/johnbbrandon/2020/11/17/new-survey-says-were-spending-7-hours-per-day-consuming-online-media/?sh=150a8f416b46. Accessed: 2022-03-22.
[7] Lele Chen, Ross Maddox, Zhiyao Duan, and Chenliang Xu. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[8] Lilin Cheng, Suzhe Wang, Zhimeng Zhang, Yu Ding, Yixing Zheng, Xin Yu, and
Changjie Fan. Write-a-speaker: Text-based emotional and rhythmic talking-head
generation. In AAAI, 2021.
[9] Joon Son Chung and Andrew Zisserman. Out of time: Automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV, 2016.
[10] Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. You said that? ArXiv,
2017.
[11] Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Lip reading sentences in the wild. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua
Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold,
Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words:
Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2021.
[13] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619, 2018.
[14] Ohad Fried, Ayush Tewari, Michael Zollhöfer, Adam Finkelstein, Eli Shechtman, Dan B. Goldman, Kyle Genova, Zeyu Jin, Christian Theobalt, and Maneesh Agrawala. Text-based editing of talking-head video. ACM Transactions on Graphics (TOG), 38:1–14, 2019.
[15] Matthias Funk. How many YouTube channels are there? https://www.tubics.com/blog/number-of-youtube-channels. Accessed: 2022-07-01.
[16] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets.
In NIPS, 2014.
[17] Md Rashidul Hasan, Mustafa Jamil, MGRMS Rahman, et al. Speaker identification using Mel frequency cepstral coefficients. Variations, 1(4):565–568, 2004.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[19] Qibin Hou, Daquan Zhou, and Jiashi Feng. Coordinate attention for efficient mobile network design. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13708–13717, 2021.
[20] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[21] Viktor Igeland. Generating facial animation with emotions in a neural text-to-speech pipeline. Master's thesis, Linköping University, 2019.
[22] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image
translation with conditional adversarial networks. CoRR, 2016.
[23] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, 2017.
[24] Amir Jamaludin, Joon Son Chung, and Andrew Zisserman. You said that?: Synthesising talking faces from audio. International Journal of Computer Vision, pages 1–13, 2019.
[25] Prajwal K R, Rudrabha Mukhopadhyay, Jerin Philip, Abhishek Jha, Vinay Namboodiri, and C. V. Jawahar. Towards automatic face-to-face translation. Proceedings of the 27th ACM International Conference on Multimedia, 2019.
[26] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural
network for modelling sentences. In ACL, 2014.
[27] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Proc. NeurIPS, 2021.
[28] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and
Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proc.
CVPR, 2020.
[29] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. Deep video portraits. ACM Transactions on Graphics (TOG), 37(4):163, 2018.
[30] Yoon Kim. Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014.
[31] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
CoRR, 2015.
[32] Rithesh Kumar, Jose M. R. Sotelo, Kundan Kumar, Alexandre de Brébisson, and Yoshua Bengio. ObamaNet: Photo-realistic lip-sync from text. ArXiv, 2018.
[33] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 105–114, 2017.
[34] Jinglin Liu, Zhiying Zhu, Yi Ren, Wencan Huang, Baoxing Huai, Nicholas Jing Yuan, and Zhou Zhao. Parallel and high-fidelity text-to-lip generation. In AAAI, 2021.
[35] Ze Lu, Xudong Jiang, and Alex Kot. Deep coupled resnet for low-resolution face
recognition. IEEE Signal Processing Letters, 25(4):526–530, 2018.
[36] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep
convolutional encoder-decoder networks with symmetric skip connections. Advances
in neural information processing systems, 29, 2016.
[37] Takashi Masuko, Takao Kobayashi, Masatsune Tamura, Jun Masubuchi, and Keiichi Tokuda. Text-to-visual speech synthesis based on parameter generation from HMM. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), 6:3745–3748, 1998.
[38] Xianfeng Ou, Pengcheng Yan, Yiming Zhang, Bing Tu, Guoyun Zhang, Jianhui Wu, and Wujing Li. Moving object detection method via ResNet-18 with encoder–decoder structure in complex scenes. IEEE Access, 7:108152–108160, 2019.
[39] Karol J. Piczak. Environmental sound classification with convolutional neural networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2015.
[40] Prajwal K R, Rudrabha Mukhopadhyay, Vinay Namboodiri, and C. V. Jawahar. A
lip sync expert is all you need for speech to lip generation in the wild. Proceedings
of the 28th ACM International Conference on Multimedia, 2020.
[41] Shinji Sako, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura. HMM-based text-to-audio-visual speech synthesis. In INTERSPEECH, 2000.
[42] Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Transactions on Graphics (TOG), 33:1–13, 2014.
[43] Jose M. R. Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron C. Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. In ICLR, 2017.
[44] Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics (TOG), 36:1–13, 2017.
[45] Justus Thies, Mohamed A. Elgharib, Ayush Tewari, Christian Theobalt, and
Matthias Nießner. Neural voice puppetry: Audio-driven facial reenactment. ArXiv,
2020.
[46] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2387–2395, 2016.
[47] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. ArXiv, 2019.
[48] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
ArXiv, abs/1706.03762, 2017.
[49] Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Realistic speech-driven facial animation with GANs. International Journal of Computer Vision, 128:1398–1413, 2019.
[50] Gang Wang, Peng Zhang, Lei Xie, Wei Huang, and Yufei Zha. Attention-based lip
audio-visual synthesis for talking face generation in the wild. ArXiv, abs/2203.03984,
2022.
[51] Lijuan Wang, Wei Han, Frank K. Soong, and Qiang Huo. Text driven 3D photo-realistic talking head. In INTERSPEECH, 2011.
[52] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. ECA-Net: Efficient channel attention for deep convolutional neural networks. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11531–11539, 2020.
[53] Song Wang, Li Sun, Wei Fan, Jun Sun, Satoshi Naoi, Koichi Shirahata, Takuya Fukagai, Yasumoto Tomita, Atsushi Ike, and Tetsutaro Hashimoto. An automated CNN recommendation system for image classification tasks. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 283–288. IEEE, 2017.
[54] Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-
head synthesis for video conferencing. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2021.
[55] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402, 2003.
[56] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment:
from error visibility to structural similarity. IEEE Transactions on Image Processing,
13(4):600–612, 2004.
[57] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In ECCV, 2018.
[58] Yong Xu, Qiuqiang Kong, Wenwu Wang, and Mark D. Plumbley. Large-scale
weakly supervised audio classification using gated convolutional neural network. In
2018 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 121–125, 2018.
[59] Jimei Yang, Brian Price, Scott Cohen, Honglak Lee, and Ming-Hsuan Yang. Object
contour detection with a fully convolutional encoder-decoder network. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 193–202,
2016.
[60] Xinwei Yao, Ohad Fried, Kayvon Fatahalian, and Maneesh Agrawala. Iterative text-based editing of talking-heads using neural retargeting. ACM Transactions on Graphics (TOG), 40:1–14, 2021.
[61] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z. Li. S3FD: Single shot scale-invariant face detector. 2017 IEEE International Conference on Computer Vision (ICCV), pages 192–201, 2017.
[62] Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. Talking face generation by adversarially disentangled audio-visual representation. ArXiv, 2019.
[63] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.