Name: Cheng-Yang Chung (鍾程洋)
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Thesis Title: 基於窗注意力和信心融合的聽視覺語音辨識 (Audio-Visual Speech Recognition using Window Attention and Confidence Mechanism)
Full Text: permanently restricted (永不開放)
Abstract (Chinese): The Cocktail Party Effect is a phenomenon in biopsychology: in a noisy environment, the brain can selectively focus on sounds of interest while ignoring other background noise (such as other voices, air-conditioning hum, and car horns). This natural multimodal perception ability lets humans recognize and understand specific speech in complex acoustic environments. In today's era of rapid technological progress, multimodal speech recognition has become an indispensable part of human-computer interfaces, since single-modality speech recognition systems can face a series of challenges under certain conditions, such as noisy environments, varying speech rates, and the inability to read lip movements. To overcome these challenges, much recent research has focused on multimodal audio-visual speech recognition. This thesis, "Audio-Visual Speech Recognition using Window Attention and Confidence Mechanism," modifies an existing multimodal speech recognition architecture, aiming to improve the existing fusion method and, through deep learning techniques, to strengthen the robustness of audio-visual speech recognition in high-noise environments. We modify the attention mechanism so that the model also accounts for the noise level of the input when computing attention scores, thereby producing more robust modality-specific feature representations.
Abstract (English): The Cocktail Party Effect is a phenomenon in biopsychology where the brain can
selectively focus on sounds of interest while ignoring other background noise in noisy
environments. This natural multimodal perception ability allows us to effectively recognize and
understand specific speech information in complex auditory environments. In today's rapidly
advancing technological era, multimodal speech recognition technology has become an
indispensable part of human-computer interaction interfaces. Single-modal speech recognition
systems face a series of challenges under certain conditions, such as noisy environments,
varying speech rates, and the inability to recognize lip movements. These are precisely the
conditions that the human Cocktail Party Effect handles with ease. To overcome these
challenges, this thesis, titled "Audio-Visual Speech Recognition using Window Attention and
Confidence Mechanism," aims to enhance the
integration of audio and visual information by modifying the existing multimodal speech
recognition model architecture. By utilizing deep learning techniques, this approach brings a
new perspective to lip-reading and speech recognition technology. We have modified the
attention mechanism to enable the model to dynamically perceive the noise level of input
modality features, thereby generating more robust modality-specific feature representations.
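The abstract's two key ideas, noise-aware attention scoring and confidence-weighted fusion of the audio and visual streams, can be pictured concretely. The PyTorch sketch below is only an illustration of how such mechanisms might look: the module names (NoiseWeightPredictor, noise_aware_attention, confidence_fusion), the log-bias formulation, and the softmax fusion are hypothetical assumptions, not the thesis's actual architecture, which is detailed in Chapter 3.

```python
# Hypothetical sketch of noise-aware attention and confidence fusion.
# Names and formulations are illustrative, not the thesis's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseWeightPredictor(nn.Module):
    """Predicts a per-frame reliability weight in (0, 1) from features
    (1 = clean frame, 0 = heavily corrupted frame)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                  nn.Linear(dim // 2, 1), nn.Sigmoid())

    def forward(self, x):                # x: (batch, time, dim)
        return self.proj(x).squeeze(-1)  # (batch, time)

def noise_aware_attention(q, k, v, noise_weight):
    """Scaled dot-product attention whose scores are down-weighted for
    key frames the predictor judges to be noisy."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (batch, Tq, Tk)
    # Adding log(w) before the softmax multiplies each key frame's
    # attention mass by its reliability weight w.
    scores = scores + torch.log(noise_weight.unsqueeze(1) + 1e-6)
    return F.softmax(scores, dim=-1) @ v                  # (batch, Tq, dim)

def confidence_fusion(audio_feat, visual_feat, audio_conf, visual_conf):
    """Convex combination of the two modality streams, weighted per frame
    by each stream's confidence."""
    w = F.softmax(torch.stack([audio_conf, visual_conf], dim=-1), dim=-1)
    return w[..., 0:1] * audio_feat + w[..., 1:2] * visual_feat

if __name__ == "__main__":
    B, T, D = 2, 50, 256
    audio, visual = torch.randn(B, T, D), torch.randn(B, T, D)
    predictor = NoiseWeightPredictor(D)
    w_audio = predictor(audio)           # per-frame audio reliability
    attended = noise_aware_attention(audio, audio, audio, w_audio)
    fused = confidence_fusion(attended, visual, w_audio, predictor(visual))
    print(fused.shape)                   # torch.Size([2, 50, 256])
```

Biasing the pre-softmax scores by the log of a (0, 1) weight is one standard way to down-weight attention on unreliable frames, since softmax(s + log w) is proportional to w * exp(s).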
Keywords (Chinese):
★ Audio-visual speech recognition
★ Speech processing
★ Multimodal model
Keywords (English):
★ Audio-Visual Speech Recognition
★ Speech processing
★ Multimodal
Table of Contents:
Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Background
1.2 Motivation and Objectives
1.3 Methodology and Chapter Overview
Chapter 2: Related Work
2.1 Recurrent Neural Networks (RNNs)
2.1.1 Long Short-Term Memory (LSTM)
2.2 Attention Mechanism
2.2.1 Self-Attention Algorithm
2.2.2 Transformer Model
2.2.3 Positional Encoding
2.3 Hidden Unit BERT (HuBERT) Model
2.3.1 HuBERT Pre-training
2.3.2 HuBERT Experimental Results
2.4 Audio-Visual Hidden Unit BERT (AV-HuBERT)
2.4.1 AV-HuBERT Data Preprocessing
2.4.2 AV-HuBERT Pre-training Method
2.4.3 AV-HuBERT Experimental Results
2.5 Modality-Invariant Representation GAN (MIR-GAN)
2.5.1 MIR-GAN Model Architecture
2.5.2 MIR-GAN Experimental Results
2.6 Connectionist Temporal Classification Loss (CTC Loss)
2.6.1 CTC Algorithm
2.6.2 Decoding Functions
2.7 Sequence-to-Sequence Loss (Seq2Seq Loss)
Chapter 3: Audio-Visual Speech Recognition Model Based on Window Attention and Confidence Fusion
3.1 Feature Noise Weight Prediction Network
3.2 Window Attention Mechanism
3.3 Confidence-Based Feature Fusion Method
Chapter 4: Experimental Results and Discussion
4.1 Experimental Setup
4.2 Datasets
4.2.1 VoxCeleb2
4.2.2 LRS3
4.2.3 MUSAN
4.3 Experiments and Discussion
4.3.1 Ablation Study
4.3.2 Unseen Noise
4.3.3 Analysis of the Feature Weight Prediction Network
Chapter 5: Conclusion and Future Directions
Chapter 6: References
References:
[1] Cherry, E. C. "Some experiments on the recognition of speech, with one and with two
ears," *Journal of the Acoustical Society of America*, vol. 25, pp. 975-979, 1953.
[2] Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm
for deep belief nets." Neural computation 18.7 (2006): 1527-1554.
[3] Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of
data with neural networks." science 313.5786 (2006): 504-507.
[4] LeCun, Yann, et al. "Backpropagation applied to handwritten zip code recognition."
Neural computation 1.4 (1989): 541-551.
[5] Graves, Alex. "Long short-term memory." Supervised Sequence Labelling with
Recurrent Neural Networks (2012): 37-45.
[6] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine
translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473
(2014).
[7] Rabiner, Lawrence R. "A tutorial on hidden Markov models and selected applications
in speech recognition." Proceedings of the IEEE 77.2 (1989): 257-286.
[8] Reynolds, Douglas A. "Gaussian mixture models." Encyclopedia of Biometrics
(2009): 659-663.
[9] Jelinek, Frederick. "Statistical methods for speech recognition". MIT press, 1998.
[10] Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. "Speech recognition
with deep recurrent neural networks." 2013 IEEE international conference on
acoustics, speech and signal processing. IEEE, 2013.
[11] Baevski, Alexei, et al. "wav2vec 2.0: A framework for self-supervised learning of
speech representations." Advances in neural information processing systems 33 (2020):
12449-12460.
[12] Hsu, Wei-Ning, et al. "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units." IEEE/ACM transactions on audio, speech, and
language processing 29 (2021): 3451-3460.
[13] Hu, Yuchen, et al. "MIR-GAN: Refining frame-level modality-invariant
representations with adversarial network for audio-visual speech recognition." arXiv
preprint arXiv:2306.10567 (2023).
[14] Shi, Bowen, et al. "Learning audio-visual speech representation by masked
multimodal cluster prediction." arXiv preprint arXiv:2201.02184 (2022).
[15] Shi, Bowen, Wei-Ning Hsu, and Abdelrahman Mohamed. "Robust self-supervised
audio-visual speech recognition." arXiv preprint arXiv:2201.01763 (2022).
[16] Chen, Chen, et al. "Leveraging modality-specific representations for audio-visual
speech recognition via reinforcement learning." Proceedings of the AAAI Conference
on Artificial Intelligence. Vol. 37. No. 11. 2023.
[17] Shaw, Peter, Jakob Uszkoreit, and Ashish Vaswani. "Self-attention with relative
position representations." arXiv preprint arXiv:1803.02155 (2018).
[18] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information
processing systems 30 (2017).
[19] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for
language understanding." arXiv preprint arXiv:1810.04805 (2018).
[20] Zhang, Jiaxing, et al. "Fengshenbang 1.0: Being the foundation of Chinese cognitive
intelligence." arXiv preprint arXiv:2209.02970 (2022).
[21] Graves, Alex, et al. "Connectionist temporal classification: labelling unsegmented
sequence data with recurrent neural networks." Proceedings of the 23rd international
conference on Machine learning. 2006.
[22] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with
neural networks." Advances in neural information processing systems 27 (2014).
[23] Chung, Joon Son, Arsha Nagrani, and Andrew Zisserman. "Voxceleb2: Deep speaker
recognition." arXiv preprint arXiv:1806.05622 (2018).
[24] Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "LRS3-TED: a
large-scale dataset for visual speech recognition." arXiv preprint
arXiv:1809.00496 (2018).
[25] Snyder, David, Guoguo Chen, and Daniel Povey. "Musan: A music, speech, and noise
corpus." arXiv preprint arXiv:1510.08484 (2015).
[26] Shannon, Claude Elwood. "A mathematical theory of communication." ACM
SIGMOBILE mobile computing and communications review 5.1 (2001): 3-55.
Advisor: Jia-Ching Wang (王家慶)
Review Date: 2024-08-19