基於窗注意力和信心融合的聽視覺語音辨識

DC field	Value	Language
DC.contributor	資訊工程學系	zh_TW
DC.creator	鍾程洋	zh_TW
DC.creator	Cheng-Yang Chung	en_US
dc.date.accessioned	2024-08-19T07:39:07Z
dc.date.available	2024-08-19T07:39:07Z
dc.date.issued	2024
dc.identifier.uri	http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=111522039
dc.contributor.department	資訊工程學系	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	Cocktail Party Effect是一種生物心理學上的現象，指的是當人處於嘈雜環境中，大腦能夠選擇性地專注於感興趣的聲音，並忽略其他背景噪音(例如人聲、冷氣聲及汽車喇叭聲等等)。這種自然的多模態感知能力使人類能夠在複雜的聲音環境中辨識和理解特定的語音訊息。在當今科技飛速發展的時代，多模態語音辨識技術成為人機交互界面中不可或缺的一環。由於單一模態的語音辨識系統在特定條件下可能面臨到一系列挑戰，例如嘈雜的環境、語速變化、以及無法辨識口型等問題。為了克服這些挑戰，近期許多研究主要探討多模態的聽視覺語音辨識。本論文「基於窗注意力和信心融合的聽視覺語音辨識」透過修改現有的多模態語音辨識模型架構，目的在於改進現有的融合方法，並且透過深度學習技術提升聽視覺語音辨識技術在高噪音環境下的強健性。我們透過修改 Attention 機制，使得模型能夠在計算注意力分數時也一併考量輸入的噪音程度，從而產生更強健的模態特定特徵表示。	zh_TW
dc.description.abstract	The Cocktail Party Effect is a phenomenon in biopsychology in which the brain, in a noisy environment, selectively focuses on sounds of interest while ignoring other background noise. This natural multimodal perception ability allows humans to recognize and understand specific speech information in complex auditory environments. In today's rapidly advancing technological era, multimodal speech recognition has become an indispensable part of human-computer interaction interfaces. Single-modal speech recognition systems face a series of challenges under certain conditions, such as noisy environments, varying speech rates, and the inability to recognize lip movements. These are the conditions in which the human brain relies on the Cocktail Party Effect to focus selectively on sounds of interest. To overcome these challenges, this thesis, titled "Enhancing Noise Robustness in Audio-Visual Speech Recognition with Window Attention and Confidence Mechanisms", aims to enhance the integration of audio and visual information by modifying an existing multimodal speech recognition model architecture. By utilizing deep learning techniques, this approach brings a new perspective to lip-reading and speech recognition technology. We modify the attention mechanism so that the model dynamically perceives the noise level of the input modality features, thereby generating more robust modality-specific feature representations.	en_US
DC.subject	聽視覺語音辨識	zh_TW
DC.subject	語音處理	zh_TW
DC.subject	多模態模型	zh_TW
DC.subject	Audio-Visual Speech Recognition	en_US
DC.subject	Speech processing	en_US
DC.subject	MultiModal	en_US
DC.title	基於窗注意力和信心融合的聽視覺語音辨識	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Audio-Visual Speech Recognition using Window Attention and Confidence Mechanism	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 111522039: full metadata record