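Abstract: | Reverberation is usually caused by sound reflecting off ceilings, walls, and floors, and it is present throughout the environments we live in. For listeners with normal hearing its effect is barely noticeable, but for users of hearing aids or other assistive listening devices it severely degrades the quality of speech reception, so that speech may be hard to understand even in a quiet environment. Existing conventional dereverberation methods can perform quite well, but they all require known environmental characteristics to suppress reverberation, which is hard to achieve in real environments. Deep learning is now developing rapidly: training a deep neural network (DNN) on a large amount of data yields the nonlinear relationship between input and output, which reduces the dependence on the environment that conventional methods have. This thesis uses TMHINT (Taiwan Mandarin Hearing in Noise Test) sentences previously recorded in our laboratory as the speech material and simulates reverberant speech under many different conditions for training (2160 sentences) and testing (480 sentences). The logarithmic power spectrogram (LPS) is extracted from the speech as the input feature for supervised learning with the deep neural network. The network architectures used in the experiments are the deep denoising autoencoder (DDAE) and the integrated deep and ensemble learning algorithm (IDEA). We compare their strengths and weaknesses and the results obtained when they are combined with other network architectures; network performance also varies with the training target. We also compare the mapping and masking approaches. To confirm the credibility of the comparison, we validate our results on the TIMIT corpus, which is widely used in speech research internationally. Finally, the results are assessed with perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) to find the most suitable network architecture and output target. The evaluation shows that combining either DDAE or IDEA with residual networks gives the best results (average PESQ of at least 2.2 and average STOI of at least 0.8), and that with a masking target DDAE clearly outperforms IDEA in both architecture and dereverberation capability.;Reverberation, generally caused by sound reflections from ceilings, floors, and walls, exists everywhere in the environment we live in. For listeners with normal hearing, the effect of reverberation is not obvious. However, for people who need hearing aids or other assistive hearing devices, reverberation significantly affects the quality of speech reception. Even in a noiseless environment, reverberation can leave people with hearing loss unable to hear clearly. Although traditional dereverberation approaches can show reasonably good performance, they still rely on knowledge of environmental characteristics, which is difficult to obtain in real environments. Nowadays, rapidly growing deep learning is a powerful tool that can be used for dereverberation. By using a large amount of data to train deep neural networks (DNNs), we can obtain the nonlinear relationship between input and output. 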
Compared with traditional methods, the DNN reduces the dependence on the environment and improves performance. In this thesis, sentences from TMHINT (Taiwan Mandarin Hearing in Noise Test), previously recorded by our research team, are chosen as the speech material for the experiments, and reverberant speech is simulated under different conditions for training (2160 sentences) and testing (480 sentences). The logarithmic power spectrum (LPS) is extracted from the speech as the input feature, and the DNN is trained by supervised learning. The neural network architectures used in the experiments are the deep denoising autoencoder (DDAE) and the integrated deep and ensemble learning algorithm (IDEA). This research compares their advantages and disadvantages and evaluates their combination with other network architectures. Different training targets for the same network are also compared in terms of performance, and the differences between the mapping and masking methods are evaluated. To verify the credibility of the comparison, we also ran experiments on the TIMIT corpus. Perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) are used to assess the results and to identify the most suitable network architecture and output target. The evaluation shows that DDAE combined with a residual network and IDEA combined with a residual network performed best among all methods (average PESQ of at least 2.2 and average STOI of at least 0.8). Furthermore, with a masking target, DDAE performed better than IDEA in terms of both architecture and dereverberation capability. |
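
The abstract names LPS input features and contrasts mapping with masking targets without giving the signal-processing details. The following is a minimal sketch of how such features and targets could be computed, not the thesis's actual pipeline. Everything specific in it is an assumption made for illustration: the frame length and hop size, the clipped magnitude-ratio mask used as the masking target, the synthetic exponentially decaying impulse response used to simulate reverberation, and all function names.

```python
import numpy as np

FRAME_LEN = 512   # assumed frame length in samples (not given in the abstract)
HOP = 256         # assumed hop size in samples (not given in the abstract)
EPS = 1e-10       # small constant that keeps log() and division finite


def power_spectrogram(x):
    """Hann-windowed framed real FFT; returns power with shape (frames, bins)."""
    window = np.hanning(FRAME_LEN)
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME_LEN] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def lps(x):
    """Logarithmic power spectrum (LPS) of a waveform."""
    return np.log(power_spectrogram(x) + EPS)


def mapping_target(clean):
    """Mapping: the network regresses the clean LPS directly."""
    return lps(clean)


def masking_target(clean, reverberant):
    """Masking: the network predicts a per-bin gain in [0, 1].

    A clipped magnitude-ratio mask is assumed here; the abstract does not
    say which mask definition the thesis uses.
    """
    ratio = power_spectrogram(clean) / (power_spectrogram(reverberant) + EPS)
    return np.clip(np.sqrt(ratio), 0.0, 1.0)


def simulate_reverb(clean, rir):
    """Convolve clean speech with a room impulse response, keeping the length."""
    return np.convolve(clean, rir)[:len(clean)]


# Hypothetical stand-ins: white noise for speech, decaying noise for the room.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
rir = rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800.0)
rir[0] = 1.0  # direct path
reverberant = simulate_reverb(clean, rir)

features = lps(reverberant)                        # network input
target_map = mapping_target(clean)                 # mapping training target
target_mask = masking_target(clean, reverberant)   # masking training target
print(features.shape, target_map.shape, target_mask.shape)
```

Under a mapping target the network output is itself the enhanced LPS, whereas under a masking target the predicted gain is multiplied with the reverberant magnitude spectrum before resynthesis, so the bounded output range is one plausible reason the two targets behave differently.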