Name |
鄧氏陲殷 (Dang Thi Thuy An)
Department |
Department of Computer Science and Information Engineering |
Thesis Title |
Deep Neural Networks for Audio, Speech, and Image Applications
|
File |
Full text: browse via the thesis system only (permanently restricted from public release)
|
Abstract |
The work aims to contribute to the development of several problems in the field of artificial intelligence, including speech emotion recognition (SER), acoustic scene classification (ASC), and content-based image retrieval (CBIR). These problems come from various domains and have many practical applications. For example, SER can be used in human-machine interaction and mental healthcare, while ASC helps to understand the surrounding environment, which is useful for robot navigation, context awareness, and surveillance applications. CBIR involves identifying relevant images in a database based on a given query image, and can be used in various types of image search. In this thesis, we propose approaches using deep neural networks (DNNs) to address these problems.
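To make the CBIR setting concrete: a retrieval system typically embeds every image with a DNN encoder and ranks database images by similarity to the query embedding. The sketch below illustrates only this generic ranking step with hand-made toy embeddings — it is not the thesis's model, whose architecture is detailed in Chapter 4.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query, database, top_k=3):
    """Rank database embeddings by similarity to the query embedding.
    In a real CBIR system the embeddings would come from a trained DNN encoder."""
    ranked = sorted(range(len(database)),
                    key=lambda i: cosine(query, database[i]),
                    reverse=True)
    return ranked[:top_k]

# Toy example: four "images" represented by 3-d embeddings.
db = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.9, 0.1, 0.0],
      [0.0, 0.0, 1.0]]
print(retrieve([1.0, 0.05, 0.0], db, top_k=2))  # → [0, 2]
```

The two database embeddings pointing in nearly the same direction as the query rank first; everything downstream (indexing, re-ranking, the encoder itself) is where systems like the one in Chapter 4 differ.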
Specifically, we develop a simple yet effective data augmentation (DA) method for the SER problem. SER is difficult due to the scarcity of data and ambiguity of labels, and DNN models are prone to overfitting, which can lead to poor generalization on test data. Our DA method creates new data samples that may be noisier or less ambiguous than the original ones, and in our experiments with two public datasets, it demonstrates superiority over other DA methods. In ASC, we focus on the problem of performance degradation when DNN models are used in a cross-device setting, where the train and test data are recorded using different devices. We propose an ASC system with two DA methods: MixStyleFreq to reduce domain gaps, and spectrum correction to mitigate the bias of DNNs toward dominant devices. These methods significantly improve the generalization performance compared to other DA methods and achieve competitive results. Finally, we develop a fully end-to-end DNN model for the beauty product image retrieval problem in CBIR. This model requires no manual feature aggregation or post-processing, and experimental results on the Perfect-500K dataset show its effectiveness with high retrieval accuracy.
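This record does not give EMix's formula. As a reference point, the following is a minimal sketch of mixup (Zhang et al.), the interpolation-based baseline that Section 2.2 reviews before introducing EMix: it creates new training samples by convexly blending two waveforms and their one-hot labels.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two samples and their one-hot labels (mixup, Zhang et al.).
    The mixing weight lam is drawn from a Beta(alpha, alpha) distribution."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Toy example: two 4-sample "waveforms" with one-hot emotion labels.
xa, ya = [0.1, 0.2, 0.3, 0.4], [1.0, 0.0]   # e.g. "happy"
xb, yb = [0.4, 0.3, 0.2, 0.1], [0.0, 1.0]   # e.g. "sad"
x_new, y_new = mixup(xa, ya, xb, yb)
```

The resulting label is soft (it sums to 1 but sits between the two classes), which is why interpolation-based DA is attractive for SER, where emotion labels are themselves ambiguous; how EMix modifies this recipe is described in Section 2.2.2 of the thesis.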
|
Keywords |
★ EMix ★ Speech Emotion Recognition ★ Acoustic Scene Classification ★ MixStyleFreq ★ Image Retrieval ★ Content-Based Image Retrieval ★ Beauty Product Image Retrieval |
Table of Contents |
Abstract vi
Acknowledgements vii
Table of Contents vii
List of Figures x
List of Tables xi
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions and Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 EMix: A Data Augmentation Method for Speech Emotion Recognition 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Our contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Mixup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 EMix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Network and training details . . . . . . . . . . . . . . . . . . . . . 13
Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Deep Convolutional Neural Networks with Multiple Data Augmentation Methods for Speech Emotion Recognition . . . . . . . . . . . . . . . . . . . 16
2.5.1 The proposed SER system . . . . . . . . . . . . . . . . . . . . . . . 16
Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Experimental settings . . . . . . . . . . . . . . . . . . . . . . . . . 18
Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Acoustic Scene Classification with Multiple Devices 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Proposed methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.1 MixStyleFreq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.2 Spectrum Correction . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.3 Network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 Learning to Remember Beauty Products 34
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 Our contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 Network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.3 Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.4 Training procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.2 Training details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5 Conclusions 44
References 46 |
References |
[1] David E Rumelhart, Geoffrey E Hinton, and Ronald J. Williams. “Learning representations by back-propagating errors”. In: Nature 323.6088 (1986), pp. 533–536.
[2] Kyo-Joong Oh et al. “A chatbot for psychiatric counseling in mental healthcare service based on emotional dialogue analysis and sentence generation”. In: 2017 18th IEEE International Conference on Mobile Data Management (MDM). IEEE. 2017, pp. 371–375.
[3] M Shamim Hossain et al. “Audio–visual emotion-aware cloud gaming framework”. In: IEEE Transactions on Circuits and Systems for Video Technology 25.12 (2015), pp. 2105–2118.
[4] Hans-Jörg Vögel et al. “Emotion-awareness for intelligent vehicle assistants: A research agenda”. In: Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems. 2018, pp. 11–15.
[5] Carlos Busso et al. “IEMOCAP: Interactive emotional dyadic motion capture database”. In: Language resources and evaluation 42.4 (2008), pp. 335–359.
[6] Raghavendra Pappagari et al. “Copypaste: An augmentation method for speech emotion recognition”. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 6324–6328.
[7] Jiaxing Liu et al. “Speech emotion recognition with local-global aware deep representation learning”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 7174–7178.
[8] Caroline Etienne et al. “CNN+LSTM architecture for speech emotion recognition with data augmentation”. In: arXiv preprint arXiv:1802.05630 (2018).
[9] Nicolae-Catalin Ristea and Radu Tudor Ionescu. “Self-paced ensemble learning for speech and audio classification”. In: arXiv preprint arXiv:2103.11988 (2021).
[10] Anish Nediyanchath, Periyasamy Paramasivam, and Promod Yenigalla. “Multi-head attention for speech emotion recognition with auxiliary learning of gender recognition”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 7179–7183.
[11] Mingke Xu, Fan Zhang, and Wei Zhang. “Head fusion: Improving the accuracy and robustness of speech emotion recognition on the IEMOCAP and RAVDESS dataset”. In: IEEE Access 9 (2021), pp. 74539–74549.
[12] Siddique Latif et al. “Direct modelling of speech emotion from raw speech”. In: arXiv preprint arXiv:1904.03833 (2019).
[13] Navdeep Jaitly and Geoffrey E Hinton. “Vocal tract length perturbation (VTLP) improves speech recognition”. In: Proc. ICML Workshop on Deep Learning for Audio, Speech and Language. Vol. 117. 2013, p. 21.
[14] Siddique Latif et al. “Augmenting generative adversarial networks for speech emotion recognition”. In: arXiv preprint arXiv:2005.08447 (2020).
[15] Takuya Fujioka et al. “Addressing ambiguity of emotion labels through meta learning”. In: arXiv preprint arXiv:1911.02216 (2019).
[16] Yifei Yin et al. “Progressive co-teaching for ambiguous speech emotion recognition”. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 6264–6268.
[17] Houwei Cao et al. “Crema-d: Crowd-sourced emotional multimodal actors dataset”. In: IEEE transactions on affective computing 5.4 (2014), pp. 377–390.
[18] Hongyi Zhang et al. “mixup: Beyond empirical risk minimization”. In: arXiv preprint arXiv:1710.09412 (2017).
[19] Hao Yu, Huanyu Wang, and Jianxin Wu. “Mixup without hesitation”. In: International Conference on Image and Graphics. Springer. 2021, pp. 143–154.
[20] Ashish Vaswani et al. “Attention is all you need”. In: Advances in neural information processing systems 30 (2017).
[21] Dan Hendrycks and Kevin Gimpel. “Gaussian error linear units (gelus)”. In: arXiv preprint arXiv:1606.08415 (2016).
[22] Qiang Wang et al. “Learning deep transformer models for machine translation”. In: arXiv preprint arXiv:1906.01787 (2019).
[23] Yuan Gao et al. “Domain-adversarial autoencoder with attention based feature level fusion for speech emotion recognition”. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 6314–6318.
[24] Linh Vu et al. “Improved speech emotion recognition based on music-related audio features”. In: 2022 30th European Signal Processing Conference (EUSIPCO). IEEE. 2022, pp. 120–124.
[25] Florinel-Alin Croitoru et al. “LeRaC: Learning Rate Curriculum”. In: arXiv preprint arXiv:2205.09180 (2022).
[26] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: International conference on machine learning. PMLR. 2015, pp. 448–456.
[27] Abien Fred Agarap. “Deep learning using rectified linear units (relu)”. In: arXiv preprint arXiv:1803.08375 (2018).
[28] Gao Huang et al. “Densely connected convolutional networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 4700–4708. |
Advisor |
王家慶 (Jia-Ching Wang)
|
Date of Approval |
2023-02-23 |