Master's/Doctoral Thesis 106582611: Detailed Record




Name: Dang Thi Thuy An (鄧氏陲殷)    Department: Computer Science and Information Engineering
Thesis Title: Deep Neural Networks for Audio, Speech, and Image Applications (深度神經網路於音訊、語音和影像之研究)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ 波束形成與音訊前處理之嵌入式系統實現
★ 語音合成及語者轉換之應用與設計
★ 基於語意之輿情分析系統
★ 高品質口述系統之設計與應用
★ 深度學習及加速強健特徵之CT影像跟骨骨折辨識及偵測
★ 基於風格向量空間之個性化協同過濾服裝推薦系統
★ RetinaNet應用於人臉偵測
★ 金融商品走勢預測
★ 整合深度學習方法預測年齡以及衰老基因之研究
★ 漢語之端到端語音合成研究
★ 基於 ARM 架構上的 ORB-SLAM2 的應用與改進
★ 基於深度學習之指數股票型基金趨勢預測
★ 探討財經新聞與金融趨勢的相關性
★ 基於卷積神經網路的情緒語音分析
★ 運用深度學習方法預測阿茲海默症惡化與腦中風手術存活
Full text: permanently restricted (永不開放)
Abstract
This work aims to contribute to several problems in the field of artificial intelligence: speech emotion recognition (SER), acoustic scene classification (ASC), and content-based image retrieval (CBIR). These problems come from various domains and have many practical applications. For example, SER can be used in human-machine interaction and mental healthcare, while ASC helps a system understand its surrounding environment, which is useful for robot navigation, context awareness, and surveillance. CBIR identifies images in a database that are relevant to a given query image and can be used in many kinds of image search. In this thesis, we propose approaches based on deep neural networks (DNNs) to address these problems.
Specifically, we develop a simple yet effective data augmentation (DA) method for SER. SER is difficult because data are scarce and labels are ambiguous, so DNN models are prone to overfitting, which leads to poor generalization on test data. Our DA method creates new data samples that may be noisier or less ambiguous than the original ones, and in experiments on two public datasets it outperforms other DA methods. In ASC, we focus on the performance degradation that occurs when DNN models are used in a cross-device setting, where the training and test data are recorded with different devices. We propose an ASC system with two DA methods: MixStyleFreq, which reduces domain gaps, and spectrum correction, which mitigates the bias of DNNs toward dominant devices. These methods significantly improve generalization compared with other DA methods and achieve competitive results. Finally, we develop a fully end-to-end DNN model for beauty product image retrieval, a CBIR task. The model requires no manual feature aggregation or post-processing, and experimental results on the Perfect-500K dataset show its effectiveness, with high retrieval accuracy.
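Chapter 2 builds its DA method, EMix, on top of mixup-style sample mixing (Sections 2.2.1 and 2.2.2 in the table of contents below). This record does not spell out the EMix formulation, so the following is only a minimal sketch of plain mixup applied to a batch of log-mel spectrograms with one-hot emotion labels; the function name, tensor shapes, and alpha value are illustrative assumptions, not the thesis's implementation.

```python
# Minimal sketch of mixup-style augmentation for SER features
# (not the EMix method itself; shapes and names are assumptions).
import numpy as np

def mixup_batch(features, labels, alpha=0.2, rng=None):
    """Mix a batch of log-mel spectrograms and one-hot emotion labels.

    features: (batch, mels, frames) float array
    labels:   (batch, num_classes) one-hot float array
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # mixing coefficient sampled per batch
    perm = rng.permutation(len(features))   # random partner for each sample
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_x, mixed_y

# Usage example: augment a random batch of 8 utterances
# (64 mel bins, 300 frames, 4 emotion classes).
x = np.random.randn(8, 64, 300).astype(np.float32)
y = np.eye(4, dtype=np.float32)[np.random.randint(0, 4, size=8)]
x_aug, y_aug = mixup_batch(x, y, alpha=0.2)
```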
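Spectrum correction in Chapter 3 is mentioned here only by name; in the cross-device ASC literature it generally means aligning each device's long-term frequency response with that of a reference device. The sketch below illustrates that general idea with per-frequency gain factors; the estimation details, smoothing, and all names and shapes are assumptions and may differ from the thesis's procedure.

```python
# Rough sketch of spectrum correction for cross-device ASC: scale each
# frequency bin of a recording so that the device's long-term average
# spectrum matches a reference device.
import numpy as np

def device_mean_spectrum(spectrograms):
    """Average magnitude per frequency bin over all recordings of one device.
    spectrograms: list of (freq_bins, frames) magnitude arrays."""
    return np.mean([s.mean(axis=1) for s in spectrograms], axis=0)

def correction_coefficients(ref_spectrum, dev_spectrum, eps=1e-8):
    """Per-bin gain that maps a device's average spectrum onto the reference."""
    return ref_spectrum / (dev_spectrum + eps)

def apply_correction(spectrogram, coeffs):
    """Apply per-frequency gains to a (freq_bins, frames) magnitude spectrogram."""
    return spectrogram * coeffs[:, None]

# Usage example with synthetic data: device "a" is the reference,
# recordings from device "b" are corrected toward it.
rng = np.random.default_rng(0)
recs_a = [np.abs(rng.standard_normal((256, 400))) for _ in range(5)]
recs_b = [np.abs(rng.standard_normal((256, 400))) * 0.5 for _ in range(5)]
coeffs = correction_coefficients(device_mean_spectrum(recs_a),
                                 device_mean_spectrum(recs_b))
corrected = apply_correction(recs_b[0], coeffs)
```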
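The end-to-end CBIR model of Chapter 4 is likewise not detailed in this record. As a generic illustration of embedding-based retrieval (encode images with a network, then rank the database by cosine similarity to the query), the PyTorch sketch below uses a toy encoder; the architecture, embedding size, and input resolution are assumptions rather than the thesis's actual network.

```python
# Illustrative sketch of embedding-based image retrieval; not the thesis's
# architecture. A small CNN maps images to L2-normalised embeddings and the
# database is ranked by cosine similarity to the query embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global average pooling
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):
        z = self.conv(x).flatten(1)
        return F.normalize(self.fc(z), dim=1)    # unit-length embedding

@torch.no_grad()
def retrieve(encoder, query, database, top_k=5):
    """Return indices of the top_k database images most similar to the query."""
    q = encoder(query)                           # (1, d)
    db = encoder(database)                       # (n, d)
    scores = q @ db.t()                          # cosine similarity (1, n)
    return scores.topk(top_k, dim=1).indices.squeeze(0)

# Usage example with random tensors standing in for product images.
encoder = ToyEncoder().eval()
query = torch.randn(1, 3, 224, 224)
database = torch.randn(100, 3, 224, 224)
print(retrieve(encoder, query, database, top_k=5))
```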
Keywords
★ EMix
★ Speech Emotion Recognition
★ Acoustic Scene Classification
★ MixStyleFreq
★ Image Retrieval
★ Content-Based Image Retrieval
★ Beauty Product Image Retrieval
Table of Contents
Abstract vi
Acknowledgements vii
Table of Contents vii
List of Figures x
List of Tables xi
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions and Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 EMix: A Data Augmentation Method for Speech Emotion Recognition 5
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Our contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Mixup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 EMix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Network and training details . . . . . . . . . . . . . . . . . . . . . 13
Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Deep Convolutional Neural Networks with Multiple Data Augmentation Methods for Speech Emotion Recognition . . . . . . 16
2.5.1 The proposed SER system . . . . . . . . . . . . . . . . . . . . . . . 16
Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Experimental settings . . . . . . . . . . . . . . . . . . . . . . . . . 18
Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Acoustic Scene Classification with Multiple Devices 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Proposed methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.1 MixStyleFreq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.2 Spectrum Correction . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.3 Network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 Learning to Remember Beauty Products 34
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 Our contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 Network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.3 Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.4 Training procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.2 Training details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5 Conclusions 44
References 46
Advisor: Jia-Ching Wang (王家慶)    Date of Approval: 2023-02-23
