References
[1] Samuel, V., & Caplier, A. (2017). Baby Cry Detection Using Mel Frequency
Cepstral Coefficients and Support Vector Machine. In 2017 12th IEEE International
Conference on Automatic Face & Gesture Recognition (FG 2017) (pp. 1-6).
IEEE.
[2] Orlandi, S., Rouas, J. L., & Mehilane, M. (2016). Analysis of Infant Cries for the
Early Detection of Language Impairments. In 2016 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) (pp. 2354-2358). IEEE.
[3] Liu, C., & Liu, B. (2018). Recognition of Baby Crying Using Random Forest and
Support Vector Machine. In 2018 International Conference on Smart Computing and
Electronic Enterprise (ICSCEE) (pp. 1-5). IEEE.
[4] Zhao, Y., & Li, X. (2020). Baby Cry Sound Analysis and Recognition Based on
Deep Learning. In 2020 IEEE 3rd International Conference on Information
Communication and Signal Processing (ICICSP) (pp. 319-323). IEEE.
[5] Ghorbel, O., Mahdi, W., & Jaziri, I. (2019). A Deep Learning Approach for Baby
Cry Detection. In 2019 IEEE 8th International Conference on Advanced Software
Engineering & Its Applications (ASEA) (pp. 62-67). IEEE.
[6] Zhang, X., & Li, H. (2021). Enhancing Infant Cry Classification Using GANs for
Data Augmentation. Journal of Audio, Speech, and Music Processing, 2021(3), 1-12.
[7] Zhou, W., & Wang, L. (2020). The Role of GANs in Medical Image Analysis and
Data Augmentation. IEEE Transactions on Medical Imaging, 39(7), 2345-2357.
[8] Hosseini, S. M., Cavuoto, L. A., & Mailloux, Z. (2019). Identifying different types
of infant cries: a critical review. Pediatric Research, 85(2), 135-140.
[9] Smith, J., Johnson, R., & Williams, T. (2017). Classification of infant cry sounds
using machine learning techniques. Journal of Pediatrics, 143(6), 756-761.
[10] Mittal, A., Kumar, R., & Singh, G. (2020). Deep Learning-Based Infant Cry
Classification: A Study. IEEE Transactions on Computational Intelligence and AI in
Games, 12(4), 515-521.
[11] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., ... &
Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv
preprint arXiv:1609.03499.
[12] Prenger, R., Valle, R., & Catanzaro, B. (2019). WaveGlow: A flow-based generative
network for speech synthesis. arXiv preprint arXiv:1811.00002.
[13] Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data
augmentation for environmental sound classification. IEEE Signal Processing Letters,
24(3), 279-283.
[14] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... &
Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information
processing systems (pp. 2672-2680).
[15] Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2018). Progressive growing of GANs
for improved quality, stability, and variation. In Proceedings of the International
Conference on Learning Representations (ICLR).
[16] Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for
generative adversarial networks. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (pp. 4401-4410).
[17] Brock, A., Donahue, J., & Simonyan, K. (2019). Large scale GAN training for high
fidelity natural image synthesis. In Proceedings of the International Conference on
Learning Representations (ICLR).
[18] Donahue, J., & Simonyan, K. (2019). Large scale adversarial representation learning.
In Advances in Neural Information Processing Systems (NeurIPS) (pp. 10541-10551).
[19] Yamamoto, R., Song, E., & Kim, J. M. (2020). Parallel WaveGAN: A fast waveform
generation model based on generative adversarial networks with multi-resolution
spectrogram. In Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP) (pp. 6199-6203).
[20] Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., & Roberts, A.
(2019). GANSynth: Adversarial neural audio synthesis. In Proceedings of the
International Conference on Learning Representations (ICLR).
[21] Bińkowski, M., Donahue, C., Dieleman, S., Clark, A., Elsen, E., Casagrande, N., ... &
Simonyan, K. (2020). High fidelity speech synthesis with adversarial networks. In
Proceedings of the International Conference on Learning Representations (ICLR).
[22] Das, S., Nailwal, S., Raza, A., & Bhuyan, A. S. (2023). Analysis of Different
Machine and Deep Learning Algorithms for Audio Classification. In 2023 First
International Conference on Advances in Electrical, Electronics and Computational
Intelligence (ICAEECI) (pp. 1-7). IEEE.
https://doi.org/10.1109/ICAEECI58247.2023.10370821
[23] Miyazaki, K., Komatsu, T., Hayashi, T., Watanabe, S., Toda, T., & Takeda, K.
(2020). Weakly-Supervised Sound Event Detection with Self-Attention. In 2020
IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP) (pp. 66-70). IEEE.
https://doi.org/10.1109/ICASSP40776.2020.9053609
[24] Donahue, C., McAuley, J., & Puckette, M. (2018). Synthesizing Audio with
Generative Adversarial Networks. arXiv preprint arXiv:1802.04208.
[25] Khalid, S., Khalil, T., & Nasreen, S. (2014). A survey of feature selection and feature
extraction techniques in machine learning. In Proceedings of the 2014 Science and
Information Conference (pp. 372-378). https://doi.org/10.1109/SAI.2014.6918213.
[26] Krishna, G., Tran, C., Carnahan, M., Han, Y., & Tewfik, A. H. (2021). Generating
EEG features from acoustic features. In Proceedings of the 28th European Signal
Processing Conference (EUSIPCO) (pp. 1100-1104).
https://doi.org/10.23919/Eusipco47968.2020.9287498.
[27] Purwins, H., Li, B., Virtanen, T., Schlüter, J., Chang, S., & Sainath, T. (2019).
Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal
Processing, 13(2), 206-219.
[28] Sarma, C. M., & Dutta, P. (2017). A Review of Feature Extraction Techniques in
Speech Processing. International Journal of Computer Applications, 169(6), 22-25.
[29] Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals.
IEEE Transactions on Speech and Audio Processing, 10(5), 293-302.
[30] Rabiner, L., & Juang, B. H. (1993). Fundamentals of Speech Recognition. Prentice
Hall.
[31] Muda, L., Begam, M., & Elamvazuthi, I. (2010). Voice recognition algorithms using
Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping
(DTW) techniques. Journal of Computing, 2(3), 138-143.
[32] Li, C., & Chan, C. Y. (2019). Real-time Automatic Music Genre Classification with
Convolutional Neural Networks. IEEE Access, 7, 41047-41056.
[33] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., ... & Sainath, T. N.
(2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition: The
Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6),
82-97.
[34] Elbir, A. M., & Mishra, K. V. (2021). Cognitive radar signal processing with deep
learning networks. IEEE Signal Processing Magazine, 38(2), 43-59.
[35] Hosseini, M., Cavuoto, L. A., & Mailloux, Z. (2019). Robust feature extraction for
infant cry classification. Biomedical Signal Processing and Control, 47, 303-311.
[36] Kumar, A., & Zhang, Y. (2020). Noise robust speech recognition using MFCC and
deep learning techniques. Journal of Computer Science and Technology, 35(2), 359-
367.
[37] Zhao, X., & Li, Y. (2022). Enhancing MFCC features using data augmentation
techniques for robust speech recognition. IEEE Transactions on Audio, Speech, and
Language Processing, 30, 543-554.
[38] Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation
for deep learning. Journal of Big Data, 6(1), 1-48.
[39] Antoniou, A., Storkey, A., & Edwards, H. (2017). Data augmentation generative
adversarial networks. arXiv preprint arXiv:1711.04340.
[40] Bińkowski, M., Donahue, C., Dieleman, S., Clark, A., Elsen, E., Casagrande, N., ... &
Simonyan, K. (2020). High fidelity speech synthesis with adversarial networks.
arXiv preprint arXiv:1909.11646.
[41] Chou, J. C., Yeh, C. C., Lee, H. Y., & Lee, L. S. (2018). Multi-target voice
conversion without parallel data by adversarially learning disentangled audio
representations. In Interspeech (pp. 501-505).
[42] Donahue, C., McAuley, J., & Puckette, M. (2018). Synthesizing audio with
generative adversarial networks. arXiv preprint arXiv:1802.04208.
[43] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... &
Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information
processing systems (pp. 2672-2680).
[44] Kaneko, T., & Kameoka, H. (2018). CycleGAN-VC: Non-parallel voice conversion
using cycle-consistent adversarial networks. In 2018 26th European Signal Processing
Conference (EUSIPCO) (pp. 2100-2104). IEEE.
[45] Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for
generative adversarial networks. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) (pp. 4401-4410).
[46] Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., ... &
Courville, A. (2019). MelGAN: Generative adversarial networks for conditional
waveform synthesis. Advances in Neural Information Processing Systems, 32.
[47] Sahu, P., Wang, J., & Yang, Z. (2020). Enhancing speech recognition using
generative adversarial networks. IEEE Access, 8, 113086-113095.
[48] Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation
for deep learning. Journal of Big Data, 6(1), 1-48.
[49] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., ... &
Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint
arXiv:1609.03499.
[50] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X.
(2016). Improved techniques for training GANs. In Advances in Neural Information
Processing Systems (NIPS). arXiv preprint arXiv:1606.03498.
[51] Donahue, C., McAuley, J., & Puckette, M. (2018). Synthesizing audio with
generative adversarial networks. arXiv preprint arXiv:1802.04208.
[52] Odena, A., Dumoulin, V., & Olah, C. (2016). Deconvolution and checkerboard
artifacts. Distill. https://doi.org/10.23915/distill.00003.
[53] Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. arXiv preprint
arXiv:1701.07875.
[54] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017).
Improved training of Wasserstein GANs. In Advances in Neural Information Processing
Systems (pp. 5767-5777).
[55] Zhang, Y. F., Fitch, P., & Thorburn, P. J. (2020). Predicting the Trend of Dissolved
Oxygen Based on the kPCA-RNN Model. Water, 12(2), 585.
https://doi.org/10.3390/w12020585.
[56] Olah, C. (2015). Understanding LSTM Networks. Retrieved from
https://colah.github.io/posts/2015-08-Understanding-LSTMs/.