Abstract (English)
Music is part of everyday life, and familiar melodies can be heard everywhere. Sometimes a melody comes to mind that we recognize but cannot name, and we hum something similar to it in the hope of finding the song that contains it. Query by singing/humming (QbSH) systems are developed for exactly this purpose. According to where the features are extracted from, we propose two QbSH systems, called Dai-ChouNet27 and QBSHNet03. Dai-ChouNet27 is designed with reference to the architecture of DaiNet34, which outperforms other models on the environmental sound recognition task; it is an almost fully convolutional neural network whose last two layers are fully connected. The first convolutional layer uses a large kernel to filter out noise in the raw waveforms, the subsequent convolutional layers extract high-level features from the raw waveforms, and the two fully connected layers classify these features to produce the retrieval results.
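A minimal PyTorch sketch of the kind of raw-waveform architecture described above is given below; the layer counts, kernel sizes, and channel widths are illustrative assumptions, not the exact Dai-ChouNet27 configuration.

# Illustrative sketch only: layer counts, kernel sizes, and channel widths are
# assumptions, not the exact Dai-ChouNet27 configuration described in the thesis.
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    def __init__(self, num_songs: int):
        super().__init__()
        # Large first kernel acts as a learned noise-suppressing filter on the raw waveform.
        self.front = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(4),
        )
        # A stack of small-kernel convolutions extracts high-level features.
        self.features = nn.Sequential(
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(128, 256, kernel_size=3, padding=1), nn.BatchNorm1d(256), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(256, 512, kernel_size=3, padding=1), nn.BatchNorm1d(512), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # The last two layers are fully connected and output one score per candidate song.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, num_songs),
        )

    def forward(self, x):  # x: (batch, 1, samples) raw waveform
        return self.classifier(self.features(self.front(x)))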
QBSHNet03 is a QbSH system that combines the Shazam algorithm with a convolutional neural network (CNN). In QBSHNet03, the time-domain waveforms are first filtered by a ConvRBM to suppress noise. The filtered waveforms are converted into spectrograms with the short-time Fourier transform (STFT), and the Shazam algorithm then extracts features consisting of peak-frequency pairs and their time differences. Finally, several convolutional layers and two fully connected layers classify these features to obtain the retrieval results.
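The following is a rough sketch of Shazam-style landmark extraction from an STFT spectrogram, where each feature is a pair of peak frequencies plus their time difference. The peak-picking threshold, target-zone size, and fan-out are assumptions rather than the thesis's exact QBSHNet03 settings, and the ConvRBM pre-filtering stage is omitted.

# Illustrative sketch of Shazam-style landmark features from an STFT spectrogram.
import numpy as np
from scipy.signal import stft
from scipy.ndimage import maximum_filter

def landmark_features(waveform, sr, fan_out=5):
    # Spectrogram via short-time Fourier transform.
    _, _, spec = stft(waveform, fs=sr, nperseg=1024, noverlap=512)
    mag = np.abs(spec)
    # Keep only local spectral peaks (the "constellation map").
    peaks = (mag == maximum_filter(mag, size=(15, 15))) & (mag > mag.mean())
    freqs, times = np.nonzero(peaks)
    order = np.argsort(times)
    freqs, times = freqs[order], times[order]
    # Pair each anchor peak with a few later peaks: (f1, f2, time difference).
    features = []
    for i in range(len(times)):
        for j in range(i + 1, min(i + 1 + fan_out, len(times))):
            dt = times[j] - times[i]
            if 0 < dt <= 64:
                features.append((freqs[i], freqs[j], dt))
    return np.array(features)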
Three datasets are used to train and test QBSHNet03, Dai-ChouNet27, and DaiNet34: the MIR-QbSH corpus, a dataset of Taiwan's common children's songs, and a dataset of classical English songs. On the MIR-QbSH corpus, Dai-ChouNet27 performs much better than QBSHNet03 and DaiNet34: its training accuracy and MRR reach 99% and 0.99, and its testing accuracy, MRR, precision, and recall reach 84%, 0.88, 0.78, and 0.74, respectively. These results indicate that, for the QbSH task, features extracted directly from raw waveforms are more suitable than features extracted from spectrograms. Comparing the results across different clip lengths and SNR levels on the three datasets shows that Dai-ChouNet27 achieves outstanding performance when the datasets are large enough. When Dai-ChouNet27 is trained and tested with a suitable clip length and an SNR level it can still handle, the training and testing accuracy and MRR reach 84% and 0.87, respectively, and the testing precision and recall reach 0.7.
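For reference, the mean reciprocal rank (MRR) reported above can be computed as sketched below: each query contributes the reciprocal of the rank at which the correct song appears in the returned list, and the values are averaged over all queries. Function and variable names are illustrative.

# Minimal sketch of mean reciprocal rank (MRR) for a QbSH retrieval system.
def mean_reciprocal_rank(ranked_song_lists, correct_songs):
    total = 0.0
    for ranked, correct in zip(ranked_song_lists, correct_songs):
        rank = ranked.index(correct) + 1  # 1-based rank of the ground-truth song
        total += 1.0 / rank
    return total / len(correct_songs)

# Example: two queries whose correct songs are ranked 1st and 4th -> MRR = 0.625
print(mean_reciprocal_rank([["A", "B"], ["C", "D", "E", "A"]], ["A", "A"]))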
References
Wang, A. L.-C. (2003). An Industrial Strength Audio Search Algorithm. Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR 2003), Baltimore, Maryland, USA.
Oppenheim, A. V. & Schafer, R. W. (2010). Discrete-Time Signal Processing (3rd ed.). Prentice-Hall.
Dieleman, S. & Schrauwen, B. (2014). End-to-end learning for music audio. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964-6968.
Dai, W., Dai, C., Qu, S., Li, J., & Das, S. (2017). Very deep convolutional neural networks for raw waveforms. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 421-425.
Epri, W. P. (2021). Temporal and Spectral Analysis of Children Song Perception with Different Simulated Cochlear Implant Coding Strategies. Master's thesis, National Central University.
Ghias, A., Logan, J., Chamberlin, D., & Smith, B. C. (1995). Query by humming: Musical information retrieval in an audio database. Proc. ACM Multimedia '95, San Francisco, pp. 216-221.
Ho, L.-L., Wu, C.-M., Huang, K.-Y., & Lin, H.-C. (2009). Effects of channel number, stimulation rate, and electroacoustic stimulation of cochlear implant simulation on melody recognition in quiet and noise conditions. Proc. of the 7th Asia Pacific Symposium on Cochlear Implants and Related Sciences (APSCI), Singapore, Dec. 1-4, P3-32.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.
Jang, J.-S. R. (2006). MIR-QBSH Corpus. Available at http://mirlab.org/dataset/public/MIR-QBSH-corpus.rar. Accessed 2022/07/11.
Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., & Plumbley, M. D. (2020). PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 28, pp. 2880-2894.
Kim, K., Park, K. R., Park, S. J., Lee, S. P., & Kim, M. Y. (2011). Robust query-by-singing/humming system against background noise environments. IEEE Transactions on Consumer Electronics, Vol. 57, No. 2, pp. 720-725.
Kim, Y. & Park, C. H. (2013). Query by Humming by Using Scaled Dynamic Time Warping. 2013 International Conference on Signal-Image Technology & Internet-Based Systems, pp. 1-5.
Kim, T., Lee, J., & Nam, J. (2019). Comparison and Analysis of SampleCNN Architectures for Audio Classification. IEEE Journal of Selected Topics in Signal Processing, Vol. 13, No. 2, pp. 285-297.
Cohen, L. (1995). Time-Frequency Analysis. Prentice-Hall.
Park, H., & Yoo, C. D. (2020). CNN-Based Learnable Gammatone Filterbank and Equal-Loudness Normalization for Environmental Sound Classification. IEEE Signal Processing Letters, Vol. 27, pp. 411-415.
Park, M., Kim, H.-R., & Yang, S. H. (2006). Frequency-temporal filtering for a robust audio fingerprinting scheme in real-noise environments. ETRI Journal, Vol. 28, No. 4, pp. 509-512.
Song, C. J., Park, H., Yang, C. M., Jang, S. J., & Lee, S. P. (2013). Implementation of a Practical Query-by-Singing/Humming (QbSH) System and Its Commercial Applications. IEEE Transactions on Consumer Electronics, Vol. 59, No. 2, pp. 407-414.
Son, W., Cho, H. T., Yoon, K., & Lee, S. P. (2010). Sub-fingerprint masking for a robust audio fingerprinting system in a real-noise environment for portable consumer devices. IEEE Transactions on Consumer Electronics, Vol. 56, No. 1, pp. 156-160.
Sailor, H. B. & Patil, H. A. (2016). Novel Unsupervised Auditory Filterbank Learning Using Convolutional RBM for Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 24, No. 12, pp. 2341-2353.
Wang, C.-C. & Jang, J.-S. R. (2015). Improving Query-by-Singing/Humming by Combining Melody and Lyric Information. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 4, pp. 798-806.
Saito, K. (齋藤康毅) (2017). Deep Learning: 用Python進行深度學習的基礎理論實作 [Deep Learning: Fundamental theory and implementation of deep learning with Python]. Taipei: O'Reilly Media Taiwan.