Abstract (English)
Music is part of everyday life, and familiar melodies can be heard everywhere. Sometimes a melody comes to mind that we recognize but cannot name, and we hum something similar to it in the hope of finding the song that contains it. Query by singing/humming (QbSH) systems are developed for exactly this purpose. According to where the features are extracted from, we propose two QbSH systems, called Dai-ChouNet27 and QBSHNet03. Dai-ChouNet27 is designed with reference to the architecture of DaiNet34, which outperforms other models on the environmental sound recognition task; it is an almost fully convolutional neural network whose last two layers are fully connected. The first convolutional layer uses a large kernel to filter out noise in the raw waveforms, the subsequent convolutional layers extract high-level features from the raw waveforms, and the two fully connected layers classify these features to produce the retrieval results.
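A minimal PyTorch sketch of the kind of raw-waveform architecture described above is given below; the layer counts, kernel sizes, and channel widths are illustrative assumptions, not the exact Dai-ChouNet27 configuration.

# Illustrative sketch only: layer counts, kernel sizes, and channel widths are
# assumptions, not the exact Dai-ChouNet27 configuration described in the thesis.
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    def __init__(self, num_songs: int):
        super().__init__()
        # Large first kernel acts as a learned noise-suppressing filter on the raw waveform.
        self.front = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(4),
        )
        # A stack of small-kernel convolutions extracts high-level features.
        self.features = nn.Sequential(
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(128, 256, kernel_size=3, padding=1), nn.BatchNorm1d(256), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(256, 512, kernel_size=3, padding=1), nn.BatchNorm1d(512), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # The last two layers are fully connected and output one score per candidate song.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, num_songs),
        )

    def forward(self, x):  # x: (batch, 1, samples) raw waveform
        return self.classifier(self.features(self.front(x)))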
QBSHNet03 is a QbSH system that combines the Shazam algorithm with a convolutional neural network (CNN). In QBSHNet03, the time-domain waveforms are first filtered by a ConvRBM to suppress noise. The filtered waveforms are converted into spectrograms with the short-time Fourier transform (STFT), and the Shazam algorithm then extracts features consisting of peak-frequency pairs and their time differences. Finally, several convolutional layers and two fully connected layers classify these features to obtain the retrieval results.
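The following is a rough sketch of Shazam-style landmark extraction from an STFT spectrogram, where each feature is a pair of peak frequencies plus their time difference. The peak-picking threshold, target-zone size, and fan-out are assumptions rather than the thesis's exact QBSHNet03 settings, and the ConvRBM pre-filtering stage is omitted.

# Illustrative sketch of Shazam-style landmark features from an STFT spectrogram.
import numpy as np
from scipy.signal import stft
from scipy.ndimage import maximum_filter

def landmark_features(waveform, sr, fan_out=5):
    # Spectrogram via short-time Fourier transform.
    _, _, spec = stft(waveform, fs=sr, nperseg=1024, noverlap=512)
    mag = np.abs(spec)
    # Keep only local spectral peaks (the "constellation map").
    peaks = (mag == maximum_filter(mag, size=(15, 15))) & (mag > mag.mean())
    freqs, times = np.nonzero(peaks)
    order = np.argsort(times)
    freqs, times = freqs[order], times[order]
    # Pair each anchor peak with a few later peaks: (f1, f2, time difference).
    features = []
    for i in range(len(times)):
        for j in range(i + 1, min(i + 1 + fan_out, len(times))):
            dt = times[j] - times[i]
            if 0 < dt <= 64:
                features.append((freqs[i], freqs[j], dt))
    return np.array(features)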
Three datasets are used to train and test QBSHNet03, Dai-ChouNet27, and DaiNet34: the MIR-QbSH corpus, a dataset of Taiwan's common children's songs, and a dataset of classical English songs. On the MIR-QbSH corpus, Dai-ChouNet27 performs much better than QBSHNet03 and DaiNet34: its training accuracy and MRR reach 99% and 0.99, and its testing accuracy, MRR, precision, and recall reach 84%, 0.88, 0.78, and 0.74, respectively. These results indicate that, for the QbSH task, features extracted directly from raw waveforms are more suitable than features extracted from spectrograms. Comparing the results across different clip lengths and SNR levels on the three datasets shows that Dai-ChouNet27 achieves outstanding performance when the datasets are large enough. When Dai-ChouNet27 is trained and tested with a suitable clip length and an SNR level it can still handle, the training and testing accuracy and MRR reach 84% and 0.87, respectively, and the testing precision and recall reach 0.7.
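For reference, the mean reciprocal rank (MRR) reported above can be computed as sketched below: each query contributes the reciprocal of the rank at which the correct song appears in the returned list, and the values are averaged over all queries. Function and variable names are illustrative.

# Minimal sketch of mean reciprocal rank (MRR) for a QbSH retrieval system.
def mean_reciprocal_rank(ranked_song_lists, correct_songs):
    total = 0.0
    for ranked, correct in zip(ranked_song_lists, correct_songs):
        rank = ranked.index(correct) + 1  # 1-based rank of the ground-truth song
        total += 1.0 / rank
    return total / len(correct_songs)

# Example: two queries whose correct songs are ranked 1st and 4th -> MRR = 0.625
print(mean_reciprocal_rank([["A", "B"], ["C", "D", "E", "A"]], ["A", "A"]))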
References
Wang, A. L.-C. (2003). An Industrial Strength Audio Search Algorithm. Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR 2003), Baltimore, Maryland, USA.
Oppenheim, A. V. & Schafer, R. W. (2010). Discrete-Time Signal Processing (3rd ed.). Prentice-Hall.
Dieleman, S. & Schrauwen, B. (2014). End-to-end learning for music audio. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964-6968.
Dai, W., Dai, C., Qu, S., Li, J., & Das, S. (2017). Very deep convolutional neural networks for raw waveforms. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 421-425.
Epri, W. P. (2021). Temporal and Spectral Analysis of Children Song Perception with Different Simulated Cochlear Implant Coding Strategies. Master's thesis, National Central University.
Ghias, A., Logan, J., Chamberlin, D., & Smith, B. C. (1995). Query by humming: Musical information retrieval in an audio database. Proc. ACM Multimedia '95, San Francisco, pp. 216-221.
Ho, L.-L., Wu, C.-M., Huang, K.-Y., & Lin, H.-C. (2009). Effects of channel number, stimulation rate, and electroacoustic stimulation of cochlear implant simulation on melody recognition in quiet and noise conditions. Proc. of the 7th Asia Pacific Symposium on Cochlear Implants and Related Sciences (APSCI), Singapore, Dec. 1-4, P3-32.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.
Jang, J.-S. R. (2006). MIR-QBSH Corpus. Available at http://mirlab.org/dataset/public/MIR-QBSH-corpus.rar. Accessed 2022/07/11.
Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., & Plumbley, M. D. (2020). PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 28, pp. 2880-2894.
Kim, K., Park, K. R., Park, S. J., Lee, S. P., & Kim, M. Y. (2011). Robust query-by-singing/humming system against background noise environments. IEEE Transactions on Consumer Electronics, Vol. 57, No. 2, pp. 720-725.
Kim, Y. & Park, C. H. (2013). Query by Humming by Using Scaled Dynamic Time Warping. 2013 International Conference on Signal-Image Technology & Internet-Based Systems, pp. 1-5.
Kim, T., Lee, J., & Nam, J. (2019). Comparison and Analysis of SampleCNN Architectures for Audio Classification. IEEE Journal of Selected Topics in Signal Processing, Vol. 13, No. 2, pp. 285-297.
Cohen, L. (1995). Time-Frequency Analysis. Prentice-Hall.
Park, H., & Yoo, C. D. (2020). CNN-Based Learnable Gammatone Filterbank and Equal-Loudness Normalization for Environmental Sound Classification. IEEE Signal Processing Letters, Vol. 27, pp. 411-415.
Park, M., Kim, H.-R., & Yang, S. H. (2006). Frequency-temporal filtering for a robust audio fingerprinting scheme in real-noise environments. ETRI Journal, Vol. 28, No. 4, pp. 509-512.
Song, C. J., Park, H., Yang, C. M., Jang, S. J., & Lee, S. P. (2013). Implementation of a Practical Query-by-Singing/Humming (QbSH) System and Its Commercial Applications. IEEE Transactions on Consumer Electronics, Vol. 59, No. 2, pp. 407-414.
Son, W., Cho, H. T., Yoon, K., & Lee, S. P. (2010). Sub-fingerprint masking for a robust audio fingerprinting system in a real-noise environment for portable consumer devices. IEEE Transactions on Consumer Electronics, Vol. 56, No. 1, pp. 156-160.
Sailor, H. B. & Patil, H. A. (2016). Novel Unsupervised Auditory Filterbank Learning Using Convolutional RBM for Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 24, No. 12, pp. 2341-2353.
Wang, C.-C. & Jang, J.-S. R. (2015). Improving Query-by-Singing/Humming by Combining Melody and Lyric Information. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 4, pp. 798-806.
Saito, K. (齋藤康毅) (2017). Deep Learning: 用Python進行深度學習的基礎理論實作 [Deep Learning: Fundamental theory and implementation of deep learning with Python]. Taipei: O'Reilly Media Taiwan.