Fast Cover Song Retrieval in AAC Domain based on Deep Learning

Name

張育瑞 (Yu-ruey Chang)

Department

Department of Communication Engineering

Thesis Title

Fast Cover Song Retrieval in AAC Domain based on Deep Learning
(基於深度學習之AAC壓縮域翻唱歌快速檢索)

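This electronic thesis is authorized for immediate open access.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.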

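Abstract (Chinese)

As the volume of multimedia data grows, quickly finding the items a user is interested in within a large database has become an increasingly important problem. Traditional retrieval methods mostly search by keyword, which requires considerable manual effort to annotate the data in advance; as the amount of data grows, keyword annotation becomes less flexible. Content-based retrieval is a more natural approach, and it also avoids the problem of different people assigning different labels to the same song.
This thesis targets AAC, a music format now common on the Internet, and proposes fast cover song retrieval performed in the AAC compressed domain. The method maps partially decoded MDCT coefficients to chroma features, then groups multiple frames into segments that serve as the input to deep learning, which automatically learns key features that better represent the music. A sparse autoencoder then reduces the dimensionality of each song, alleviating the long matching times of traditional methods. Experimental results show that the proposed method achieves a retrieval performance of 0.505 in MRR and saves roughly 70% or more of the matching time compared with retrieval methods in the related literature.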
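The feature step described in the abstract (MDCT coefficients mapped to chroma, then frames grouped into segments) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: the bin-centre formula, the 55 Hz to 5000 Hz band, the nearest-semitone folding, the energy normalization, and the function names `mdct_to_chroma` and `frames_to_segments` are all assumptions introduced here.

```python
import numpy as np

def mdct_to_chroma(mdct, sample_rate=44100.0, fmin=55.0, fmax=5000.0):
    """Fold one frame of MDCT coefficients into a 12-bin chroma vector.

    Assumption: bin k is centred at (k + 0.5) * fs / (2 * n_bins), and each
    bin's energy is assigned to the nearest equal-tempered pitch class.
    """
    mdct = np.asarray(mdct, dtype=float)
    n_bins = mdct.shape[0]
    freqs = (np.arange(n_bins) + 0.5) * sample_rate / (2.0 * n_bins)
    keep = (freqs >= fmin) & (freqs <= fmax)  # hypothetical analysis band
    # MIDI note number of each bin, then pitch class 0..11 (0 = C, 9 = A)
    midi = 69.0 + 12.0 * np.log2(freqs[keep] / 440.0)
    pitch_class = np.mod(np.rint(midi).astype(int), 12)
    # Sum the energy (squared coefficients) falling into each pitch class
    chroma = np.bincount(pitch_class, weights=mdct[keep] ** 2, minlength=12)
    total = chroma.sum()
    return chroma / total if total > 0 else chroma

def frames_to_segments(chroma_frames, frames_per_segment):
    """Concatenate consecutive chroma frames into fixed-length segments.

    Assumption: a segment is a plain concatenation of its frames; trailing
    frames that do not fill a whole segment are dropped.
    """
    chroma_frames = np.asarray(chroma_frames, dtype=float)
    n_segments = chroma_frames.shape[0] // frames_per_segment
    trimmed = chroma_frames[: n_segments * frames_per_segment]
    return trimmed.reshape(n_segments, frames_per_segment * 12)
```

Each row returned by `frames_to_segments` would then be one input vector for the network; the segment length is a free parameter in this sketch.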

Abstract (English)

As the volume of multimedia data grows, quickly finding items of interest in large databases becomes increasingly important. Keyword annotation is the traditional approach, but it requires a large amount of manual effort. As the size of the data increases, the keyword annotation approach becomes infeasible. Content-based retrieval is more natural: it extracts features from the music content to create a representation that overcomes human labeling errors.
This thesis focuses on AAC files, which are widely used by Internet streaming sources. The proposed system directly maps the modified discrete cosine transform (MDCT) coefficients into a 12-dimensional chroma feature. Frames are combined into segments that serve as the input to deep learning, which can automatically find more meaningful features of the music data. A sparse autoencoder is also applied to reduce the dimensionality of the songs. With these efforts, significant matching time can be saved. The experimental results show that the proposed method reaches a mean reciprocal rank (MRR) of 0.505 and saves over 70% of the matching time compared with conventional approaches.
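For readers unfamiliar with the metric, mean reciprocal rank averages 1/rank of the first correct result over all queries. The sketch below shows the standard definition under the assumption that each query has exactly one correct cover in the database; the function name `mean_reciprocal_rank` and the dictionary-based inputs are illustrative and not taken from the thesis.

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """Average of 1/rank of the correct item over all queries.

    ranked_lists: dict mapping query id -> list of item ids, best match first
    relevant:     dict mapping query id -> the single correct item id (assumed)
    A query whose correct item is missing from its ranking contributes 0.
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for query_id, ranking in ranked_lists.items():
        target = relevant[query_id]
        for rank, item in enumerate(ranking, start=1):
            if item == target:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Usage: the correct cover is ranked 1st for q1 and 2nd for q2.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
truth = {"q1": "a", "q2": "y"}
print(mean_reciprocal_rank(rankings, truth))  # (1/1 + 1/2) / 2 = 0.75
```

Under this definition, an MRR near 0.5 corresponds to the correct cover appearing, on average in the reciprocal sense, around the second position of the ranked list.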

Keywords (Chinese)

★ music retrieval
★ cover songs
★ AAC
★ deep learning

Keywords (English)

★ music information retrieval
★ cover song
★ AAC
★ deep learning

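Table of Contents

Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.3 Thesis Organization
Chapter 2 Overview of Music Retrieval and Audio Compression Techniques
2.1 Overview of Music Retrieval
2.1.1 Content-Based Music Retrieval
2.1.2 Cover Song Identification
2.2 Overview of Audio Compression Techniques
2.3 Related Work on Music Retrieval in the Uncompressed Domain
2.4 Related Work on Music Retrieval in the Compressed Domain
2.5 Partial Decoding
2.6 Audio Feature Extraction
2.6.1 Feature Extraction
2.6.2 Segment Partitioning
2.7 Similarity Matching
2.7.1 OTI and Chroma Similarity Matrix
2.7.2 Dynamic Time Warping
Chapter 3 Deep Learning
3.1 Backpropagation Neural Networks
3.2 Deep Belief Networks
3.3 Autoencoders
Chapter 4 Proposed Deep Learning Retrieval Method
4.1 System Architecture
4.1.1 Energy Normalization and Block Partitioning
4.1.2 Training Phase
4.1.3 Testing Phase
4.2 Experimental Data
Chapter 5 Conclusions and Future Work
References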

Advisor

張寶基

Date of Approval

2015-11-16
