References
[1] S. Lee, K. Song, and J. Choi, “Access to an automated security system using gesture-based passwords,” Proc. 2012 15th Int. Conf. Network-Based Inf. Syst. NBIS 2012, pp. 760–765, 2012.
[2] M. A. B. Sarijari, R. A. Rashid, M. R. A. Rahim, and N. H. Mahalin, “Wireless home security and automation system utilizing ZigBee based multi-hop communication,” Proc. IEEE 2008 6th Natl. Conf. Telecommun. Technol. IEEE 2008 2nd Malaysia Conf. Photonics, NCTT-MCP 2008, pp. 242–245, 2008.
[3] Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, “A review of recent advances in visual speech decoding,” Image Vis. Comput., vol. 32, no. 9, pp. 590–605, Sep. 2014.
[4] G. Potamianos, C. Neti, and A. W. Senior, “Recent advances in the automatic recognition of audiovisual speech,” Proc. IEEE, vol. 91, no. 9, 2003.
[5] G. Potamianos, C. Neti, and I. Matthews, “Audio-visual automatic speech recognition: an overview,” Issues in Visual and Audio-Visual Speech Processing, 2004.
[6] G. Zhao, M. Barnard, and M. Pietikäinen, “Lipreading with local spatiotemporal descriptors,” IEEE Trans. Multimed., vol. 11, no. 7, pp. 1254–1265, 2009.
[7] E. Gomez, C. M. Travieso, J. C. Briceno, and M. A. Ferrer, “Biometric identification system by lip shape,” Proc. 36th Annu. Int. Carnahan Conf. Secur. Technol., pp. 39–42, 2002.
[8] Y. Lan, R. Harvey, B. Theobald, E. Ong, and R. Bowden, “Comparing visual features for lipreading,” in Auditory-Visual Speech Processing (AVSP), 2009.
[9] D. Bordencea, H. Valean, S. Folea, and A. Dobircau, “Agent based system for home automation, monitoring and security,” 2011 34th Int. Conf. Telecommun. Signal Process. TSP 2011 - Proc., pp. 165–169, 2011.
[10] W. L. Ng, C. K. Ng, N. K. Noordin, and B. Mohd. Ali, “Gesture based automating household appliances,” Lect. Notes Comput. Sci., vol. 6762, part 2, pp. 285–293, 2011.
[11] A. K. Gnanasekar, P. Jayavelu, and V. Nagarajan, “Speech recognition based wireless automation of home loads with fault identification for physically challenged,” 2012 Int. Conf. Commun. Signal Process. ICCSP-2012, pp. 128–132, 2012.
[12] S. Cox, R. Harvey, Y. Lan, J. Newman, and B. Theobald, “The challenge of multispeaker lip-reading,” Int. Conf. Audit. Vis. Speech Process., 2008.
[13] G. Zhao and M. Pietikäinen, “Local binary pattern descriptors for dynamic texture recognition,” 18th Int. Conf. Pattern Recognit., vol. 2, pp. 18–21, 2006.
[14] P. A. Crook, V. Kellokumpu, G. Zhao, and M. Pietikäinen, “Human activity recognition using a dynamic texture based method,” Proc. Br. Mach. Vis. Conf. 2008, pp. 88.1–88.10, 2008.
[15] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York: Springer-Verlag, 2002.
[16] H. Yu and J. Yang, “A direct LDA algorithm for high-dimensional data with application to face recognition,” Pattern Recognit., vol. 34, pp. 2067–2070, 2001.
[17] Z.-Q. Zhao, H. Glotin, Z. Xie, J. Gao, and X. Wu, “Cooperative sparse representation in two opposite directions for semi-supervised image annotation,” IEEE Trans. Image Process., vol. 21, no. 9, pp. 4218–4231, Sep. 2012.
[18] P. Comon, “Independent component analysis, a new concept?,” Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[19] S. Tsuge, M. Shishibori, S. Kuroiwa, and K. Kita, “Dimensionality reduction using non-negative matrix factorization for information retrieval,” 2001 IEEE Int. Conf. Syst. Man Cybern. e-Systems e-Man Cybern. Cybersp., vol. 2, pp. 960–965, 2001.
[20] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, 2009.
[21] L. Zhang, W.-D. Zhou, P.-C. Chang, J. Liu, Z. Yan, T. Wang, and F.-Z. Li, “Kernel sparse representation-based classifier,” IEEE Trans. Signal Process., vol. 60, no. 4, pp. 1684–1695, Apr. 2012.
[22] Y. Li and A. Ngom, “Sparse representation for the classification of high-dimensional biological data,” BMC Syst. Biol., vol. 07, pp. 306–311, 2013.
[23] M. Choraś, “Lips recognition for biometrics,” Proc. Int. Conf. Biometrics (ICB), pp. 1260–1269, 2009.
[24] H. A. Mahmoud, F. Bin Muhaya, and A. Hafez, “Lip reading based surveillance system,” 2010 5th Int. Conf. Futur. Inf. Technol. Futur. 2010 - Proc., 2010.
[25] S. Sengupta, A. Bhattacharya, P. Desai, and A. Gupta, “Automated lip reading technique for password authentication,” Int. J. Appl. Inf. Syst., vol. 4, no. 3, pp. 18–24, 2012.
[26] P. Singh, V. Laxmi, and M. S. Gaur, “Lip peripheral motion for visual surveillance,” Proc. Fifth Int. Conf. Secur. Inf. Networks, pp. 173–177, 2012.
[27] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey, “Extraction of visual features for lipreading,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 198–213, 2002.
[28] X. Liu and Y. M. Cheung, “Learning multi-boosted HMMs for lip-password based speaker verification,” IEEE Trans. Inf. Forensics Secur., vol. 9, no. 2, pp. 233–246, 2014.
[29] S. W. Foo, Y. Lian, and L. Dong, “Recognition of visual speech elements using adaptively boosted hidden Markov models,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 5, pp. 693–705, 2004.
[30] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active shape models - their training and application,” Comput. Vis. Image Underst., vol. 61, no. 1, pp. 38–59, 1995.
[31] A. Lanitis, C. J. Taylor, and T. F. Cootes, “Automatic interpretation and coding of face images using flexible models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 743–756, 1997.
[32] T. Cootes and C. Taylor, “Combining point distribution models with shape models based on finite element analysis,” Image Vis. Comput., vol. 13, no. 5, pp. 403–409, 1995.
[33] S. Gao, I. W.-H. Tsang, and L.-T. Chia, “Sparse representation with kernels,” IEEE Trans. Image Process., vol. 22, no. 2, pp. 423–434, Feb. 2013.
[34] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” Proc. 5th Eur. Conf. Comput. Vis. (Computer Vis. - ECCV’98), vol. 23, no. 6, pp. 484–498, 1998.
[35] T. F. Cootes and C. J. Taylor, “A mixture model for representing shape variation,” Image Vis. Comput., vol. 17, no. 8, pp. 567–573, 1999.
[36] I. Matthews, T. Cootes, S. Cox, R. Harvey, and J. A. Bangham, “Lipreading using shape, shading and scale,” Auditory-Visual Speech Process. (AVSP), 1998.
[37] J. F. Guitarte Pérez, A. F. Frangi, E. L. Solano, and K. Lukas, “Lip reading for robust speech recognition on embedded devices,” ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc., vol. I, pp. 473–476, 2005.
[38] I. Shdaifat, R. Grigat, and D. Langmann, “A system for automatic lip reading,” in AVSP 2003, International Conference on Audio-Visual Speech Processing, 2003.
[39] T. F. Cootes, G. Edwards, and C. J. Taylor, “Comparing active shape models with active appearance models,” Proc. Br. Mach. Vis. Conf. 1999, pp. 18.1–18.10, 1999.
[40] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey, “Extraction of visual features for lipreading,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 198–213, 2002.
[41] T. Ojala, M. Pietikäinen, and T. Mäenpää, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, 2002.
[42] T. Ojala, M. Pietikäinen, and T. Mäenpää, “A generalized local binary pattern operator for multiresolution gray scale and rotation invariant texture classification,” Adv. Pattern Recognit., vol. 2013, pp. 399–408, 2001.
[43] T. Kobayashi and J. Ye, “Acoustic feature extraction by statistics based local binary pattern for environmental sound classification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3052–3056, 2014.
[44] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” Lect. Notes Comput. Sci., vol. 6314, pp. 143–156, 2010.
[45] V. Kellokumpu, G. Zhao, and M. Pietikäinen, “Human activity recognition using a dynamic texture based method,” Br. Mach. Vis. Conf., pp. 1–10, 2008.
[46] C. H. Chan, B. Goswami, J. Kittler, and W. Christmas, “Local ordinal contrast pattern histograms for spatiotemporal, lip-based speaker authentication,” IEEE Trans. Inf. Forensics Secur., vol. 7, no. 2, pp. 602–612, 2012.
[47] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: The extended M2VTS database,” in Second International Conference on Audio and Video-based Biometric Person Authentication (AVBPA’99), pp. 72–77, 1999.
[48] M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski, “Face recognition by independent component analysis,” IEEE Trans. Neural Networks, vol. 13, no. 6, pp. 1450–1464, 2002.
[49] A. J. Bell and T. J. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural Comput., vol. 7, no. 6, pp. 1129–1159, 1995.
[50] P. Paatero, “Least squares formulation of robust non-negative factor analysis,” Chemom. Intell. Lab. Syst., vol. 37, no. 1, pp. 23–35, 1997.
[51] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[52] C. Ding, T. Li, and M. I. Jordan, “Convex and semi-nonnegative matrix factorizations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 45–55, 2010.
[53] W.-C. Hsieh, C.-W. Ho, V.-H. Duong, Y.-S. Lee, and J.-C. Wang, “2D semi-NMF of scale-frequency map for environmental sound classification,” Signal Inf. Process. Assoc. Annu. Summit Conf. (APSIPA), 2014 Asia-Pacific, pp. 1–4, Dec. 2014.
[54] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer, 1995.
[55] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” Proc. 5th Annu. ACM Work. Comput. Learn. Theory, pp. 144–152, 1992.
[56] M. Gurban and J. P. Thiran, “Audio-visual speech recognition with a hybrid SVM-HMM system,” in 13th European Signal Processing Conference (EUSIPCO), pp. 728–731, 2005.
[57] J. He and Z. Hua, “Lipreading recognition based on SVM and DTAK,” 2010 4th Int. Conf. Bioinforma. Biomed. Eng. iCBBE 2010, no. 2, pp. 1–3, 2010.
[58] A. A. Shaikh, D. K. Kumar, W. C. Yau, and J. Gubbi, “Lip reading using optical flow and support vector machines,” 3rd Int. Congr. Image Signal Process., pp. 327–330, 2010.
[59] M. Gordan, C. Kotropoulos, and I. Pitas, “Visual speech recognition using support vector machines,” 2002 14th Int. Conf. Digit. Signal Process. Proceedings. DSP 2002 (Cat. No.02TH8628), vol. 2, pp. 1093–1096, 2002.
[60] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, pp. 257–286, 1989.
[61] N. Morgan and H. Bourlard, “An introduction to hybrid HMM/connectionist continuous speech recognition,” IEEE Signal Process. Mag., vol. 12, no. 3, pp. 25–42, 1995.
[62] S. Gao, I. W. Tsang, and L. Chia, “Kernel sparse representation for image classification and face recognition,” ECCV, pp. 1–14, 2010.
[63] S. Siatras, N. Nikolaidis, M. Krinidis, and I. Pitas, “Visual lip activity detection and speaker detection using mouth region intensities,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 1, pp. 133–137, Jan. 2009.
[64] B. Rivet, L. Girin, and C. Jutten, “Visual voice activity detection as a help for speech source separation from convolutive mixtures,” Speech Commun., vol. 49, no. 7–8, pp. 667–677, 2007.
[65] Q. Liu, W. Wang, and P. Jackson, “A visual voice activity detection method with adaboosting,” in Sensor Signal Processing for Defence, pp. 1–5, 2011.
[66] V. Libal, J. Connell, G. Potamianos, and E. Marcheret, “An embedded system for in-vehicle visual speech activity detection,” 2007 IEEE 9th Int. Work. Multimed. Signal Process. MMSP 2007 - Proc., pp. 255–258, 2007.
[67] T. Huang, G. Yang, and G. Tang, “A fast two-dimensional median filtering algorithm,” IEEE Trans. Acoust., vol. 27, no. 1, 1979.
[68] M. Pietikäinen, A. Hadid, G. Zhao, and T. Ahonen, Computer Vision Using Local Binary Patterns, vol. 40. London: Springer London, 2011.
[69] D. Zhang, S. Chen, and Z. Zhou, “Two-dimensional non-negative matrix factorization for face representation and recognition,” ICCV 2005 Work. Anal. Model. Faces Gestures, pp. 350–363, 2005.
[70] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Min. Knowl. Discov., vol. 2, no. 2, pp. 121–167, 1998.
[71] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” Proc. 28th Int. Conf. Mach. Learn., pp. 689–696, 2011.