Master's/Doctoral Thesis 103522119: Detailed Record




Author: Chin-Chin Chiang (江金晉)    Department: Computer Science and Information Engineering
Thesis Title: 基於長短期記憶深層學習方法之動作辨識 (Action Recognition Based on a Long Short-Term Memory Deep Learning Method)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Preprocessing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Applications of a High-Quality Spoken-Narration System
★ Recognition and Detection of Calcaneal Fractures in CT Images Using Deep Learning and Accelerated Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Trend Prediction for Financial Products
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Full Text: Browsable only through the repository system (access: permanently restricted)
Abstract (Chinese): As quality of life continues to improve and convenience continues to increase, a great many functions and applications depend on the technical support and development behind them. From images to video and from postures to actions, what we need and what we face, as techniques and hardware keep advancing, are functions and effects of an ever higher standard.
Building on a deep learning architecture based on long short-term memory, we propose an optical flow attention model that performs action recognition in videos through the use of optical flow images. In the proposed architecture, each video is cut into frames, features are extracted from every frame with a CNN, and the features are fed in temporal order into the optical flow attention model. The attention model consists mainly of an LSTM; its distinguishing feature is that each input is first weighted by a processed optical flow attention map, raising the weight of the important parts of the feature. The adjusted feature is then fed into the LSTM, which produces the recognition result for that time step.
This thesis uses optical flow maps as weights to dynamically track the important regions of each frame, increasing the weight carried by important features. In the action recognition experiments, the proposed optical flow attention model is about 3.6% more accurate than an LSTM-only model and about 2.4% more accurate than the reference visual attention model. Combined with visual attention, the overall architecture is about 4.5% more accurate than the LSTM-only model and about 3.3% more accurate than the visual-attention-only model. The results show that optical flow maps used as weights effectively capture the discriminative regions of actions in video and complement visual attention, producing better recognition.
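The abstract describes the per-time-step computation only informally; the following is one plausible formalization with notation of our own choosing (the record does not specify how the thesis normalizes the flow map). Let x_{t,i} be the CNN feature vector at spatial location i of frame t (K locations in total) and f_{t,i} the corresponding optical-flow magnitude:
\[
\alpha_{t,i} = \frac{\exp(f_{t,i})}{\sum_{j=1}^{K}\exp(f_{t,j})}, \qquad
\tilde{x}_t = \sum_{i=1}^{K} \alpha_{t,i}\, x_{t,i},
\]
\[
(h_t, c_t) = \mathrm{LSTM}\big(\tilde{x}_t;\, h_{t-1}, c_{t-1}\big), \qquad
y_t = \mathrm{softmax}(W_y h_t + b_y),
\]
where y_t is the recognition result at time step t, matching the abstract's statement that the flow-weighted feature enters the LSTM and a result is produced per time step.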
Abstract (English): As the quality of life improves continuously and convenience increases constantly, many functions and applications rely on the technology developed and supported behind them. From images to video, and from gestures to actions, what we must deliver, as techniques and hardware keep improving, are ever better functions and effects.
Based on a long short-term memory deep learning architecture, we propose an optical flow attention model that performs action recognition on videos through the use of optical flow images. In the proposed architecture, each video is split into frames, each frame is fed into a CNN for feature extraction, and the features are input to the optical flow attention model in temporal order. The attention model is composed mainly of an LSTM; the characteristic of optical flow attention is that each input feature is first weighted by the optical flow weight image to highlight the important parts of the current feature. The adjusted feature is then fed into the LSTM, which produces the recognition result at that time step.
The thesis dynamically traces the important regions of each image using the optical flow image as weights, raising the weights on the important parts of the feature. In the action recognition experiments, the proposed optical flow attention model gains about 3.6% accuracy over a model that uses only an LSTM, and about 2.4% over the visual attention model we reference. Combining the visual attention model with our optical flow attention model yields about 4.5% higher accuracy than the LSTM and about 3.3% higher than the visual attention model. The experimental results show that using optical flow images as weights effectively captures the discriminative regions of actions in video and complements visual attention to reach a better recognition result.
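To make the data flow concrete, here is a minimal PyTorch sketch of the weighting-then-LSTM step described in the abstract. The class name OpticalFlowAttentionLSTM, all layer sizes, and the tensor shapes are illustrative assumptions, not the thesis's actual implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class OpticalFlowAttentionLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, n_classes=101):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, feats, flow_maps):
        # feats:     (T, D, H, W) per-frame CNN feature maps
        # flow_maps: (T, H, W) optical-flow magnitudes resized to the
        #            feature map's spatial resolution (assumed shapes)
        T, D, H, W = feats.shape
        h = feats.new_zeros(1, self.lstm.hidden_size)
        c = feats.new_zeros(1, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            # Turn the flow map into spatial attention weights.
            w = F.softmax(flow_maps[t].reshape(-1), dim=0)      # (H*W,)
            # Re-weight each spatial location's feature, then pool.
            x = (feats[t].reshape(D, -1) * w).sum(dim=1)        # (D,)
            h, c = self.lstm(x.unsqueeze(0), (h, c))
            logits.append(self.classifier(h))
        # One recognition result per time step, as in the abstract.
        return torch.stack(logits).squeeze(1)                   # (T, n_classes)

Averaging the per-step logits over T would give a video-level score; the combination with visual attention mentioned in the abstract would presumably add a second, learned attention branch, which this sketch omits.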
Keywords ★ Action recognition
★ Long short-term memory
★ Deep learning
★ Attention model
★ Convolutional neural network
★ Neural network
Table of Contents
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iv
List of Figures v
List of Tables vii
Chapter 1 Introduction 1
1.1 Foreword 1
1.2 Research Motivation and Objectives 1
1.3 Thesis Organization and Chapter Overview 3
Chapter 2 Literature Review on Neural Networks 5
2.1 Artificial Neural Networks 5
2.1.1 Development of Artificial Neural Networks 5
2.1.2 Principles of Artificial Neural Networks 6
2.1.3 Backpropagation in Artificial Neural Networks 9
2.2 Deep Neural Networks 14
2.2.1 Convolutional Neural Networks 16
2.2.2 Recurrent Neural Networks 20
2.3 Action Recognition 24
Chapter 3 Long Short-Term Memory Units 25
Chapter 4 The Visual Attention Model 30
Chapter 5 The Optical Flow Attention Model 35
Chapter 6 Experimental Results and Discussion 41
Chapter 7 Conclusions and Future Research Directions 49
Advisor: Jia-Ching Wang (王家慶)    Date of Approval: 2016-08-29
