Master's and Doctoral Theses: Detailed Record 103582603




Author: S P Kasthuri Arachchi (珊芝莉)    Department: Computer Science and Information Engineering
Thesis Title: Modelling Spatial-Motion Multimodal Deep Learning Approaches to Classify Dynamic Patterns of Videos
(Chinese title: 以多模態時空域建模的深度學習方法分類影像中的動態模式)
Related Theses
★ Attention-Based Semantic Segmentation for Object Localization
★ Automatic Door Detection Based on Graph Convolutional Networks
★ Learning Content Generation and Summarization Methods Based on Vocational Skills and Educational Videos
  1. The author has agreed to make this electronic thesis openly available immediately.
  2. The open-access electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research.
  3. Please observe the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese) In computer vision, video classification is an essential process for analyzing the semantic information of video content. This thesis improves on common deep learning classification models and proposes multimodal deep learning approaches suited to classifying dynamic patterns in videos. Under demanding conditions such as varying illumination, the handcrafted features used by traditional methods are insufficient and inefficient, especially for videos with complex content. Whereas previous video classification studies focused mainly on the relationships among individual video streams, this thesis adopts deep learning as its strategy and successfully improves classification accuracy. Most deep learning models use convolutional neural networks and long short-term memory networks as base models; these can classify objects and actions and perform well on video classification tasks with dynamic temporal content.
First, the single-stream networks and underlying experimental networks in this thesis comprise convolutional neural networks (CNN), long short-term memory networks (LSTM), and gated recurrent units (GRU). In the LSTM and GRU models, the parameters of each layer and the dropout values are obtained through fine-tuning. Three models are compared in this study: (1) LRCN, which combines convolutional layers with long-range temporal recursion; (2) seqLSTMs, one of the most effective models for modeling sequential data; and (3) seqGRUs, which requires less computation than LSTM.
Secondly, to account for spatial motion relationships, this thesis proposes a novel model that takes RGB images and optical flow images as a two-stream input, named state-exchanging long short-term memory (SE-LSTM), which is a main contribution of this thesis. With SE-LSTM, dynamic videos can be classified by integrating short-term motion, spatial, and long-term temporal information; the model extends LSTM by exchanging information through the previous cell states of the appearance and motion streams. In addition, this thesis proposes Dual-CNNSELSTM, a two-stream model that combines SE-LSTM with a CNN. To validate the SE-LSTM architecture, experiments are conducted on various videos such as firework displays, hand gestures, and human actions. The results show that the proposed two-stream Dual-CNNSELSTM architecture clearly outperforms other single-stream and two-stream baseline models, reaching accuracies of 81.62%, 79.87%, and 69.86% on the hand gesture, firework, and HMDB51 human action datasets, respectively. Overall, the results demonstrate that the proposed model is well suited to classifying dynamic patterns against static backgrounds, surpassing the Dual-3DCNNLSTM model and other baselines.
Abstract (English) Video classification is an essential process for analyzing the pervasive semantic information of video content in computer vision. This thesis presents multimodal deep learning approaches to classify the dynamic patterns of videos, going beyond common types of pattern classification. Traditional handcrafted features are insufficient for classifying complex video information because of the similarity of visual content under different illumination conditions. Prior studies of video classification focused on the relationships between the standalone streams themselves. In contrast, this study leverages deep learning methodologies to improve video analysis performance significantly. Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks are widely used to build complex models and have shown great competency in modeling temporal dynamics for video-based pattern classification.
First, single-stream networks are considered; the underlying experimental models consist of CNN, LSTM, and Gated Recurrent Unit (GRU) layers. Their layer parameters are fine-tuned, and different dropout values are used with the sequence LSTM and GRU models. This study compares the accuracy of three basic models: (1) the Long-term Recurrent Convolutional Network (LRCN), which combines convolutional layers with long-range temporal recursion; (2) the seqLSTMs model, one of the most effective structures for modeling sequential data; and (3) the seqGRUs model, which requires fewer computational steps than LSTM.
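To make the single-stream setup concrete, the sketch below is a minimal LRCN-style model written with Keras (cited among the thesis references). The frame count, layer widths, number of classes, and placement of the two dropout layers are illustrative assumptions, not the exact configuration evaluated in Chapter 3.

```python
# Minimal single-stream LRCN-style sketch: a small CNN applied per frame,
# followed by an LSTM over the frame sequence. All sizes are illustrative
# assumptions, not the thesis's tuned configuration.
from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH, NUM_CLASSES = 16, 112, 112, 5  # assumed values

def build_lrcn(dropout_rate=0.5):
    frames = layers.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, 3))
    # Per-frame CNN shared across all time steps.
    cnn = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.GlobalAveragePooling2D(),
    ])
    x = layers.TimeDistributed(cnn)(frames)
    x = layers.Dropout(dropout_rate)(x)   # first dropout layer
    x = layers.LSTM(128)(x)               # long-range temporal recursion
    x = layers.Dropout(dropout_rate)(x)   # second dropout layer (double-dropout variant)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(frames, outputs)

model = build_lrcn()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Replacing the LSTM layer with a GRU layer would give a seqGRU-style counterpart, which, as noted above, needs fewer computational steps per update.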
Second, a two-stream network architecture that takes both RGB and optical flow data as input is used to capture spatial motion relationships. As the main contribution of this work, a novel two-stream neural network concept named state-exchanging long short-term memory (SE-LSTM) is introduced. By exchanging spatial and motion states, the SE-LSTM can classify dynamic patterns of videos by integrating short-term motion, spatial, and long-term temporal information. The SE-LSTM extends the general-purpose LSTM by exchanging information with the previous cell states of both the appearance and motion streams. Further, a novel two-stream model, Dual-CNNSELSTM, which combines the SE-LSTM concept with a CNN, is proposed. Various video datasets (firework displays, hand gestures, and human actions) are used to validate the proposed SE-LSTM architecture. Experimental results demonstrate that the proposed two-stream Dual-CNNSELSTM architecture significantly outperforms other single- and two-stream baseline models, achieving accuracies of 81.62%, 79.87%, and 69.86% on the hand gesture, firework display, and HMDB51 human action datasets, respectively. Overall, the results indicate that the proposed model is best suited to static-background dynamic pattern classification, outperforming the baseline and Dual-3DCNNLSTM models.
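For the motion-stream input, dense optical flow is extracted with the Farneback method (Section 4.6.3 of the table of contents; references [108], [109]). The sketch below, assuming OpenCV's Python bindings and the parameter values from the OpenCV tutorial cited in [109], shows one plausible way to turn a clip into per-frame-pair flow maps; the exact resizing, normalization, and stacking used in the thesis may differ.

```python
# Dense optical flow extraction with OpenCV's Farneback method, producing a
# motion-stream input from consecutive frames. Parameter values follow the
# OpenCV tutorial; the thesis's preprocessing may differ.
import cv2
import numpy as np

def dense_flow_stack(video_path, max_frames=16):
    """Return an array of shape (T, H, W, 2) with horizontal/vertical flow."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError("cannot read " + video_path)
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flows = []
    while len(flows) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Polynomial-expansion flow between two consecutive grayscale frames.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        prev_gray = gray
    cap.release()
    if not flows:
        raise IOError("no frame pairs found in " + video_path)
    return np.stack(flows)

# The RGB frames feed the appearance stream and these flow maps feed the
# motion stream of the two-stream models described above.
```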
Keywords (Chinese) ★ 動態圖形分類 (dynamic pattern classification)
★ 深度學習 (deep learning)
★ 時空數據 (spatiotemporal data)
★ 卷積神經網路 (convolutional neural network)
★ 循環神經網路 (recurrent neural network)
Keywords (English) ★ Dynamic Pattern Classification
★ Deep Learning
★ Spatiotemporal Data
★ Convolution Neural Network
★ Recurrent Neural Network
Table of Contents
Abstract i
摘要 (Chinese Abstract) iii
Acknowledgement iv
Table of Contents vii
List of Figures xi
List of Tables xv
Abbreviations xvi
Explanation of Symbols xvii
Chapter 1: Introduction 1
1.1 Background 1
1.2 Dissertation Organization 3
Chapter 2: Related Work 4
2.1. Artificial Neural Networks 4
2.2. Deep Neural Network 7
2.3. Convolution Neural Network 8
2.3.1. Basics of Convolution Neural Network 8
2.3.2. Mathematical Background of Convolutional Neural Network 12
2.3.3. Successful Convolutional Neural Network Architectures 14
2.4. Recurrent Neural Network 16
2.4.1. Overview of Recurrent Neural Network 16
2.4.2. Gradients Vanishing Problem 18
2.4.3. Long short-term memory networks (LSTM) 19
2.4.4. The compact forms of the equations of an LSTM unit 21
2.4.5. Variants of Long short-term memory 22
2.4.6. Successful Long Short-term Memory Architectures 25
2.5. Training Neural Network 29
2.5.1. Supervised learning and unsupervised learning 29
2.5.2. Backpropagation 30
2.5.3. Learning Rate Configuration 36
2.5.4. Optimization Methods 37
2.6. Overfitting and Underfitting 47
2.6.1. Regularization Techniques and Constraints 48
2.6.2. Dropout 49
2.7. Batch Normalization 54
2.8. Deep Neural Networks for Video Classification 55
Chapter 3: Single Stream 58
3.1 Firework Dataset 59
3.2 Experimental Architectures 61
3.2.1. LRCN - with single Dropout layer (Lrcn1Drop) and double Dropout layers (Lrcn2Drop) 62
3.2.2. RNN- SeqLSTM with single Dropout layer (Lstm1Drop) and with double Dropout layers (Lstm2Drop) 64
3.2.3. RNN - SeqGRU with single Dropout layer (Gru1Drop) and with double Dropout layers (Gru2Drop) 66
3.3 Experimental Evaluation 67
3.3.1. Experimental Setup 67
3.3.2. Evaluate Model Skill with Dataset Size 68
3.3.3. Model Performance with Dropout Layers 71
3.3.4. Fine-tuning Parameters 73
3.3.5. LSTM over GRU Performance 74
3.3.6. Model Complexity 75
3.3.7. Classification Mismatches 76
Chapter 4: Two Stream 79
4.1 Introduction 79
4.1.1. Modeling Long-Term Temporal Dynamics 79
4.1.2. End-to-End CNN Architecture 81
4.2 State-Exchanging Long Short-Term Memory (SE-LSTM) 82
4.2.1. State-Exchanging Process 84
4.2.2. State-Exchanging Package 86
4.3 Dual-CNNSELSTM Model 88
4.4 Dual-3DCNNLSTM Model 90
4.5 Two-stream Fusion 91
4.6 Datasets 93
4.6.1. Hand Gestures Dataset 93
4.6.2. HMDB51 Dataset 94
4.6.3. Dense Optical Flow 94
Chapter 5: Evaluation with Firework Dataset 97
5.1 Experimental Setup 97
5.2 Comparison of Control Experiments 98
5.3 Performance Evaluation of Single and Two Stream Networks 99
5.4 SE-LSTM Performance Over Two-stream Baseline Models 101
5.5 Model Skills with Multiple Fusion Benchmarks 103
5.6 Effect of Learning Rate Schedules 104

Chapter 6: Evaluation with Other Datasets 108
6.1 Comparison with Firework dataset 108
6.2 Performance of Dual-CNNSELSTM with Dual-3DCNNLSTM Model 109
6.3 Evaluation of HMDB51 Dataset 113
Chapter 7: Discussion 116
Chapter 8: Conclusion and Future Works 119
Appendix A 121
References 127
References
[1] Z. Wu, T. Yao, Y. Fu, and Y.-G. Jiang, “Deep Learning for Video Classification and Captioning,” ArXiv160906782 Cs, pp. 3–29, Dec. 2017, doi: 10.1145/3122865.3122867.
[2] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning Hierarchical Features for Scene Labeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, Aug. 2013, doi: 10.1109/TPAMI.2012.231.
[3] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” ArXiv14091556 Cs, Sep. 2014, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1409.1556.
[4] S. Ji, W. Xu, M. Yang, and K. Yu, “3D Convolutional Neural Networks for Human Action Recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013, doi: 10.1109/TPAMI.2012.59.
[5] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994, doi: 10.1109/72.279181.
[6] K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks for Action Recognition in Videos,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 568–576.
[7] H. Wang and C. Schmid, “Action Recognition with Improved Trajectories,” in 2013 IEEE International Conference on Computer Vision, Dec. 2013, pp. 3551–3558, doi: 10.1109/ICCV.2013.441.
[8] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond Short Snippets: Deep Networks for Video Classification,” ArXiv150308909 Cs, Mar. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1503.08909.
[9] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, “Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification,” ArXiv150401561 Cs, Apr. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1504.01561.
[10] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis,” presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1010–1019, Accessed: Jun. 13, 2020. [Online]. Available: http://openaccess.thecvf.com/content_cvpr_2016/html/Shahroudy_NTU_RGBD_A_CVPR_2016_paper.html.
[11] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, “A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 1961–1970, doi: 10.1109/CVPR.2016.216.
[12] L. Pigou, A. van den Oord, S. Dieleman, M. Van Herreweghe, and J. Dambre, “Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video,” ArXiv150601911 Cs Stat, Feb. 2016, Accessed: Jan. 18, 2020. [Online]. Available: http://arxiv.org/abs/1506.01911.
[13] Yong Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1110–1118, doi: 10.1109/CVPR.2015.7298714.
[14] V. Veeriah, N. Zhuang, and G.-J. Qi, “Differential Recurrent Neural Networks for Action Recognition,” ArXiv150406678 Cs, Apr. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1504.06678.
[15] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, “Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 4207–4215, doi: 10.1109/CVPR.2016.456.
[16] N. L. Hakim, T. K. Shih, S. P. Kasthuri Arachchi, W. Aditya, Y.-C. Chen, and C.-Y. Lin, “Dynamic Hand Gesture Recognition Using 3DCNN and LSTM with FSM Context-Aware Model,” Sensors, vol. 19, no. 24, p. 5429, Jan. 2020, doi: 10.3390/s19245429.
[17] S. Abu-El-Haija et al., “YouTube-8M: A Large-Scale Video Classification Benchmark,” ArXiv160908675 Cs, Sep. 2016, Accessed: Apr. 28, 2020. [Online]. Available: http://arxiv.org/abs/1609.08675.
[18] Y. Jiang, Z. Wu, J. Wang, X. Xue, and S. Chang, “Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 2, pp. 352–364, Feb. 2018, doi: 10.1109/TPAMI.2017.2670560.
[19] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychol. Rev., vol. 65, no. 6, pp. 386–408, 1958, doi: 10.1037/h0042519.
[20] “Rectifier (neural networks),” Wikipedia. Dec. 04, 2018, Accessed: Jan. 21, 2020. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Rectifier_(neural_networks)&oldid=871884348.
[21] “Introduction to Artificial Neural Networks - Part 1.” http://www.theprojectspot.com/tutorial-post/introduction-to-artificial-neural-networks-part-1/7 (accessed Jan. 21, 2020).
[22] F. M. Soares and A. M. F. Souza, Neural Network Programming with Java. Packt Publishing Ltd, 2017.
[23] “Receptive fields and functional architecture of monkey striate cortex - Hubel - 1968 - The Journal of Physiology - Wiley Online Library.” https://physoc.onlinelibrary.wiley.com/doi/abs/10.1113/jphysiol.1968.sp008455 (accessed Jan. 21, 2020).
[24] Y. LeCun et al., “Backpropagation Applied to Handwritten Zip Code Recognition,” Neural Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989, doi: 10.1162/neco.1989.1.4.541.
[25] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “Flexible, High Performance Convolutional Neural Networks for Image Classification,” in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Volume Two, Barcelona, Catalonia, Spain, 2011, pp. 1237–1242, doi: 10.5591/978-1-57735-516-8/IJCAI11-210.
[26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale Video Classification with Convolutional Neural Networks,” presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732, Accessed: Jan. 21, 2020. [Online]. Available: https://www.cv-foundation.org/openaccess/content_cvpr_2014/html/Karpathy_Large-scale_Video_Classification_2014_CVPR_paper.html.
[27] D. Britz, “Understanding Convolutional Neural Networks for NLP,” WildML, Nov. 07, 2015. http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ (accessed Jan. 21, 2020).
[28] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998, doi: 10.1109/5.726791.
[29] “CS231n Convolutional Neural Networks for Visual Recognition.” http://cs231n.github.io/convolutional-networks/ (accessed Jan. 21, 2020).
[30] M. Ranzato, “Large-Scale Visual Recognition With Deep Learning,” p. 134.
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[32] M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” in Computer Vision – ECCV 2014, 2014, pp. 818–833.
[33] O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015, doi: 10.1007/s11263-015-0816-y.
[34] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778, Accessed: Jan. 21, 2020. [Online]. Available: https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html.
[35] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution,” in Computer Vision – ECCV 2016, 2016, pp. 694–711.
[36] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.
[37] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: continual prediction with LSTM,” pp. 850–855, Jan. 1999, doi: 10.1049/cp:19991218.
[38] C. Metz, “With QuickType, Apple wants to do more than guess your next text. It wants to give you an AI.,” Wired, Jun. 14, 2016.
[39] “A Beginner’s Guide to LSTMs and Recurrent Neural Networks,” Skymind. http://skymind.ai/wiki/lstm (accessed Jan. 21, 2020).
[40] “Nikhil Buduma | A Deep Dive into Recurrent Neural Nets,” The Musings of Nikhil Buduma. http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/ (accessed Jun. 13, 2020).
[41] F. A. Gers and J. Schmidhuber, “Recurrent nets that time and count,” in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, Jul. 2000, vol. 3, pp. 189–194 vol.3, doi: 10.1109/IJCNN.2000.861302.
[42] K. Yao, T. Cohn, K. Vylomova, K. Duh, and C. Dyer, “Depth-Gated LSTM,” ArXiv150803790 Cs, Aug. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1508.03790.
[43] J. Koutník, K. Greff, F. Gomez, and J. Schmidhuber, “A Clockwork RNN,” ArXiv14023511 Cs, Feb. 2014, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1402.3511.
[44] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A Search Space Odyssey,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, Oct. 2017, doi: 10.1109/TNNLS.2016.2582924.
[45] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An Empirical Exploration of Recurrent Network Architectures,” p. 9.
[46] B. Krause, L. Lu, I. Murray, and S. Renals, “Multiplicative LSTM for sequence modelling,” ArXiv160907959 Cs Stat, Oct. 2017, Accessed: Jun. 13, 2020. [Online]. Available: http://arxiv.org/abs/1609.07959.
[47] Y. Wu et al., “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,” ArXiv160908144 Cs, Oct. 2016, Accessed: Jun. 13, 2020. [Online]. Available: http://arxiv.org/abs/1609.08144.
[48] A. Graves, S. Fernández, and J. Schmidhuber, “Multi-dimensional Recurrent Neural Networks,” in Artificial Neural Networks – ICANN 2007, 2007, pp. 549–558.
[49] M. F. Stollenga, W. Byeon, M. Liwicki, and J. Schmidhuber, “Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2998–3006.
[50] N. Kalchbrenner, I. Danihelka, and A. Graves, “Grid Long Short-Term Memory,” ArXiv150701526 Cs, Jul. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1507.01526.
[51] M. Cord and P. Cunningham, Eds., Machine Learning Techniques for Multimedia: Case Studies on Organization and Retrieval. Berlin Heidelberg: Springer-Verlag, 2008.
[52] O. Bousquet, U. von Luxburg, and G. Ratsch, Advanced Lectures On Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tubingen, Germany, August 4-16, 2003, Revised Lectures (Lecture Notes in Computer Science). SpringerVerlag, 2004.
[53] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Mar. 2010, pp. 249–256, Accessed: Jan. 21, 2020. [Online]. Available: http://proceedings.mlr.press/v9/glorot10a.html.
[54] H. Robbins and S. Monro, “A Stochastic Approximation Method,” Ann. Math. Stat., vol. 22, no. 3, pp. 400–407, Sep. 1951, doi: 10.1214/aoms/1177729586.
[55] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2933–2941.
[56] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Comput. Math. Math. Phys., vol. 4, no. 5, pp. 1–17, Jan. 1964, doi: 10.1016/0041-5553(64)90137-5.
[57] S. Ruder, “An overview of gradient descent optimization algorithms,” ArXiv160904747 Cs, Jun. 2017, Accessed: Jun. 13, 2020. [Online]. Available: http://arxiv.org/abs/1609.04747.
[58] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural Netw. Off. J. Int. Neural Netw. Soc., vol. 12, no. 1, pp. 145–151, Jan. 1999.
[59] Y. Nesterov, “A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2),” Dokl. USSR, vol. 269, pp. 543–547, 1983, Accessed: Jan. 21, 2020. [Online]. Available: https://ci.nii.ac.jp/naid/20001173129/.
[60] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, “Advances in optimizing recurrent networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 8624–8628, doi: 10.1109/ICASSP.2013.6639349.
[61] J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” J. Mach. Learn. Res., vol. 12, no. Jul, pp. 2121–2159, 2011, Accessed: Jan. 21, 2020. [Online]. Available: http://www.jmlr.org/papers/v12/duchi11a.html.
[62] J. Dean et al., “Large Scale Distributed Deep Networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1223–1231.
[63] J. Pennington, R. Socher, and C. Manning, “Glove: Global Vectors for Word Representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, Oct. 2014, pp. 1532–1543, Accessed: Jan. 21, 2020. [Online]. Available: http://www.aclweb.org/anthology/D14-1162.
[64] M. D. Zeiler, “ADADELTA: An Adaptive Learning Rate Method,” ArXiv12125701 Cs, Dec. 2012, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1212.5701.
[65] V. Bushaev, “Understanding RMSprop — faster neural network learning,” Towards Data Science, Sep. 02, 2018. https://towardsdatascience.com/understanding-rmsprop-faster-neural-network-learning-62e116fcf29a (accessed Jan. 21, 2020).
[66] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” ArXiv14126980 Cs, Dec. 2014, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1412.6980.
[67] Z. Zhang, L. Ma, Z. Li, and C. Wu, “Normalized Direction-preserving Adam,” ArXiv170904546 Cs Stat, Sep. 2018, Accessed: Jun. 14, 2020. [Online]. Available: http://arxiv.org/abs/1709.04546.
[68] V. Bushaev, “Adam — latest trends in deep learning optimization.,” Medium, Oct. 24, 2018. https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c (accessed Jun. 14, 2020).
[69] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014, Accessed: Jan. 21, 2020. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html.
[14] J. Bayer, C. Osendorfer, D. Korhammer, N. Chen, S. Urban, and P. van der Smagt, “On Fast Dropout and its Applicability to Recurrent Networks,” ArXiv13110701 Cs Stat, Nov. 2013, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1311.0701.
[70] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, “Dropout improves Recurrent Neural Networks for Handwriting Recognition,” ArXiv13124569 Cs, Nov. 2013, Accessed: Jan. 21, 2019. [Online]. Available: http://arxiv.org/abs/1312.4569.
[71] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent Neural Network Regularization,” ArXiv14092329 Cs, Sep. 2014, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1409.2329.
[72] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” ArXiv150203167 Cs, Feb. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1502.03167.
[73] T. Cooijmans, N. Ballas, C. Laurent, Ç. Gülçehre, and A. Courville, “Recurrent Batch Normalization,” ArXiv160309025 Cs, Mar. 2016, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1603.09025.
[74] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” ArXiv13112524 Cs, Nov. 2013, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1311.2524.
[75] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN Features off-the-shelf: an Astounding Baseline for Recognition,” ArXiv14036382 Cs, Mar. 2014, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1403.6382.
[76] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848.
[77] C. Szegedy et al., “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1–9, doi: 10.1109/CVPR.2015.7298594.
[78] S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov, “Exploiting Image-trained CNN Architectures for Unconstrained Video Classification,” ArXiv150304144 Cs, Mar. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1503.04144.
[79] X. Alameda-Pineda et al., “RAVEL: an annotated corpus for training robots with audiovisual abilities,” J. Multimodal User Interfaces, vol. 7, no. 1, pp. 79–91, Mar. 2013, doi: 10.1007/s12193-012-0111-y.
[80] Z. Xu, Y. Yang, and A. G. Hauptmann, “A Discriminative CNN Video Representation for Event Detection,” ArXiv14114006 Cs, Nov. 2014, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1411.4006.
[81] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Jun. 2010, pp. 3304–3311, doi: 10.1109/CVPR.2010.5540039.
[82] Q. Li, Z. Qiu, T. Yao, T. Mei, Y. Rui, and J. Luo, “Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation,” in Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, New York, NY, USA, 2016, pp. 159–166, doi: 10.1145/2911996.2912001.
[83] J. Donahue et al., “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description,” presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634, Accessed: Jan. 21, 2020. [Online]. Available: https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Donahue_Long-Term_Recurrent_Convolutional_2015_CVPR_paper.html.
[84] L. Yao et al., “Describing Videos by Exploiting Temporal Structure,” ArXiv150208029 Cs Stat, Feb. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1502.08029.
[85] A. Graves, A. Mohamed, and G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks,” ArXiv13035778 Cs, Mar. 2013, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1303.5778.
[86] Y. Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding,” ArXiv14085093 Cs, Jun. 2014, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1408.5093.
[87] “[1412.3555] Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.” https://arxiv.org/abs/1412.3555 (accessed Jan. 21, 2020).
[88] N. Léonard, S. Waghmare, Y. Wang, and J.-H. Kim, “rnn : Recurrent Library for Torch,” ArXiv151107889 Cs, Nov. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1511.07889.
[89] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent Models of Visual Attention,” ArXiv14066247 Cs Stat, Jun. 2014, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1406.6247.
[90] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple Object Recognition with Visual Attention,” ArXiv14127755 Cs, Dec. 2014, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1412.7755.
[91] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, “Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification,” ArXiv150401561 Cs, Apr. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1504.01561.
[92] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond Short Snippets: Deep Networks for Video Classification,” ArXiv150308909 Cs, Mar. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1503.08909.
[93] V. Veeriah, N. Zhuang, and G.-J. Qi, “Differential Recurrent Neural Networks for Action Recognition,” ArXiv150406678 Cs, Apr. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1504.06678.
[94] Z. Wu, Y.-G. Jiang, X. Wang, H. Ye, and X. Xue, “Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification,” in Proceedings of the 24th ACM International Conference on Multimedia, New York, NY, USA, 2016, pp. 791–800, doi: 10.1145/2964284.2964328.
[95] S. Ji, W. Xu, M. Yang, and K. Yu, “3D Convolutional Neural Networks for Human Action Recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013, doi: 10.1109/TPAMI.2012.59.
[96] H. Wang and C. Schmid, “Action Recognition with Improved Trajectories,” in 2013 IEEE International Conference on Computer Vision, Dec. 2013, pp. 3551–3558, doi: 10.1109/ICCV.2013.441.
[97] L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi, “Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks,” ArXiv151000562 Cs, Oct. 2015, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1510.00562.
[98] K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks for Action Recognition in Videos,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 568–576.
[99] L. Wang, Y. Qiao, and X. Tang, “Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors,” 2015 IEEE Conf. Comput. Vis. Pattern Recognit. CVPR, pp. 4305–4314, Jun. 2015, doi: 10.1109/CVPR.2015.7299059.
[100] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional Two-Stream Network Fusion for Video Action Recognition,” ArXiv160406573 Cs, Apr. 2016, Accessed: Jan. 21, 2020. [Online]. Available: http://arxiv.org/abs/1604.06573.
[101] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Spatiotemporal Features with 3D Convolutional Networks,” ArXiv14120767 Cs, Dec. 2014, Accessed: Apr. 28, 2020. [Online]. Available: http://arxiv.org/abs/1412.0767.
[102] “Home - Keras Documentation.” https://keras.io/ (accessed Jan. 18, 2020).
[103] “Understanding LSTM Networks -- colah’s blog.” http://colah.github.io/posts/2015-08-Understanding-LSTMs/ (accessed Apr. 28, 2020).
[104] V.-M. Khong and T.-H. Tran, “Improving Human Action Recognition with Two-Stream 3D Convolutional Neural Network,” in 2018 1st International Conference on Multimedia Analysis and Pattern Recognition (MAPR), Apr. 2018, pp. 1–6, doi: 10.1109/MAPR.2018.8337518.
[105] N. L. Hakim, T. K. Shih, S. P. Kasthuri Arachchi, W. Aditya, Y.-C. Chen, and C.-Y. Lin, “Dynamic Hand Gesture Recognition Using 3DCNN and LSTM with FSM Context-Aware Model,” Sensors, vol. 19, no. 24, p. 5429, Jan. 2020, doi: 10.3390/s19245429.
[106] H. Phan et al., “Beyond Equal-Length Snippets: How Long is Sufficient to Recognize an Audio Scene?,” ArXiv181101095 Cs Eess, Nov. 2018, Accessed: Apr. 28, 2020. [Online]. Available: http://arxiv.org/abs/1811.01095.
[107] “Serre Lab » HMDB: a large human motion database.” http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/ (accessed Jan. 17, 2020).
[108] G. Farnebäck, “Two-Frame Motion Estimation Based on Polynomial Expansion,” in Image Analysis, 2003, pp. 363–370.
[109] “OpenCV: Optical Flow.” https://docs.opencv.org/3.4/d7/d8b/tutorial_py_lucas_kanade.html (accessed Jan. 21, 2020).
[110] C. Igel and M. Hüsken, “Improving the Rprop Learning Algorithm,” 2000.
[112] S. P. K. Arachchi, T. K. Shih, C.-Y. Lin, and G. Wijayarathna, “Deep Learning-Based Firework Video Pattern Classification,” J. Internet Technol., vol. 20, no. 7, pp. 2033–2042, Dec. 2020, Accessed: Jan. 17, 2020. [Online]. Available: https://jit.ndhu.edu.tw/article/view/2190.
[113] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Spatiotemporal Residual Networks for Video Action Recognition,” ArXiv161102155 Cs, Nov. 2016, Accessed: Jan. 18, 2020. [Online]. Available: http://arxiv.org/abs/1611.02155.
Advisor: Prof. Timothy K. Shih (施國琛)    Date of Approval: 2020-07-03
