References
[1] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," Advances in Neural Information Processing Systems, vol. 27, 2014.
[2] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, "A closer look at spatiotemporal convolutions for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6450-6459.
[3] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299-6308.
[4] C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[5] C. Feichtenhofer, H. Fan, J. Malik, and K. He, "SlowFast networks for video recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6202-6211.
[6] A. Vaswani et al., "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[7] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, "ViViT: A video vision transformer," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836-6846.
[8] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, "A ConvNet for the 2020s," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11976-11986.
[9] I. O. Tolstikhin et al., "MLP-Mixer: An all-MLP architecture for vision," Advances in Neural Information Processing Systems, vol. 34, pp. 24261-24272, 2021.
[10] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[11] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, pp. 211-252, 2015.
[12] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
[13] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in 2011 International Conference on Computer Vision, 2011: IEEE, pp. 2556-2563.
[14] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, "ActivityNet: A large-scale video benchmark for human activity understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 961-970.
[15] W. Kay et al., "The Kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[16] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489-4497.
[17] J. Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625-2634.
[18] J. Bridle, "Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters," Advances in Neural Information Processing Systems, vol. 2, 1989.
[19] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[22] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[25] Z. Liu et al., "Swin Transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012-10022.
[26] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in International Conference on Machine Learning, 2021: PMLR, pp. 10347-10357.
[27] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
[28] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.
[29] D. Hendrycks and K. Gimpel, "Gaussian error linear units (GELUs)," arXiv preprint arXiv:1606.08415, 2016.
[30] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015: PMLR, pp. 448-456.
[31] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[32] G. Bertasius, H. Wang, and L. Torresani, "Is space-time attention all you need for video understanding?," in International Conference on Machine Learning (ICML), 2021.
[33] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature verification using a 'Siamese' time delay neural network," Advances in Neural Information Processing Systems, vol. 6, 1993.
[34] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, "Designing network design spaces," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10428-10436.
[35] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning, 2019: PMLR, pp. 6105-6114.
[36] Z. Liu et al., "Swin Transformer V2: Scaling up capacity and resolution," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12009-12019.
[37] T. Elsken, J. H. Metzen, and F. Hutter, "Neural architecture search: A survey," The Journal of Machine Learning Research, vol. 20, no. 1, pp. 1997-2017, 2019.
[38] K. Hara, H. Kataoka, and Y. Satoh, "Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6546-6555.
[39] J. Lin, C. Gan, and S. Han, "TSM: Temporal shift module for efficient video understanding," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7083-7093.
[40] D. Tran, H. Wang, L. Torresani, and M. Feiszli, "Video classification with channel-separated convolutional networks," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5552-5561.