Graduate Thesis 109522086: Detailed Record




Name: Yuan Cheng (鄭媛)    Department: Computer Science and Information Engineering
Thesis Title: Action Recognition on High-Level Features with a Two-Stage Self-Attention Mechanism
(English title: Lightweight Informative Feature with Residual Decoupling Self-Attention for Action Recognition)
Related Theses
★ Face Replacement System for Designated Targets in Video
★ A Single-Camera Single-Finger Virtual Keyboard
★ Vision-Based Recognition of Handwritten Zhuyin Symbol Combinations
★ Vehicle Detection in Aerial Images Using Dynamic Bayesian Networks
★ Video-Based Handwritten Signature Verification
★ Moving Skin-Color Region Detection Using Gaussian Mixture Models of Skin Color and Shadow Probability
★ Crowd Segmentation with Confidence Levels in Images
★ Region Segmentation and Classification of Aerial Surveillance Images
★ Comparative Analysis of Different Features and Regression Methods for Crowd Counting
★ Vision-Based Robust Multi-Fingertip Detection and Human-Computer Interface Applications
★ Traffic Flow Estimation from Nighttime Video Captured Through Raindrop-Contaminated Lenses
★ Image Feature Point Matching for Landmark Image Retrieval
★ Automatic Region-of-Interest Segmentation and Trajectory Analysis in Long-Range Traffic Images
★ Short-Term Solar Irradiance Forecasting Based on Regression Models Using All-Sky Image Features and Historical Information
★ Analysis of the Performance of Different Classifiers for Cloud Detection Application
★ Cloud Tracking and Solar Occlusion Prediction from All-Sky Images
Files: Full text available via the system after 2027-7-20.
Abstract (Chinese, translated): Action recognition is one of the more fundamental areas of computer vision research. Because it underpins a wide range of applications, it remains a technology that demands continual improvement. As deep learning has advanced, many image recognition methods have been updated and refined, and these techniques can be applied to action recognition to improve its accuracy and robustness. This thesis therefore applies several recently proposed methods to an existing base model and modifies its architecture in order to optimize that model.
This thesis adopts the SlowFast network proposed by Facebook AI Research as the base model to be modified. Drawing on the way the Actor-Context-Actor Relation Network (ACAR-Net) processes high-level features, it proposes the Informative Feature with Residual Self-Attention (IFRSA) module. It then replaces part of the module's convolution layers with the separable convolutions introduced in MobileNet, deriving a lightweight version, Lightweight IFRSA (LIFRSA), and further replaces the self-attention in IFRSA with a two-stage (decoupling) self-attention, yielding the Lightweight Informative Feature with Residual Decoupling Self-Attention (LIFRDeSA).
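For context, the sketch below illustrates the depthwise separable convolution idea from MobileNet that the abstract refers to: a per-channel depthwise convolution followed by a 1x1 pointwise convolution. This is a generic PyTorch illustration, not the thesis's actual code; the class name and all channel sizes are illustrative assumptions.

import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Illustrative depthwise separable convolution (not the thesis code)."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# A standard 3x3 conv needs in_ch*out_ch*3*3 weights; the separable
# version needs only in_ch*3*3 + in_ch*out_ch, hence the lightweight gain.
x = torch.randn(1, 64, 32, 32)
print(SeparableConv2d(64, 128)(x).shape)  # torch.Size([1, 128, 32, 32])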
According to the experimental results, the proposed method not only improves the accuracy of the base model but also takes the model's computational cost into account, yielding an architecture that is both lightweight and more accurate.
Abstract (English): Action recognition aims to detect and classify the actions of one or more people in a video. Because it connects to many different fields and supports numerous applications, the accuracy of this basic task matters for all of the related research. In this paper we therefore focus on improving the accuracy of a previous model while also working to reduce the computational cost of our additions.
Our base model is the SlowFast network, a former state-of-the-art model. We draw on the high-level feature extraction concept of the Actor-Context-Actor Relation Network (ACAR-Net) and propose the Informative Feature with Residual Self-Attention (IFRSA) module. Because its computational cost is substantial, we first replace some of the module's convolutions with the separable convolution introduced in MobileNet. Second, we replace the self-attention layer with a decoupling self-attention, resulting in the Lightweight Informative Feature with Residual Decoupling Self-Attention (LIFRDeSA).
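As a rough illustration of the residual self-attention that IFRSA builds on, the sketch below applies standard scaled dot-product self-attention with a residual connection over a set of feature tokens. The thesis's specific two-stage decoupling is not detailed in this record, so this shows only the generic form; the class name and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualSelfAttention(nn.Module):
    """Generic residual self-attention block (not the thesis's decoupled variant)."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. flattened actor/context features.
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        # The residual connection keeps the original features, so attention
        # only needs to learn an additive refinement of them.
        return self.norm(x + attn @ v)

feats = torch.randn(2, 49, 256)  # e.g. a 7x7 feature map flattened to tokens
print(ResidualSelfAttention(256)(feats).shape)  # torch.Size([2, 49, 256])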
Experiments on the AVA dataset show that the LIFRDeSA module improves the accuracy of the baseline while keeping the computational cost in check: the proposed model is more accurate than the baseline, and the added components are very lightweight.
Keywords (Chinese) ★ action recognition (動作識別)
★ self-attention mechanism (自注意力機制)
★ lightweight (輕量化)
Keywords (English) ★ action recognition
★ self-attention
★ lightweight
Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Background and Motivation
1.2 Thesis Organization
Chapter 2: Literature Review
2.1 Datasets
2.1.1 AVA Dataset
2.1.2 Kinetics-400
2.2 Deep-Learning-Based Action Recognition
2.2.1 ConvNet+LSTM
2.2.2 Convolutional 3D (C3D)
2.2.3 Inflated 3D ConvNet (I3D)
2.2.4 SlowFast
2.3 Relational Reasoning
2.3.1 Actor-Centric Relation Network
2.3.2 Actor-Context-Actor Relation Network
2.4 Separable Convolution
Chapter 3: Methodology
3.1 Actor Features and Global Features
3.2 Informative Feature with Residual Self-Attention Module (IFRSA)
3.3 Lightweight Informative Feature with Residual Self-Attention Module (LIFRSA)
3.4 Lightweight Informative Feature with Residual Decoupling Self-Attention (LIFRDeSA)
Chapter 4: Experimental Results
4.1 Hardware and Environment
4.2 Dataset
4.3 Implementation Details
4.3.1 SlowFast
4.3.2 Relational Reasoning Model and IFRSA Implementation
4.4 Ablation Experiments
4.4.1 Comparison of Fusing Different Low-Level Features
4.4.2 IFRSA vs. Lightweight IFRSA (LIFRSA)
4.4.3 LIFRSA vs. LIFRDeSA
4.4.4 Different Sizes of Self-Attention Feature Maps
4.4.5 Position Encoding
4.5 Comparison with the State of the Art
4.6 Qualitative Results
Chapter 5: Conclusion
References
References
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 6202–6211, 2019.
Junting Pan, Siyu Chen, Mike Zheng Shou, Yu Liu, Jing Shao, and Hongsheng Li. Actor-context-actor relation network for spatio-temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6047–6056, 2018.
Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 25, 2012.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Jia Deng, et al. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
Gao Huang, et al. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pages 7083–7093, 2019.
Christian Szegedy, et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1933–1941, 2016.
Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision, pages 399–417, 2018.
Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, and Cordelia Schmid. Actor-centric relation network. In Proceedings of the European Conference on Computer Vision, pages 318–334, 2018.
Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, and Cordelia Schmid. Relational action forecasting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 273–283, 2019.
Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 284–293, 2019.
Jiajun Tang, Jin Xia, Xinzhi Mu, Bo Pang, and Cewu Lu. Asynchronous interaction aggregation for action detection. arXiv preprint arXiv:2004.07485, 2020.
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Abien Fred Agarap. Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375, 2018.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, volume 28, 2015.
Advisor: Hsu-Yung Cheng (鄭旭詠)    Date of Approval: 2022-7-26
