3D Hand Mesh Reconstruction from Monocular Image

Name

Jia-Yang Jiang (江嘉揚)

Department

Department of Electrical Engineering

Thesis title

3D Hand Mesh Reconstruction from Monocular Image (單鏡頭圖像的手部網格重建)

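Related theses

★ Low-Memory Hardware Design for Real-Time SIFT Feature Extraction
★ Access Control System with Real-Time Face Detection and Face Recognition
★ Self-Driving Vehicle with Real-Time Automatic Following
★ Lossless Compression Algorithm and Implementation for Multi-Lead ECG Signals
★ Offline Customizable Voice and Speaker Wake-Word System and Its Embedded Implementation
★ Wafer Map Defect Classification and Embedded System Implementation
★ Densely Connected Convolutional Network for Speech Applied to Small-Footprint Keyword Spotting
★ G2LGAN: Data Augmentation of Imbalanced Datasets for Wafer Map Defect Classification
★ Algorithm Design Techniques for Compensating the Finite Precision of Multiplierless Digital Filters
★ Design and Implementation of a Programmable Viterbi Decoder
★ Low-Cost Vector Rotator IP Design Based on Extended Elementary-Angle CORDIC
★ Analysis and Architecture Design of a JPEG2000 Still Image Coding System
★ Low-Power Turbo Code Decoder for Communication Systems
★ Platform-Based Design for Multimedia Communication
★ Design and Implementation of a Digital Watermarking System for MPEG Encoders
★ Algorithm Development for Video Error Concealment and Its Data Reuse Considerations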

Files

Full text can be browsed in the system (open after 2026-3-1)

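Abstract (translated from Chinese)

In recent years, as deep learning has come to affect everyday life more and more, the field has drawn growing attention. Predicting the pose and shape of a human hand from a single-camera RGB image is a long-standing problem in computer vision. Unlike ordinary hand pose estimation, which predicts only the coordinates of the hand's skeletal keypoints, this task must recover the actual surface shape of the hand. It has many applications, such as augmented reality and virtual reality, but it remains very challenging because the hand occupies only a small part of the image, its movements are flexible and varied, and it is easily occluded.
In this thesis we propose a complete end-to-end network architecture that produces a 3D hand mesh from an RGB hand image. Specifically, the encoder uses ResNet-50 to extract image features; to help the later regression of model parameters, several convolutional layers produce 2D feature maps such as 2D heatmaps and mask images. In the parameter regression stage, fully connected layers regress the model parameters iteratively. Because hand models generated by model-based methods have some shortcomings, such as looking unnatural, we add a final stage that corrects the mesh vertex coordinates: the hand model generated by the parametric model (MANO) serves as a coarse initial mesh, features from the earlier network are attached to it, and graph convolutional layers regress an offset for each vertex, which is added to the initial mesh to give the final hand model.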
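The iterative regression of model parameters described above can be illustrated with a minimal sketch. It assumes the common scheme in which a fully connected layer sees the image features concatenated with the current parameter estimate and predicts a correction that is added back. The names `make_linear`, `linear`, `iterative_regression`, the toy `FEATURE_DIM`, the 48 + 10 + 3 split of `PARAM_DIM`, and the fixed random weights are all assumptions for illustration, not the thesis's implementation.

```python
import random

FEATURE_DIM = 8           # toy size (assumption); a ResNet-50 pooled feature has 2048 values
PARAM_DIM = 48 + 10 + 3   # assumed split: MANO pose + MANO shape + camera parameters


def make_linear(in_dim, out_dim, seed=0, scale=0.01):
    """Build a toy fully connected layer (weights, bias) with fixed pseudo-random weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(layer, x):
    """Apply y = W x + b for a layer produced by make_linear."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def iterative_regression(features, init_params, layer, num_iters=3):
    """Refine the parameter estimate by repeatedly adding a predicted correction."""
    params = list(init_params)
    history = [list(params)]
    for _ in range(num_iters):
        # The layer sees the image features together with the current estimate.
        delta = linear(layer, list(features) + params)
        params = [p + d for p, d in zip(params, delta)]
        history.append(list(params))
    return params, history


if __name__ == "__main__":
    layer = make_linear(FEATURE_DIM + PARAM_DIM, PARAM_DIM)
    final_params, steps = iterative_regression(
        [0.5] * FEATURE_DIM, [0.0] * PARAM_DIM, layer)
    print(len(final_params), len(steps))
```

Predicting corrections rather than the parameters in one shot lets each pass start from the previous estimate, which is the usual motivation for this style of regression.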

Abstract (English)

In recent years, as the impact of deep learning on people's lives has grown, more and more attention has been paid to the development of this field. Estimating human hand pose and shape from a single RGB image is a long-standing problem in computer vision. Unlike common hand pose estimation, which predicts only the coordinates of the skeletal keypoints of the hand, this task recovers the actual shape of the hand. It has many applications, such as augmented reality and virtual reality, but it remains very challenging because the hand occupies a relatively small part of the image, and hand movements are flexible and prone to occlusion.
In this paper we propose a complete end-to-end network architecture that obtains a 3D hand mesh from an RGB hand image. Specifically, in the encoder, we use ResNet-50 to extract image features, and to improve the later regression of model parameters, we obtain 2D feature maps, such as 2D heatmaps and mask images, through several convolutional layers. In the model parameter regression part, we use fully connected layers to regress the model parameters iteratively. Because hand models generated by model-based methods have some defects, such as not looking natural enough, we finally add a hand mesh coordinate correction part: we treat the hand model generated by the parametric model (MANO) as a rough initial hand model, attach features from the earlier network, pass the result through graph convolutional network layers to regress the offset of each vertex, and add the offsets to the initial hand model to obtain the final hand model.
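The mesh correction step can be sketched in a few lines under stated assumptions. Each vertex of the coarse mesh carries its 3D coordinates plus some per-vertex features, a graph convolution averages these over the vertex and its mesh neighbours, and a linear map turns the result into a 3D offset that is added to the coarse vertex. The mean-aggregation layer and the names `build_neighbors`, `graph_conv`, and `refine_mesh` are illustrative choices, not the layer used in the thesis; the toy mesh is a tetrahedron rather than the 778-vertex MANO mesh.

```python
def build_neighbors(num_vertices, faces):
    """Collect, for each vertex, itself plus every vertex sharing a triangle edge."""
    neighbors = [{i} for i in range(num_vertices)]  # start with self-loops
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            neighbors[u].add(v)
            neighbors[v].add(u)
    return [sorted(n) for n in neighbors]


def graph_conv(features, neighbors, weights):
    """Mean-aggregate neighbour features, then apply a shared linear map."""
    dim = len(features[0])
    out = []
    for nbrs in neighbors:
        mean = [sum(features[j][k] for j in nbrs) / len(nbrs) for k in range(dim)]
        out.append([sum(w * m for w, m in zip(row, mean)) for row in weights])
    return out


def refine_mesh(coarse_vertices, vertex_features, faces, weights):
    """Regress a per-vertex offset and add it to the coarse mesh."""
    neighbors = build_neighbors(len(coarse_vertices), faces)
    # Node input = vertex position concatenated with its attached features.
    inputs = [list(v) + list(f) for v, f in zip(coarse_vertices, vertex_features)]
    offsets = graph_conv(inputs, neighbors, weights)  # weights has 3 output rows
    refined = [[c + o for c, o in zip(v, off)]
               for v, off in zip(coarse_vertices, offsets)]
    return refined, offsets


if __name__ == "__main__":
    verts = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
    feats = [[1.0]] * 4                      # one hypothetical feature per vertex
    weights = [[0.0, 0.0, 0.0, 0.5],         # x offset driven by the feature only
               [0.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.0]]
    refined, offsets = refine_mesh(verts, feats, faces, weights)
    print(refined)
```

Regressing offsets rather than absolute coordinates keeps the parametric mesh as a strong prior: with all-zero weights the output is exactly the coarse mesh.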

Keywords (Chinese)

★ Convolutional neural network
★ Hand mesh reconstruction

Keywords (English)

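Table of contents

Abstract (Chinese)
Abstract
1. Introduction
1.1. Research Background and Motivation
1.2. Research Direction and Contributions
1.3. Thesis Organization
2. Literature Review
2.1. 3D Hand Pose Estimation
2.2. Depth-Based 3D Hand Pose and Shape Estimation
2.3. Monocular-Based 3D Hand Pose and Shape Estimation
3. Network Model Design and Experiments
3.1. Hand Model
3.2. Neural Network Architecture
3.3. Extract Features Part
3.4. MANO Part
3.5. Adjust Part
3.6. Combined Loss Function
4. Experimental Results and Discussion
4.1. FreiHAND Dataset
4.2. Training and Implementation Details
4.3. Model Comparison
4.4. Ablation Study
5. Conclusion
References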

Advisor

Tsung-Han Tsai (蔡宗漢)

Approval date

2023-3-15
