Master's/Doctoral Thesis 111522152: Detailed Record




Author: Wei-Jie Zhan (詹幃傑)    Department: Computer Science and Information Engineering
Thesis Title: Test-Time Domain Adaptation and Scene-Aware 3D Human Pose Reconstruction (測試時領域自適應與場景感知3D人體姿態重建)
Related Theses:
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Front-End Processing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Applications of a High-Quality Spoken Narration System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Accelerated Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Financial Product Trend Prediction
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ Research on End-to-End Speech Synthesis for Mandarin
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Full-text access: never open to the public
Abstract (Chinese): In recent years, deep-learning-based 3D technology has been developing extremely rapidly, with technology expanding from 2D planar domains into 3D space. As 3D research has advanced, many ideas have emerged that exploit capabilities only 3D can provide, building on earlier work to further enhance visual presentation and applications. For example, corresponding 3D models can now be generated quickly from images of people and used to reproduce real human movements and poses, and 3D reconstruction techniques can be used to rebuild the people and objects in an image.
In deep learning, however, AI models usually require large amounts of data to learn, and the quantity and diversity of the datasets strongly affect the models' subsequent performance and applicability, so various methods are often needed to acquire and exploit data. This problem is even more serious in 3D deep learning: unlike 2D images or speech, for which large datasets already exist, 3D data is relatively scarce. Moreover, because 3D space is more complex than 2D, a single 2D image is usually insufficient to recover the actual 3D scene, and the most common difficulty is how to converge the results to an accurate 3D domain.
To address such problems, this thesis constructs an approach that builds corresponding 3D objects from 2D images. It uses multiple AI models for data processing together with a model that has domain-adaptation capability, and finally applies loss functions to further constrain the generated results so that, within a certain range, they are similar or close to their real-life counterparts.
Abstract (English): In recent years, the development of 3D technology based on deep learning has been progressing at an extremely rapid pace, with technology expanding from 2D planar domains to 3D spatial dimensions. As 3D research advances, many ideas have emerged that leverage the unique capabilities of 3D to enhance visual representation and applications. For example, there are now techniques to quickly generate corresponding 3D models from human images, which can be used to realistically depict human movements and poses. Additionally, 3D reconstruction technology can be used to recover the people and objects in images.
However, in the field of deep learning, a significant amount of data is often required for AI models to learn effectively. The quantity and diversity of datasets greatly influence the subsequent performance and application effectiveness of AI models. This issue is particularly severe in the realm of 3D deep learning. Unlike 2D images or audio, where abundant datasets are available, 3D data is often scarce. Due to the higher complexity of 3D spaces compared to 2D, a single 2D image is usually insufficient to accurately reconstruct the actual 3D environment. The most common challenge is how to converge the results to an accurate 3D domain.
To address these issues, this thesis constructs a method to establish corresponding 3D objects from 2D images. It utilizes multiple AI models for data processing and incorporates models with domain adaptation capabilities. Finally, it employs loss functions to further constrain the generated results, ensuring that the generated outputs are similar or approximate to real-life objects within a certain range.
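
Since the full text is never made public, the pipeline details cannot be verified here; the following is only a minimal, hypothetical sketch of the kind of test-time adaptation loop the abstract describes, assuming a PyTorch setting. The names PoseNet, project, and adapt_on_video, the pinhole-projection parameters, and the depth-smoothness term are illustrative assumptions, not the thesis's actual models or loss functions.

# Hypothetical sketch of a test-time adaptation loop for 3D pose
# reconstruction; NOT the thesis's actual implementation.
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Stand-in regressor: per-frame image features -> 3D joint positions."""
    def __init__(self, feat_dim=512, num_joints=24):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, feats):
        return self.head(feats).view(-1, self.num_joints, 3)

def project(joints3d, focal=1000.0, center=(112.0, 112.0)):
    """Simple pinhole projection of 3D joints to 2D pixel coordinates."""
    z = joints3d[..., 2:3].clamp(min=1e-6)          # avoid divide-by-zero
    xy = joints3d[..., :2] / z
    return focal * xy + torch.tensor(center)

def adapt_on_video(model, feats, kp2d, steps=20, lr=1e-5, w_depth=0.1):
    """Test-time adaptation: fine-tune on the target video alone, using a
    2D reprojection loss plus a depth-smoothness prior as the constraint."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        joints3d = model(feats)                      # (T, J, 3)
        loss_2d = ((project(joints3d) - kp2d) ** 2).mean()
        # Scene-aware-style constraint (illustrative): penalize frame-to-
        # frame depth jitter so trajectories stay physically plausible.
        loss_depth = (joints3d[1:, :, 2] - joints3d[:-1, :, 2]).abs().mean()
        loss = loss_2d + w_depth * loss_depth
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Usage on a 100-frame clip with precomputed (stand-in) features:
model = PoseNet()
feats = torch.randn(100, 512)          # stand-in per-frame image features
kp2d = torch.rand(100, 24, 2) * 224    # stand-in 2D keypoint detections
adapt_on_video(model, feats, kp2d)

In this style of adaptation no ground-truth 3D labels are used: the model is fine-tuned on the target video itself, with 2D evidence and scene-level priors standing in for supervision, which mirrors the abstract's combination of domain adaptation and loss-based constraints.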
Keywords: ★ 3D Human Pose Estimation (3D 人體姿態估測)
Table of Contents:
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
  1.1 Research Background and Objectives
  1.2 Research Methods and Chapter Overview
Chapter 2: 3D Environments and Objects
  2.1 3D Object Construction
  2.2 From 3D Objects to Image Output
    2.2.1 3D Modeling
    2.2.2 Materials
    2.2.3 Animation
    2.2.4 Rendering
  2.3 3D Human Body Models
    2.3.1 SMPL-X
    2.3.2 GHUM and GHUML
Chapter 3: Deep Learning
  3.1 Convolutional Neural Networks
    3.1.1 Convolutional Layers
    3.1.2 Pooling Layers
    3.1.3 Fully Connected Layers
  3.2 Residual Networks
  3.3 Domain Adaptation
    3.3.1 Test-Time Domain Adaptation
Chapter 4: Related Work
  4.1 Literature on 3D Reconstruction
    4.1.1 Signed Distance Fields
    4.1.2 Neural Radiance Fields
    4.1.3 3D Gaussian Splatting
  4.2 Literature on 3D Human Pose and SMPL
  4.3 Literature on 3D Trajectory Reconstruction
Chapter 5: Method
  5.1 Data Preprocessing
    5.1.1 AlphaPose
    5.1.2 DPT (Vision Transformers for Dense Prediction)
    5.1.3 Mask2Former
    5.1.4 ROMP
  5.2 Phase 1
    5.2.1 CycleAdapt
    5.2.2 Loss Functions
  5.3 Phase 2
    5.3.1 Loss Functions
Chapter 6: Experiments
  6.1 Experimental Equipment and Environment
  6.2 Datasets
  6.3 Experimental Results
  6.4 Analysis and Discussion of Experimental Results
Chapter 7: Conclusion and Future Work
Chapter 8: References
Advisor: Jia-Ching Wang (王家慶)    Date of Approval: 2024-08-14