Thesis Record 109522027: Detailed Information




Name Yi-Hsin Chen (陳逸星)   Department Computer Science and Information Engineering
Thesis Title 應用生成對抗網路於人體姿態映射與全身風格轉換之演算法
(A Generative Adversarial Network-based Framework for Human Pose Mapping and Full Body Style Transformation)
  1. The electronic full text of this thesis is approved for immediate open access.
  2. The open-access electronic full text is licensed to users solely for personal, non-commercial retrieval, reading, and printing for academic research purposes.
  3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast the content without authorization.

Abstract (Chinese) In the past, changing the pose and motion of a person in an image relied on visual effects artists spending a great deal of time on post-production. Traditional approaches capture the person's motion with a surrounding rig of 3D cameras and build a 3D animated model whose joints correspond to the person's keypoints. As technology has advanced, such images can now be generated with the help of Generative Adversarial Networks (GANs) or other deep neural networks. To capture fine details such as the person's texture during generation, these deep learning methods commonly rely on the person's skeleton, a 3D mesh, semantic segmentation of body parts, or UV coordinates.

This thesis proposes a GAN-based framework that re-renders a person, with their appearance details, into a specified pose. The framework consists of (1) a Pix2pix network that translates a skeleton image into the corresponding UV coordinate image, and (2) a StyleGAN-based network that takes the person's silhouette, the UV coordinate image, and the original image as input and generates the person in the target pose. In our experiments, the skeleton-to-UV generation achieves an SSIM of 0.932 and the pose and style transfer achieves an SSIM of 0.7524, demonstrating that the proposed framework offers a reasonable degree of usability.
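As a rough illustration of the two-stage pipeline described above, the sketch below wires a skeleton-to-IUV generator into a pose-and-style generator. This is a minimal PyTorch sketch under stated assumptions: all class and function names (Pix2pixGenerator, PoseStyleGenerator, rerender) are hypothetical placeholders, the stand-in layers are far smaller than the thesis's actual Pix2pix and StyleGAN-based architectures, and the foreground mask is assumed to be a single-channel tensor.

    # Minimal sketch of the two-stage pipeline (hypothetical names; not the
    # thesis's actual network definitions).
    import torch
    import torch.nn as nn

    class Pix2pixGenerator(nn.Module):
        """Stage 1 stand-in: target skeleton image -> IUV (UV coordinate) image."""
        def __init__(self):
            super().__init__()
            # A real Pix2pix generator is a U-Net-style encoder-decoder; this
            # tiny stack only keeps the interface.
            self.net = nn.Sequential(
                nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

        def forward(self, skeleton):
            return self.net(skeleton)

    class PoseStyleGenerator(nn.Module):
        """Stage 2 stand-in: (foreground mask, IUV, source image) -> target-pose image."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(7, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

        def forward(self, mask, iuv, source):
            # 1 (mask) + 3 (IUV) + 3 (source RGB) = 7 input channels.
            x = torch.cat([mask, iuv, source], dim=1)
            return self.net(x)

    def rerender(skeleton, mask, source, stage1, stage2):
        """Full pipeline: target skeleton -> IUV -> re-rendered person."""
        iuv = stage1(skeleton)
        return stage2(mask, iuv, source)

The point of the sketch is only the data flow: the target skeleton is first mapped to an IUV image, and that IUV image is then combined with the foreground mask and the source appearance to produce the re-rendered person.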
Abstract (English) In the past, pose re-rendering relied on skilled visual effects artists and time-consuming post-production. Traditional methods build 3D camera arrays to capture a person's pose and fit the captured keypoints to an animated model. Nowadays, people use learning-based tools such as Generative Adversarial Networks (GANs) or other neural network frameworks to generate such images. To capture human appearance, these methods tend to use skeletons, meshes, body-part segmentation, or dense UV coordinates to preserve fine appearance details.

In this paper, we present a framework that re-renders a person from a single source image into a specific pose. Our framework (1) uses a Pix2pix network to generate a UV coordinate image from a keypoint skeleton image, and (2) takes the human foreground mask, the UV coordinate image, and the original image as input to a StyleGAN-based network that translates the person from the source to the target image.

According to our experiments, the skeleton-keypoints-to-UV-coordinates model achieves an SSIM of 0.932, and the pose re-rendering model achieves an SSIM of 0.7524. Therefore, our framework has a certain degree of usability.
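The SSIM figures reported above can be computed at evaluation time roughly as follows. This is a minimal sketch using scikit-image's structural_similarity, under the assumption that generated and ground-truth frames are paired 8-bit RGB arrays of identical shape; it is not the thesis's actual evaluation script.

    # Minimal SSIM evaluation sketch (illustrative; assumes paired uint8 RGB frames).
    import numpy as np
    from skimage.metrics import structural_similarity

    def mean_ssim(generated_images, reference_images):
        """Average SSIM over paired 8-bit RGB images of identical shape."""
        scores = []
        for gen, ref in zip(generated_images, reference_images):
            # channel_axis=-1 marks the last axis as the colour channel;
            # data_range=255 matches uint8 inputs.
            scores.append(structural_similarity(gen, ref,
                                                channel_axis=-1,
                                                data_range=255))
        return float(np.mean(scores))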
Keywords (Chinese) ★ Style Transfer (風格轉換)
★ Deep Learning (深度學習)
★ Generative Adversarial Network (生成對抗網路)
★ Image Processing (影像處理)
★ Computer Vision (電腦視覺)
Keywords (English) ★ Style Transfer
★ Deep Learning
★ Generative Adversarial Network
★ Image Processing
★ Computer Vision
Table of Contents
1. Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Thesis Organization
2. Background and Related Work
  2.1 Background
    2.1.1 Human Skeleton Detection
    2.1.2 UV Coordinates and DensePose
    2.1.3 Mask R-CNN
  2.2 Generative Adversarial Networks
    2.2.1 Conditional GAN
    2.2.2 Pix2pix Conditional GAN
    2.2.3 StyleGAN
    2.2.4 PoseWithStyleGAN
  2.3 Related Work
    2.3.1 Studies on Human Pose Transfer
    2.3.2 Studies Related to StyleGAN
3. Methodology
  3.1 Algorithm Pipeline
  3.2 IUV Generation Model
    3.2.1 Data Preprocessing
    3.2.2 Training Procedure and Network Architecture
    3.2.3 Objective Function
  3.3 Pose and Style Transfer Model
    3.3.1 Data Preprocessing
    3.3.2 Training Procedure and Network Architecture
4. Experimental Design and Results
  4.1 IUV Generation Experiments and Evaluation
    4.1.1 Dataset for the IUV Generation Experiments
    4.1.2 IUV Generation Experimental Design
    4.1.3 Results and Analysis for Different Inputs
    4.1.4 Results and Analysis for Different Keypoint Models and Batch Sizes
    4.1.5 Comparison and Performance Analysis of IUV Generation Results
  4.2 Pose and Style Transfer Experiments and Evaluation
    4.2.1 Experiments on StyleGAN Generation Performance
    4.2.2 Pose and Style Transfer Results and Analysis
    4.2.3 Experiments and Analysis on Subject Translation and Turning
    4.2.4 Comparison and Analysis with Related Work
  4.3 Applications and Limitations of the Model
    4.3.1 Clothing Style Transfer
    4.3.2 Limitations of the Model
5. Conclusion
  5.1 Conclusions
  5.2 Future Work
References
Advisor Mu-Chun Su (蘇木春)   Approval Date 2022-08-15
