Human Pose Transfer Based on Dual Attention Mechanisms and a Face Enhancement Mechanism

Name

Chun-Chen Chiang (江俊辰)

Department

Department of Computer Science and Information Engineering

Thesis Title

Enhancing Human Pose Transfer with Attention Mechanisms, Convolutional Block Attention Module and Facial Loss Optimization
(Chinese title: 基於雙重注意力機制與人臉強化機制之人體姿態遷移)

Files

View the thesis in the system (open after 2029-7-2)

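Abstract (Chinese)

Many techniques for synthesizing objects and scenes apply to computer graphics, image reconstruction, photography, and visual data generation. View synthesis often faces challenges such as occlusion, illumination changes, and geometric distortion. These problems are especially pronounced for deformable subjects such as humans, and they significantly increase the complexity of view synthesis.
In modern society, sports and dance not only improve physical health and quality of life but also offer a way to express personal charm and artistry. For non-professionals, improving skills efficiently in their spare time is a major challenge. Pose transfer, a deep learning technique that re-renders a person in the pose of a provided reference motion, offers an innovative solution. It lets teachers and students compare movements intuitively, so that students can learn and correct their movements effectively even without an instructor present. This thesis presents a pose transfer system: the user provides a reference image and selects one of the poses offered by the system, the system automatically generates images of the person performing the corresponding motion, and the user can download the result as a video to a local device.
Architecturally, we build on the Multi-scale Attention Guided Pose Transfer (MAGPT) model. We modify its Residual Block by adding a Convolutional Block Attention Module (CBAM) and replacing the ReLU activation function with Mish, so that the model captures more features related to clothing and skin color. In addition, faces generated by the original architecture differ from those in the source image; to address this, we propose two facial-feature loss functions, each of which helps the model learn more precise image features. Finally, with this architecture, a single reference image is enough for a user to generate videos of different motions.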

Abstract (English)

In the field of object and scene synthesis, many related techniques apply to computer graphics, image reconstruction, photography, and the generation of visual data. View synthesis often encounters challenges such as occlusion, lighting changes, and geometric distortion. These issues are particularly pronounced for deformable objects such as humans, and they significantly increase the complexity of view synthesis.
In modern society, sports and dance not only contribute to physical health and quality of life but also serve as avenues for personal charm and artistic expression. For non-professionals, efficiently improving skills during leisure time poses a significant challenge. Pose transfer, a deep learning technique that renders one individual in the pose of a provided reference movement, offers an innovative solution. It enables coaches and students to compare movement differences intuitively, so that actions can be learned and corrected effectively even without a coach present. This thesis presents a pose transfer system that automatically generates images of the corresponding actions from a reference image provided by the user and a pose selected from those offered by the system, and that lets users download the result as a video to their local device.
In terms of architecture, our model is based on the Multi-scale Attention Guided Pose Transfer (MAGPT) model. We modify its Residual Block by integrating the Convolutional Block Attention Module (CBAM) and changing the activation function from ReLU to Mish in order to capture more features related to clothing and skin color. Additionally, because the facial features in images generated by the original architecture differ from those in the source image, we propose two facial-feature loss functions that help the model learn more precise image features. Ultimately, with our system's architecture, a single reference image is enough for users to generate videos of different actions.
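
As a rough illustration of the two building blocks named in the abstract, the following standard-library Python sketch implements the Mish activation and the channel-attention half of CBAM as they are commonly defined in the literature. It is not the thesis implementation: the function names, the nested-list feature layout `[C][H][W]`, and the weight matrices `w1` and `w2` are assumptions made for illustration, and CBAM's spatial-attention stage and the MAGPT Residual Block itself are omitted.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def relu(x):
    """ReLU activation, shown for comparison with Mish."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_attention(feature, w1, w2):
    """CBAM-style channel attention on a feature map of shape [C][H][W].

    w1 (C/r x C) and w2 (C x C/r) are the hypothetical weights of the
    shared two-layer MLP. Returns the per-channel weights and the
    reweighted feature map.
    """
    flat = [[v for row in ch for v in row] for ch in feature]
    avg_pool = [sum(ch) / len(ch) for ch in flat]  # global average pooling
    max_pool = [max(ch) for ch in flat]            # global max pooling

    def mlp(vec):
        hidden = [relu(h) for h in matvec(w1, vec)]
        return matvec(w2, hidden)

    # Both pooled descriptors pass through the same MLP and are summed.
    scores = [a + m for a, m in zip(mlp(avg_pool), mlp(max_pool))]
    weights = [sigmoid(s) for s in scores]
    scaled = [[[weights[c] * v for v in row] for row in ch]
              for c, ch in enumerate(feature)]
    return weights, scaled


# Tiny 2-channel, 2x2 feature map with made-up MLP weights (one hidden unit).
feature = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.0], [0.5, 2.0]]]
weights, scaled = channel_attention(feature, w1=[[0.5, -0.5]], w2=[[1.0], [-1.0]])
print("channel weights:", weights)
print("mish(-1.0) =", mish(-1.0), "relu(-1.0) =", relu(-1.0))
```

Unlike ReLU, Mish lets small negative values through instead of clamping them to zero, which is one commonly cited reason for trying it as a replacement. With all-zero MLP weights every channel receives a weight of 0.5, which makes a convenient sanity check for the attention function.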

Keywords (Chinese)

★ Pose Transfer (姿態轉換)
★ Generative Adversarial Network (生成對抗網路)

Keywords (English)

★ Pose Transfer
★ Generative Adversarial Network

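Table of Contents

Abstract (Chinese)
Abstract (English)
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Related Literature
1.3 System Architecture
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 DeepFashion Dataset
2.2 VGG-19 Network Model
2.3 SqueezeNet Network Model
2.4 Generative Adversarial Networks
2.5 ADGAN
2.6 OpenPose
2.7 PatchGAN Discriminator
Chapter 3 Methodology
3.1 Dataset
3.2 Model Architecture
3.2.1 Generator Architecture
3.2.2 Discriminator Architecture
3.3 Loss Functions
3.3.1 Generator Loss Functions
3.3.2 Discriminator Loss Functions
Chapter 4 Experimental Results
4.1 Hardware and Environment Setup
4.2 Dataset
4.3 Evaluation Metrics
4.3.1 Structural Similarity Index (SSIM)
4.3.2 Inception Score (IS)
4.3.3 Fréchet Inception Distance (FID)
4.3.4 SSD: Single Shot MultiBox Detector Score (DS)
4.3.5 PCKh
4.3.6 Learned Perceptual Image Patch Similarity (LPIPS)
4.4 Comparative Results of the Full Model
4.4.1 Qualitative Results
4.4.2 Quantitative Results
4.4.3 Analysis of Results
4.5 Ablation Experiments
4.5.1 Effect of Adding CBAM
4.5.2 Effect of Changing the Activation Function to Mish
4.5.3 Effect of Adding Head Region Loss
4.5.4 Effect of Adding Face Focused Loss
4.6 Applications
4.6.1 Application Dataset
4.6.2 Transfer Results on the Custom Action Dataset
Chapter 5 Conclusions and Future Work
References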

Advisor

Hsu-Yung Cheng (鄭旭詠)

Approval Date

2024-7-11
