Pose Transfer with Multi-Scale Features Combined with Latent Diffusion Model and ControlNet


Name

蘇嘉成 (Su Chia Cheng)

Department

Department of Computer Science and Information Engineering

Thesis Title

Pose Transfer with Multi-Scale Features Combined with Latent Diffusion Model and ControlNet
(基於多尺度特徵與控制網路的潛在擴散模型達到姿態轉換任務)


Files

Full text available for browsing in the system (open after 2029-7-2)

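Abstract (translated from Chinese)

In recent years, the outstanding performance of generative artificial intelligence has attracted broad research interest and set off a wave of activity in natural language processing, image, and audio. In image generation in particular, Diffusion Models have achieved notable results in applications such as text-to-image and image-to-image generation. This study therefore proposes a new architecture that lets a Diffusion Model perform well on the pose transfer task, producing accurate pose transfer results from only a reference image and a human skeleton map.
However, conventional Diffusion Models operate at the pixel level to learn image features, which usually requires enormous computational resources. Merely verifying a model's feasibility and testing its performance can take several days, a major obstacle for research groups with limited resources. To remove this bottleneck, this thesis combines the Latent Diffusion Model, ControlNet, and a multi-scale feature extraction module, and adds a semantic extraction filter to the attention layers, so that the model focuses on learning the relationship between the most important image features and the pose while also reducing the computational resources required, allowing it to be trained effectively on an RTX 4090.
Experimental results show that, under constrained hardware cost, the proposed model is competitive with other Diffusion Model-based models: it not only improves pose transfer accuracy significantly but also effectively reduces the time spent on training and image generation.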

Abstract (English)

In recent years, generative AI has become popular in areas such as natural language processing, image, and audio, significantly expanding AI's creative capabilities. In image generation in particular, Diffusion Models have achieved remarkable success across various applications, such as image synthesis and transformation. This study therefore introduces a new framework that enables Diffusion Models to perform effectively in pose transfer tasks, requiring only a reference image and a human skeleton diagram to achieve precise pose transformations.
However, traditional Diffusion Models operate at the pixel level when learning image features, which demands substantial computational resources. For organizations with limited resources, merely validating the feasibility of a model and testing its performance can take days, which is a major challenge. To address this issue, this thesis integrates the Latent Diffusion Model, ControlNet, and a multi-scale feature extraction module, and incorporates a semantic extraction filter into the attention layers. This allows the model to focus on the most important image features and their relationship to the pose while reducing the computational cost, so that the architecture can be trained effectively on an RTX 4090.
Experimental results demonstrate that the proposed method can compete with other Diffusion Model-based approaches under resource constraints, significantly improving pose transfer accuracy and effectively reducing the time required for training and image generation.
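
As a rough illustration of why moving diffusion from pixel space to a latent space can lower the computational cost mentioned in the abstract, the sketch below compares how many values the denoiser must process per image in each case. The 512x512 RGB input, the downsampling factor of 8, and the 4 latent channels are assumptions chosen to resemble a typical latent-diffusion autoencoder; the record does not state the thesis's actual settings, and the function names are hypothetical.

```python
def tensor_elements(height: int, width: int, channels: int) -> int:
    """Number of scalar values in an image-like tensor."""
    return height * width * channels


def latent_shape(height: int, width: int,
                 downsample: int = 8, latent_channels: int = 4):
    """Shape of the latent produced by an autoencoder.

    downsample and latent_channels are assumed values, not taken
    from the thesis.
    """
    return height // downsample, width // downsample, latent_channels


# Assumed input resolution: 512x512 RGB.
pixel_count = tensor_elements(512, 512, 3)

# Latent tensor under the assumed autoencoder settings.
lat_h, lat_w, lat_c = latent_shape(512, 512)
latent_count = tensor_elements(lat_h, lat_w, lat_c)

# How many times fewer values the denoiser handles per step.
reduction = pixel_count / latent_count

print(f"pixel-space values:  {pixel_count}")
print(f"latent-space values: {latent_count}")
print(f"reduction factor:    {reduction:.0f}x")
```

Under these assumed settings the denoising network sees 48 times fewer values per step. Actual savings in training and sampling time also depend on the network architecture and on the cost of the encoder and decoder, so this figure is only an order-of-magnitude intuition.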

Keywords

★ Diffusion Models (擴散模型)
★ Pose Transfer (姿態轉換)
★ OpenPose
★ Image Generation (生成影像)

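Table of Contents

Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Research Motivation
1.2 Research Objectives
2 Related Work
2.1 Pose-Guided Person Image Generation
2.2 Diffusion Generative Models
2.2.1 Latent Diffusion Model
2.3 ControlNet Model
2.4 Diffusion Models for Pose Transfer
3 Methodology
3.1 Model Architecture
3.2 ControlNet Module for Pose Transfer
3.3 Multi-Scale Feature Module
3.4 Semantic Extraction Filter
3.5 Classifier-Free Guidance
4 Experimental Results
4.1 Hardware and Environment Setup
4.2 Dataset
4.3 Implementation Details
4.4 Evaluation Metrics
4.5 Qualitative Experiments
4.6 Quantitative Experiments
4.7 Ablation Studies
4.7.1 Comparison of ControlNet Modules
4.7.2 Comparison of Semantic Extraction Filters
4.8 Person Clothing Editing
4.9 Application to Dance Datasets
4.9.1 Applying the DeepFashion Dataset to Dance Pose Transfer
4.9.2 Pose Adaptability on the Dance Dataset
4.9.3 Analysis of Dance Style Swapping Experiments
5 Conclusion
6 Limitations and Future Work
References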


Advisor

鄭旭詠 (Cheng, Hsu-Yung)

Approval Date

2024-7-18
