dc.description.abstract | In recent years, generative AI has become popular in areas such as natural language processing, image generation, and audio synthesis, significantly expanding AI's creative capabilities. In the realm of image generation in particular, Diffusion Models have achieved remarkable success across a variety of applications, such as image synthesis and image transformation. This study therefore introduces a new framework that enables Diffusion Models to perform pose transfer effectively, requiring only a reference image and a human skeleton diagram to achieve precise pose transformations.
However, traditional Diffusion Models learn image features at the pixel level, which inevitably demands substantial computational resources. For organizations with limited resources, merely validating a model's feasibility and testing its performance can take days, which poses a major challenge. To address this issue, this paper integrates a Latent Diffusion Model, ControlNet, and a multi-scale feature extraction module, and incorporates a semantic extraction filter into the attention layers. This allows the model to focus on important image features and the relationships between poses, and the resulting architecture can be trained effectively on an NVIDIA RTX 4090 GPU.
Experimental results demonstrate that, under resource constraints, our proposed method remains competitive with other Diffusion Model-based approaches, significantly improving pose transfer accuracy while effectively reducing the time required for both training and image generation. | en_US |