Left: RGB display; Right: Depth display
Text-guided semantic manipulation refers to semantically editing an image generated from a source prompt so that it matches a target prompt, realizing the desired semantic changes (e.g., addition, removal, and style transfer) while preserving irrelevant content. With the powerful generative capabilities of diffusion models, this task has shown great potential for producing high-fidelity visual content. Nevertheless, existing methods typically require time-consuming fine-tuning (inefficient), fail to accomplish multiple semantic manipulations (poorly extensible), and/or lack support for tasks in different modalities (limited generalizability). Upon further investigation, we find that the geometric properties of the noises in the diffusion model are strongly correlated with semantic changes. Motivated by this, we propose GTF, a novel Generalized and Training-Free approach to text-guided semantic manipulation with the following attractive capabilities: 1) Generalized: GTF supports multiple semantic manipulations (e.g., addition, removal, and style transfer) and can be seamlessly integrated into any diffusion-based method (i.e., plug-and-play) across different modalities (i.e., modality-agnostic); and 2) Training-free: GTF produces high-fidelity results by simply controlling the geometric relationship between noises, without tuning or optimization. Our extensive experiments demonstrate the efficacy of our approach, highlighting its potential to advance the state of the art in semantic manipulation.
Given a pair of source and target prompts, we aim to perform semantic addition or removal by combining their corresponding noise predictions. Specifically, at each diffusion step, we first predict the unconditional noise, followed by two conditional noise predictions based on the source and target prompts, respectively. Depending on the manipulation type (addition or removal), our algorithm combines these noise predictions into the final guidance noise for the current step, as sketched below. This guidance is applied iteratively at every denoising step, progressively steering the generation toward the edited result.
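To make the per-step combination concrete, here is a minimal PyTorch-style sketch of one way to fuse the three noise predictions. The additive/subtractive fusion and the edit_scale parameter are illustrative assumptions, not the exact geometric combination rule defined in our paper.

def combine_noise(eps_uncond, eps_src, eps_tgt, mode="addition",
                  guidance_scale=7.5, edit_scale=1.0):
    # Inputs are noise-prediction tensors of identical shape from the same step.
    # Illustrative sketch only: the fusion below and `edit_scale` are assumptions,
    # not the exact GTF combination rule.
    # Standard classifier-free guidance toward the source prompt.
    eps = eps_uncond + guidance_scale * (eps_src - eps_uncond)
    # Direction from the source semantics toward the target semantics.
    edit_direction = eps_tgt - eps_src
    if mode == "addition":
        eps = eps + edit_scale * edit_direction   # inject target semantics
    elif mode == "removal":
        eps = eps - edit_scale * edit_direction   # suppress target semantics
    else:
        raise ValueError(f"unknown manipulation mode: {mode}")
    return eps

The returned guidance noise then replaces the usual classifier-free-guidance prediction in the scheduler update at each denoising step.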
We combine GTF with Stable Diffusion (SD) [1] and compare it with Prompt-to-Prompt (P2P) [2], MasaCtrl [3], MDP [4], Contrastive Guidance (CG) [5], and LEDITS++ [6]; the qualitative results are shown here*. From top to bottom: semantic addition, semantic removal, and style transfer.
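As an illustration of the plug-and-play integration, the sketch below drops the combine_noise helper above into a plain Stable Diffusion denoising loop built with Hugging Face diffusers. The prompts, scales, and 50-step schedule are placeholders, and inversion of a real source image is omitted for brevity.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def encode(prompt):
    # Encode a text prompt with the pipeline's CLIP text encoder.
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids.to("cuda")
    return pipe.text_encoder(ids)[0]

# Placeholder prompts for a semantic-addition edit.
emb_uncond = encode("")
emb_src = encode("a photo of a cat")
emb_tgt = encode("a photo of a cat wearing a hat")

pipe.scheduler.set_timesteps(50, device="cuda")
latents = torch.randn(1, pipe.unet.config.in_channels, 64, 64,
                      device="cuda", dtype=torch.float16)
latents = latents * pipe.scheduler.init_noise_sigma

for t in pipe.scheduler.timesteps:
    lat_in = pipe.scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        eps_uncond = pipe.unet(lat_in, t, encoder_hidden_states=emb_uncond).sample
        eps_src = pipe.unet(lat_in, t, encoder_hidden_states=emb_src).sample
        eps_tgt = pipe.unet(lat_in, t, encoder_hidden_states=emb_tgt).sample
    eps = combine_noise(eps_uncond, eps_src, eps_tgt, mode="addition")
    latents = pipe.scheduler.step(eps, t, latents).prev_sample

# Decode the final latents to an image.
with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample

Because only the per-step noise prediction is modified, the same pattern carries over to other diffusion-based pipelines without any tuning.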
We combine GTF with AnimateDiff [8] and compare it with FateZero [9], FLATTEN [10], Rerender-A-Video [11], TokenFlow [12], Vid2Vid-Zero [13], and VidToMe [14]; the qualitative results are shown here. From top to bottom: semantic addition, semantic removal, and style transfer.
@article{hong2025towards,
  title={Towards Generalized and Training-Free Text-Guided Semantic Manipulation},
  author={Hong, Yu and Cai, Xiao and Zeng, Pengpeng and Zhang, Shuai and Song, Jingkuan and Gao, Lianli and Shen, Heng Tao},
  journal={arXiv preprint arXiv:2504.17269},
  year={2025}
}
* We do not include a visual comparison with SEGA [7] here. Since SEGA does not perform inversion in its pipeline, it is difficult to synthesize the exact source images when the initial random seeds are unknown. To mitigate this, in the quantitative comparisons presented in our paper, we generate a dedicated source image for SEGA for each pair of prompts and compute its metrics on that image.
[1] Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[2] Hertz, Amir, et al. "Prompt-to-prompt image editing with cross attention control." arXiv preprint arXiv:2208.01626 (2022).
[3] Cao, Mingdeng, et al. "MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[4] Wang, Qian, et al. "MDP: A generalized framework for text-guided image editing by manipulating the diffusion path." arXiv preprint arXiv:2303.16765 (2023).
[5] Wu, Chen, and Fernando De la Torre. "Contrastive prompts improve disentanglement in text-to-image diffusion models." arXiv preprint arXiv:2402.13490 (2024).
[6] Brack, Manuel, et al. "LEDITS++: Limitless image editing using text-to-image models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[7] Brack, Manuel, et al. "SEGA: Instructing text-to-image models using semantic guidance." Advances in Neural Information Processing Systems 36 (2023): 25365-25389.
[8] Guo, Yuwei, et al. "AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning." arXiv preprint arXiv:2307.04725 (2023).
[9] Qi, Chenyang, et al. "FateZero: Fusing attentions for zero-shot text-based video editing." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[10] Cong, Yuren, et al. "FLATTEN: Optical flow-guided attention for consistent text-to-video editing." arXiv preprint arXiv:2310.05922 (2023).
[11] Yang, Shuai, et al. "Rerender a video: Zero-shot text-guided video-to-video translation." SIGGRAPH Asia 2023 Conference Papers. 2023.
[12] Geyer, Michal, et al. "TokenFlow: Consistent diffusion features for consistent video editing." arXiv preprint arXiv:2307.10373 (2023).
[13] Wang, Wen, et al. "Zero-shot video editing using off-the-shelf image diffusion models." arXiv preprint arXiv:2303.17599 (2023).
[14] Li, Xirui, et al. "VidToMe: Video token merging for zero-shot video editing." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.