Left: RGB display; Right: Depth display
Text-guided semantic manipulation refers to semantically editing an image generated from a source prompt so that it matches a target prompt, realizing the desired semantic changes (e.g., addition, removal, and style transfer) while preserving irrelevant content. With the powerful generative capabilities of diffusion models, this task has shown great potential for producing high-fidelity visual content. Nevertheless, existing methods typically require time-consuming fine-tuning (inefficient), fail to accomplish multiple semantic manipulations (poorly extensible), and/or lack support for tasks in different modalities (limited generalizability). Upon further investigation, we find that the geometric properties of the noises in the diffusion model are strongly correlated with semantic changes. Motivated by this, we propose GTF, a novel Generalized and Training-Free approach to text-guided semantic manipulation with the following attractive capabilities: 1) Generalized: GTF supports multiple semantic manipulations (e.g., addition, removal, and style transfer) and can be seamlessly integrated into any diffusion-based method (i.e., plug-and-play) across different modalities (i.e., modality-agnostic); and 2) Training-free: GTF produces high-fidelity results by simply controlling the geometric relationship between noises, without tuning or optimization. Our extensive experiments demonstrate the efficacy of our approach, highlighting its potential to advance the state of the art in semantic manipulation.
Given a pair of source and target prompts, we aim to perform semantic addition or removal by combining their corresponding noise predictions. Specifically, at each diffusion step, we first predict the unconditional noise, followed by two conditional noise predictions based on the source and target prompts, respectively. Depending on the manipulation type (addition or removal), our algorithm combines these noise predictions into the final guidance noise for the current step, as sketched below. This guidance is applied iteratively at every denoising step, progressively steering the generation toward the edited result.
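To make the per-step combination concrete, here is a minimal PyTorch-style sketch of one way to fuse the three noise predictions. The additive/subtractive fusion and the edit_scale parameter are illustrative assumptions, not the exact geometric combination rule defined in our paper.

def combine_noise(eps_uncond, eps_src, eps_tgt, mode="addition",
                  guidance_scale=7.5, edit_scale=1.0):
    # Inputs are noise-prediction tensors of identical shape from the same step.
    # Illustrative sketch only: the fusion below and `edit_scale` are assumptions,
    # not the exact GTF combination rule.
    # Standard classifier-free guidance toward the source prompt.
    eps = eps_uncond + guidance_scale * (eps_src - eps_uncond)
    # Direction from the source semantics toward the target semantics.
    edit_direction = eps_tgt - eps_src
    if mode == "addition":
        eps = eps + edit_scale * edit_direction   # inject target semantics
    elif mode == "removal":
        eps = eps - edit_scale * edit_direction   # suppress target semantics
    else:
        raise ValueError(f"unknown manipulation mode: {mode}")
    return eps

The returned guidance noise then replaces the usual classifier-free-guidance prediction in the scheduler update at each denoising step.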
We combine GTF with Stable Diffusion (SD) [1] and compare it with Prompt-to-Prompt (P2P) [2], MasaCtrl [3], MDP [4], Contrastive Guidance (CG) [5], and LEDITS++ [6]; the qualitative results are shown here*. From top to bottom: semantic addition, semantic removal, and style transfer.
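As an illustration of the plug-and-play integration, the sketch below drops the combine_noise helper above into a plain Stable Diffusion denoising loop built with Hugging Face diffusers. The prompts, scales, and 50-step schedule are placeholders, and inversion of a real source image is omitted for brevity.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def encode(prompt):
    # Encode a text prompt with the pipeline's CLIP text encoder.
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids.to("cuda")
    return pipe.text_encoder(ids)[0]

# Placeholder prompts for a semantic-addition edit.
emb_uncond = encode("")
emb_src = encode("a photo of a cat")
emb_tgt = encode("a photo of a cat wearing a hat")

pipe.scheduler.set_timesteps(50, device="cuda")
latents = torch.randn(1, pipe.unet.config.in_channels, 64, 64,
                      device="cuda", dtype=torch.float16)
latents = latents * pipe.scheduler.init_noise_sigma

for t in pipe.scheduler.timesteps:
    lat_in = pipe.scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        eps_uncond = pipe.unet(lat_in, t, encoder_hidden_states=emb_uncond).sample
        eps_src = pipe.unet(lat_in, t, encoder_hidden_states=emb_src).sample
        eps_tgt = pipe.unet(lat_in, t, encoder_hidden_states=emb_tgt).sample
    eps = combine_noise(eps_uncond, eps_src, eps_tgt, mode="addition")
    latents = pipe.scheduler.step(eps, t, latents).prev_sample

# Decode the final latents to an image.
with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample

Because only the per-step noise prediction is modified, the same pattern carries over to other diffusion-based pipelines without any tuning.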
We combine GTF with AnimateDiff [8] and compare it with FateZero [9], FLATTEN [10], Rerender-A-Video [11], TokenFlow [12], Vid2Vid-Zero [13], and VidToMe [14]; the qualitative results are shown here. From top to bottom: semantic addition, semantic removal, and style transfer.
@article{hong2025towards,
  title={Towards Generalized and Training-Free Text-Guided Semantic Manipulation},
  author={Hong, Yu and Cai, Xiao and Zeng, Pengpeng and Zhang, Shuai and Song, Jingkuan and Gao, Lianli and Shen, Heng Tao},
  journal={arXiv preprint arXiv:2504.17269},
  year={2025}
}
* We do not include a visual comparison with SEGA [7] here. Since SEGA does not perform inversion in its pipeline, it is difficult to synthesize the exact source images when the initial random seeds are unknown. To mitigate this, in the quantitative comparisons presented in our paper, we generate a dedicated source image for SEGA for each pair of prompts and compute its metrics on that image.
[1] Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[2] Hertz, Amir, et al. "Prompt-to-prompt image editing with cross attention control." arXiv preprint arXiv:2208.01626 (2022).
[3] Cao, Mingdeng, et al. "MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[4] Wang, Qian, et al. "MDP: A generalized framework for text-guided image editing by manipulating the diffusion path." arXiv preprint arXiv:2303.16765 (2023).
[5] Wu, Chen, and Fernando De la Torre. "Contrastive prompts improve disentanglement in text-to-image diffusion models." arXiv preprint arXiv:2402.13490 (2024).
[6] Brack, Manuel, et al. "LEDITS++: Limitless image editing using text-to-image models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[7] Brack, Manuel, et al. "SEGA: Instructing text-to-image models using semantic guidance." Advances in Neural Information Processing Systems 36 (2023): 25365-25389.
[8] Guo, Yuwei, et al. "AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning." arXiv preprint arXiv:2307.04725 (2023).
[9] Qi, Chenyang, et al. "FateZero: Fusing attentions for zero-shot text-based video editing." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[10] Cong, Yuren, et al. "FLATTEN: Optical flow-guided attention for consistent text-to-video editing." arXiv preprint arXiv:2310.05922 (2023).
[11] Yang, Shuai, et al. "Rerender a video: Zero-shot text-guided video-to-video translation." SIGGRAPH Asia 2023 Conference Papers. 2023.
[12] Geyer, Michal, et al. "TokenFlow: Consistent diffusion features for consistent video editing." arXiv preprint arXiv:2307.10373 (2023).
[13] Wang, Wen, et al. "Zero-shot video editing using off-the-shelf image diffusion models." arXiv preprint arXiv:2303.17599 (2023).
[14] Li, Xirui, et al. "VidToMe: Video token merging for zero-shot video editing." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.