In recent years, large-scale pre-trained diffusion models have demonstrated outstanding capabilities in image and video generation tasks. However, existing models tend to produce visual objects commonly seen in their training data, even when these diverge from the user's prompt. The underlying reason for these inaccurate results is the model's difficulty in sampling from the specific intervals of the initial noise distribution that correspond to the prompt. Moreover, it is challenging to optimize the initial distribution directly, given that the diffusion process involves multiple denoising steps. In this paper, we introduce a Fine-tuning Initial Noise Distribution (FIND) framework with policy optimization, which unleashes the potential of pre-trained diffusion networks by directly optimizing the initial distribution to align the generated content with user-input prompts. To this end, we first reformulate the diffusion denoising procedure as a one-step Markov decision process and employ policy optimization to directly optimize the initial distribution. In addition, we propose a dynamic reward calibration module to ensure training stability during optimization. Furthermore, we introduce a ratio clipping algorithm that reuses historical data for network training and prevents the optimized distribution from deviating too far from the original policy, restraining excessive optimization magnitudes. Extensive experiments demonstrate the effectiveness of our method in both text-to-image and text-to-video tasks, surpassing SOTA methods in achieving consistency between prompts and the generated content. Our method is also 10 times faster than the SOTA approach.
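The one-step MDP view and the ratio-clipped policy update can be illustrated with a minimal PyTorch sketch. This is a hedged illustration rather than the paper's implementation: the latent shape, learning rate, clipping range, and the stand-in reward are all assumptions, and in practice the reward would come from scoring the denoised output against the prompt.

```python
# Minimal sketch: treat the initial noise as the single action of a one-step
# MDP and optimize a Gaussian noise distribution with a PPO-style clipped
# surrogate. Shapes, hyperparameters, and the reward are placeholders.
import torch

latent_shape = (4, 8, 8)                    # stand-in latent size; real models differ
mean = torch.zeros(latent_shape, requires_grad=True)
log_std = torch.zeros(latent_shape, requires_grad=True)
opt = torch.optim.Adam([mean, log_std], lr=1e-3)
clip_eps = 0.2                              # ratio-clipping range (assumed value)

def sample_noise(batch):
    std = log_std.exp()
    eps = torch.randn(batch, *latent_shape)
    z = mean + std * eps                    # reparameterized initial-noise sample
    dist = torch.distributions.Normal(mean, std)
    logp = dist.log_prob(z).flatten(1).sum(-1)
    return z, logp

def reward_fn(z):
    # Placeholder reward: in the paper this would be a text-content similarity
    # score of the output denoised from initial noise z given the prompt.
    return -z.flatten(1).pow(2).mean(-1)

# Collect a batch under the current (old) policy.
with torch.no_grad():
    z, logp_old = sample_noise(batch=16)
    reward = reward_fn(z)
    advantage = reward - reward.mean()      # simple baseline subtraction

# Reuse the collected batch for several clipped-surrogate updates.
for _ in range(4):
    dist = torch.distributions.Normal(mean, log_std.exp())
    logp_new = dist.log_prob(z).flatten(1).sum(-1)
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    loss = -torch.min(ratio * advantage, clipped * advantage).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The clipping keeps the updated noise distribution close to the distribution that generated the batch, which is the role the ratio clipping algorithm plays in the description above.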
Our approach is theoretically applicable to any diffusion-based method, whether text-to-image, text-to-video, text-to-3D, or others. To demonstrate its versatility, we use text-to-video as a case study, analyzing its performance both qualitatively and quantitatively. Specifically, we employ ModelScope as our baseline, a text-to-video diffusion model trained on large-scale datasets. ViCLIP, a pre-trained model that evaluates the similarity between text and video, serves as our reward function (a minimal sketch of this reward setup is given after the examples below). To verify the effectiveness of our method, we selected four prompts that the baseline model struggles to generate directly: A green dog is running on the grass., A dog is running on the moon., A panda is walking on the grass, from left to right., and A monkey is playing guitar. These cover unusual colors, displacement control, anomalous positions, and abnormal behaviors.
A green dog is running on the grass
A dog is running on the moon
A panda is walking on the grass, from left to right
A monkey is playing guitar
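A minimal sketch of the reward wiring is shown below. Both helper functions are hypothetical placeholders standing in for the ModelScope text-to-video pipeline and the ViCLIP text-video scorer; the tensor shapes are illustrative only and not the actual APIs of either library.

```python
# Hypothetical reward wiring: generate_video() and viclip_similarity() are
# placeholder names, not real ModelScope / ViCLIP calls; swap in the actual
# pipeline and scorer in practice.
import torch

def generate_video(prompt: str, init_noise: torch.Tensor) -> torch.Tensor:
    # Placeholder for the baseline text-to-video diffusion model conditioned
    # on a fixed initial noise tensor; returns frames as (T, C, H, W).
    return torch.rand(16, 3, 256, 256)

def viclip_similarity(prompt: str, video: torch.Tensor) -> float:
    # Placeholder for a ViCLIP-style text-video similarity score.
    return 0.5

def reward(prompt: str, init_noise: torch.Tensor) -> float:
    # The reward for a sampled initial noise is the similarity between the
    # prompt and the video generated from that noise.
    return viclip_similarity(prompt, generate_video(prompt, init_noise))

prompts = [
    "A green dog is running on the grass.",
    "A dog is running on the moon.",
    "A panda is walking on the grass, from left to right.",
    "A monkey is playing guitar.",
]
scores = [reward(p, torch.randn(4, 16, 32, 32)) for p in prompts]
```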
@inproceedings{chen2024find,
title={FIND: Fine-tuning Initial Noise Distribution with Policy Optimization for Diffusion Models},
author={Chen, Changgu and Yang, Libing and Yang, Xiaoyan and Chen, Lianggangxu and He, Gaoqi and Wang, Changbo and Li, Yang},
booktitle={ACM Multimedia 2024},
year={2024}
}