An example using the Textual Inversion method to personalize text2image models

Note: INC integration for this example is still in progress.

Textual Inversion is a method for personalizing text2image models such as Stable Diffusion with your own images. With just 3-5 images, you can teach Stable Diffusion a new concept. The textual_inversion.py script shows how to implement the training procedure and adapt it for Stable Diffusion.
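To make the idea concrete, here is a toy sketch in plain Python. It is not taken from textual_inversion.py: the vocabulary, vectors, and the squared-distance loss are hypothetical stand-ins. The point it illustrates is that Textual Inversion adds one new token embedding, initializes it from an existing token, and updates only that embedding while everything else stays frozen.

```python
# Toy illustration only: a dict stands in for the text encoder's embedding table.
# All names and numbers here are hypothetical.

def train_placeholder(embeddings, placeholder, initializer, target, lr=0.1, steps=50):
    """Learn one new embedding row by gradient descent; all other rows stay frozen."""
    # Start the new token from the embedding of an existing, related token.
    embeddings[placeholder] = list(embeddings[initializer])
    losses = []
    for _ in range(steps):
        vec = embeddings[placeholder]
        # Stand-in loss: squared distance to a target vector.
        # (The real method uses the diffusion denoising loss instead.)
        diff = [v - t for v, t in zip(vec, target)]
        losses.append(sum(d * d for d in diff))
        # The gradient of the squared distance is 2 * diff.
        # Only the placeholder row is updated.
        embeddings[placeholder] = [v - lr * 2 * d for v, d in zip(vec, diff)]
    return losses


vocab = {"toy": [0.5, -0.2, 0.1], "cat": [0.9, 0.3, -0.4]}
frozen_before = {token: list(vec) for token, vec in vocab.items()}
losses = train_placeholder(vocab, "<dicoo>", "toy", target=[0.1, 0.4, 0.7])
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

In the real script the same roles are played by --placeholder_token and --initializer_token.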

Installing the dependencies

Before running the scripts, make sure to install the library's training dependencies:

pip install -r requirements.txt

Note: the intel_extension_for_pytorch (IPEX) version should match the installed PyTorch version.
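A quick way to check this is to compare the version strings. The sketch below assumes that "match" means the same major.minor release; the helper names are hypothetical and not part of either library.

```python
import re


def major_minor(version):
    """Extract (major, minor) from a version string such as '1.13.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_match(torch_version, ipex_version):
    """Return True when both versions share the same major.minor (assumed rule)."""
    return major_minor(torch_version) == major_minor(ipex_version)


# In a real environment the inputs would come from the installed packages:
#   import torch
#   import intel_extension_for_pytorch as ipex
#   versions_match(torch.__version__, ipex.__version__)
print(versions_match("1.13.1+cpu", "1.13.0"))  # True
print(versions_match("2.0.1", "1.13.0"))       # False
```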

Nezha cartoon example

You need to accept the model license before downloading or using the weights. In this example we'll use model version v1-4, so you'll need to visit its card, read the license and tick the checkbox if you agree.

You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to this section of the documentation.

Run the following command to authenticate with your token:

huggingface-cli login

If you have already cloned the repo, then you won't need to go through these steps.

Now let's get our dataset. We use a single picture from the Hugging Face repository sd-concepts-library/dicoo2 and save it to the ./dicoo directory. The picture is shown below:
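Before launching training, it can help to confirm that the data directory really contains the image. The following is a minimal sketch; the helper name and the set of accepted file extensions are assumptions, not something the training scripts define.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever your files use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}


def list_training_images(data_dir):
    """Return the sorted image files found directly inside data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Training data directory not found: {root}")
    return sorted(
        path
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

For this example, list_training_images("./dicoo") should return exactly one path.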

Fine-tune on CPU using IPEX

The following script shows how to fine-tune on CPU with BF16:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="./dicoo"

# pass --use_bf16 to train in BF16
python textual_inversion_ipex.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<dicoo>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --use_bf16 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="dicoo_model"
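The command above combines --train_batch_size, --gradient_accumulation_steps, and --scale_lr. The sketch below works through the arithmetic, assuming the script follows the upstream diffusers convention of multiplying the base learning rate by the batch size, the accumulation steps, and the number of processes when --scale_lr is set. The function names are hypothetical.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Number of samples that contribute to one optimizer step."""
    return train_batch_size * gradient_accumulation_steps * num_processes


def scaled_learning_rate(base_lr, train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Learning rate after --scale_lr, under the assumed upstream scaling rule."""
    return base_lr * effective_batch_size(
        train_batch_size, gradient_accumulation_steps, num_processes
    )


# Values from the single-process command above.
batch = effective_batch_size(train_batch_size=1, gradient_accumulation_steps=4)
lr = scaled_learning_rate(5.0e-4, train_batch_size=1, gradient_accumulation_steps=4)
print(batch, lr)  # 4 samples per optimizer step, learning rate 0.002
```

Under the same assumption, running with more processes raises the effective learning rate proportionally, so the base value may need to be lowered for distributed runs.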

Distributed

You need to install Torch-CCL (the oneccl_bindings_for_pytorch package) first.

oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="./dicoo"

python -m intel_extension_for_pytorch.cpu.launch \
  --throughput_mode \
  --distributed \
    textual_inversion_ipex.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<dicoo>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --use_bf16 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="dicoo_model"

Fine-tune on GPU using Accelerate

Initialize a 🤗 Accelerate environment with:

accelerate config

Then launch the training with:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="./dicoo"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<dicoo>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="dicoo_model"

Inference

Once you have trained a model using one of the commands above, you can run inference with the StableDiffusionPipeline. Make sure to include the placeholder_token in your prompt.

import os

from diffusers import StableDiffusionPipeline
import torch

# directory passed as --output_dir during training
model_id = "dicoo_model"

# use cpu (FP32 by default; low precision coming soon)
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float)

# use gpu
#pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# the prompt must contain the placeholder token used for training
prompt = "a lovely <dicoo> in red dress and hat, in the snowy and brightly night, with many brightly buildings."

image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]

# create the output directory first; image.save() does not create it
os.makedirs("./generated_images", exist_ok=True)
image.save("./generated_images/dicoo_christmas.png")
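Because a prompt without the placeholder token would not use the learned concept, a small guard can catch that mistake before a long generation run. This is a minimal sketch; build_prompt is a hypothetical helper, not part of diffusers.

```python
PLACEHOLDER_TOKEN = "<dicoo>"  # must equal the --placeholder_token used for training


def build_prompt(template, placeholder=PLACEHOLDER_TOKEN):
    """Fill a '{}' template with the placeholder token and verify it is present."""
    prompt = template.format(placeholder)
    if placeholder not in prompt:
        raise ValueError(f"Prompt does not contain the placeholder token {placeholder!r}")
    return prompt


prompt = build_prompt("a lovely {} in red dress and hat")
print(prompt)  # a lovely <dicoo> in red dress and hat
```

The resulting string can be passed to the pipeline in place of a hand-written prompt.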

Here is a sample image generated by the fine-tuned model: