- Apr 26, 2025: We release the first-stage models of Concat-ID based on Wan2.1-T2V-1.3B for single identity scenarios.
- Improving the model architecture and training strategy.
- Training the Wan2.1-T2V-1.3B on 720P videos.
Identity-preserving video generation is an interesting research topic, and we look forward to progressively improving this project.
We release the first-stage model of Concat-ID based on Wan2.1-T2V-1.3B for single-identity scenarios. We first train Concat-ID-Wan on approximately 600,000 49-frame videos and then fine-tune it on approximately 700,000 81-frame videos.
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
pip install insightface onnxruntime
python inference_wan.py --image_path="{your_image_path}" --prompt="{your_prompt}"
python inference_wan.py --image_path="examples/images/69_politicians_woman_Tulsi_Gabbard_4.png" --prompt="A woman, dressed in casual attire, sits by a sunlit window sketching in a notebook, pausing occasionally to look up with a playful grin. As sunlight filters through the sheer curtains, casting soft shadows across the room, the woman twirls a pencil absentmindedly before adding quick strokes to the page. The scene is alive with a sense of relaxed creativity, as the warm afternoon glow bathes the space in a gentle, inviting atmosphere." --output_dir="output/1/"
python inference_wan.py --image_path="examples/images/43_stars_man_Leonardo_DiCaprio_3.png" --prompt="A man with a distant look in his eyes stands alone on the deck of the Titanic, gripping the railing tightly. He watches the horizon, lost in thought, as the cold sea breeze brushes against his face. His mind drifts between excitement for the journey ahead and an unshakable sense of unease about what lies beneath the surface of the dark, mysterious waters." --output_dir="output/2/"
python inference_wan.py --image_path="examples/images/80_normal_man_5.jpg" --prompt="On a warm summer evening, a tall and athletic person is energetically playing a basketball game on the well-lit community court, dribbling the ball with expert precision, dodging imaginary opponents, and shooting hoops with impressive accuracy." --output_dir="output/3/"
*Reference images and their corresponding generated videos.*
This part, including both the inference and training code, is based on CogVideoX1.0 SAT and SAT (SwissArmyTransformer).
We have only tested the inference script on an H800, where it required at most 24 GB of GPU memory. This requirement may vary across server environments.
pip install -r requirements.txt
First, download the model weights from the SAT mirror to the project directory.
pip install modelscope
modelscope download --model 'yongzhong/Concat-ID' --local_dir 'models'
or
git lfs install
git clone https://www.modelscope.cn/yongzhong/Concat-ID.git models
If you want to download the first-stage pre-training model, simply replace `Concat-ID` with `Concat-ID-pre-training` and `models` with `pre-training-models`. Note that the pre-training model offers better identity consistency but lower editability.
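Concretely, applying these substitutions to the download commands above gives:

modelscope download --model 'yongzhong/Concat-ID-pre-training' --local_dir 'pre-training-models'

or

git clone https://www.modelscope.cn/yongzhong/Concat-ID-pre-training.git pre-training-models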
args:
  load: "./models/single-identity/" # Path to the transformer folder
  input_file: examples/single_identity.txt # Plain text file, can be edited
  output_dir: outputs/single-identity
- Each line in `single_identity.txt` should follow the format `{prompt}@@{image_path}`, where `{image_path}` is the path to the reference image and `{prompt}` is the corresponding prompt (see the example line below). If you are unsure how to write prompts, use this code to call an LLM for refinement.
- To modify the output video location, change the `output_dir` parameter.
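For illustration, a line in `single_identity.txt` could look like the following (the prompt here is shortened and the image path is one of the example images above):

A woman, dressed in casual attire, sits by a sunlit window sketching in a notebook.@@examples/images/69_politicians_woman_Tulsi_Gabbard_4.png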
Modifying the configuration for multiple identities is similar to doing so for a single identity.
For a single identity:
bash inference_single_identity.sh
For multiple identities:
bash inference_two_identities.sh
We need `training-data.json` and `validation-data.json` for fine-tuning and validation. Each element in the JSON file should be a dictionary formatted as follows:
{
    "video_path": "{video_path}",
    "caption": "{prompt}",
    "image_path": "{image_path}"
}
`{video_path}` indicates the path of a training video, `{prompt}` indicates the corresponding prompt, and `{image_path}` indicates the path of the corresponding reference image. For multiple identities, use `@@` to distinguish different reference images.

For example, `training-data.json` with one training sample would look like this:
[
    {
        "video_path": "/videos/training_1.mp4",
        "caption": "A man.",
        "image_path": "/images/reference_1.png"
    }
]
For multiple reference images:
[
    {
        "video_path": "/videos/training_1.mp4",
        "caption": "Two people.",
        "image_path": "/images/reference_1.png@@/images/reference_2.png"
    }
]
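If you prefer to generate these JSON files programmatically, the following is a minimal sketch. The directory layout, the file-naming convention, and the `build_entries` helper are assumptions made for illustration, so adapt the pairing logic to how your videos, prompts, and reference images are actually organized.

```python
import json
from pathlib import Path

def build_entries(video_dir, image_dir, captions):
    """Pair each training video with its caption and reference image(s).

    `captions` maps a video file stem (e.g. "training_1") to its prompt.
    Multiple reference images for one sample are joined with "@@",
    matching the format described above.
    """
    entries = []
    for video_path in sorted(Path(video_dir).glob("*.mp4")):
        stem = video_path.stem
        # Assumed naming convention: reference images start with the video stem.
        refs = sorted(Path(image_dir).glob(f"{stem}*.png"))
        if not refs or stem not in captions:
            continue  # skip samples with a missing prompt or reference image
        entries.append({
            "video_path": str(video_path),
            "caption": captions[stem],
            "image_path": "@@".join(str(p) for p in refs),
        })
    return entries

if __name__ == "__main__":
    captions = {"training_1": "A man."}  # replace with your real prompts
    with open("training-data.json", "w") as f:
        json.dump(build_entries("/videos", "/images", captions), f, indent=4)
```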
We only tested full-parameter fine-tuning.
We need to specify the paths of both training data and validation data in `configs/training/sft_single_identity.yaml` for a single identity and in `configs/training/sft_two_identities.yaml` for multiple identities:
train_data: [ "your_train_data_path" ]
valid_data: [ "your_val_data_path" ] # Training and validation sets
For example:
train_data: [ "/json/training-data.json" ]
valid_data: [ "/json/validation-data.json" ] # Training and validation sets
For a single identity:
bash finetune_single_identity.sh # Multi GPUs
For multiple identities:
bash finetune_two_identities.sh # Multi GPUs
The SAT weight format differs from Hugging Face's format. If you want to convert the weights, please run this script.
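The script referenced above handles the actual conversion. Purely as an illustration of the general pattern (load the SAT checkpoint, rename parameter keys, save in a Hugging Face-friendly format such as safetensors), a hedged sketch might look like the following; the checkpoint path and `KEY_MAP` below are placeholders, not the real Concat-ID key mapping.

```python
import torch
from safetensors.torch import save_file

# Placeholder mapping: the real SAT -> Hugging Face key names are defined
# in the provided conversion script, not here.
KEY_MAP = {
    "model.diffusion_model.": "transformer.",
}

def convert_sat_to_hf(sat_ckpt_path: str, out_path: str) -> None:
    state = torch.load(sat_ckpt_path, map_location="cpu")
    # SAT/DeepSpeed-style checkpoints often nest the weights under "module".
    state = state.get("module", state)
    converted = {}
    for name, tensor in state.items():
        for old, new in KEY_MAP.items():
            if name.startswith(old):
                name = new + name[len(old):]
                break
        converted[name] = tensor.contiguous()
    save_file(converted, out_path)

convert_sat_to_hf("path/to/sat_checkpoint.pt", "transformer.safetensors")
```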
- Due to limitations in the base model’s capabilities (i.e., CogVideoX-5B), we do not compare our method with closed-source commercial tools.
- Currently, we utilize VAEs solely as feature extractors, relying on the model’s inherent ability to process low-level features.
- Similar to common video generation models, our approach faces challenges in preserving the integrity of human body structures, such as the number of fingers, when handling particularly complex motions.
We appreciate the following works: