PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop

arXiv | Website | HF Dataset: PisaBench
¹New York University   ²Intel Labs

Our PISA (Physics-Informed Simulation and Alignment) evaluation framework includes a new video dataset, where objects are dropped in a variety of real-world (Left) and synthetic (Right) scenes. For visualization purposes, we depict object motion by overlaying multiple video frames in each image shown above. Our real-world videos enable us to evaluate the physical accuracy of generated video output, and our synthetic videos enable us to improve accuracy through the use of post-training alignment methods.

Release

Contents

- Installation
- PisaBench
- Data Simulation
- Post-Training
- Acknowledgements
- Citation
- Contact

Installation

Clone the repository and submodules:

git clone git@github.com:vision-x-nyu/pisa-experiments.git
cd pisa-experiments
git submodule update --init --recursive

Create conda environment:

conda create --name pisa python=3.10
conda activate pisa

Evaluation

To run the evaluation, please install the SAM 2 dependencies. Installation details can be found in the SAM 2 repository.

Simulation

We have created a conda environment that is able to support Kubric. However, Kubric recommends using a Docker container, as some users have reported difficulties when installing the dependencies directly into an environment. If you are having trouble, you may want to try Docker; the instructions can be found here.
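If you go the Docker route, a rough sketch of a typical Kubric invocation looks like the following (the kubricdockerhub/kubruntu image name comes from Kubric's documentation; the worker-script path is a placeholder for whichever generation script you want to run):

```bash
# Pull the pre-built Kubric image (bundles Blender and PyBullet).
docker pull kubricdockerhub/kubruntu

# Run a Kubric worker script inside the container, mounting this repository.
docker run --rm --interactive \
  --user $(id -u):$(id -g) \
  --volume "$(pwd):/kubric" \
  kubricdockerhub/kubruntu \
  python3 <path/to/worker_script.py>
```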

For conda, please run:

pip install -r sim_data/requirements.txt

Post-Training

Our post-training code is based on Open-Sora, Depth-Anything-V2, SAM 2, and RAFT. To install Open-Sora dependencies:

pip install -r requirements/requirements-cu121.txt
pip install -v -e .

# Optional, recommended for fast speed, especially for training
# install flash attention
# set enable_flash_attn=False in config to disable flash attention
pip install packaging ninja
pip install flash-attn --no-build-isolation

# install apex
# set enable_layernorm_kernel=False in config to disable apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git

Installation details for Depth-Anything-V2, SAM 2, and RAFT can be found in their respective repositories.

PisaBench

Real World Videos

We curate a dataset comprising 361 videos demonstrating the dropping task. Each video begins with an object suspended by an invisible wire in the first frame. We cut the video clips to begin as soon as the wire is released, and we record the videos in slow motion at 120 frames per second (fps) with cellphone cameras mounted on tripods to eliminate camera motion.

We save each video in the following format:

├── 00000.jpg
├── 00001.jpg
...
├── movie.mp4
└── clip_info.json
  • clip_info.json is a JSON file that contains positive/negative point annotations and text descriptions for each video.

Real world videos can be found at: pisabench/real.zip.
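As a quick illustration, a clip in this layout can be loaded as follows (the clip path is a placeholder, and we only assume the file names listed above, not any particular keys inside clip_info.json):

```python
import json
from pathlib import Path

from PIL import Image


def load_real_clip(clip_dir):
    """Load the frames and annotations of one real-world PisaBench clip."""
    clip_dir = Path(clip_dir)

    # Frames are stored as sequentially numbered JPEGs: 00000.jpg, 00001.jpg, ...
    frame_paths = sorted(clip_dir.glob("[0-9]*.jpg"))
    frames = [Image.open(p).convert("RGB") for p in frame_paths]

    # clip_info.json holds the point annotations and text description.
    with open(clip_dir / "clip_info.json") as f:
        clip_info = json.load(f)

    return frames, clip_info


frames, clip_info = load_real_clip("pisabench/real/<clip_name>")  # hypothetical path
print(len(frames), list(clip_info.keys()))
```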

Simulated Test Videos

Since our post-training process uses a dataset of simulated videos, we also create a simulated test set of 60 videos for understanding sim2real transfer. We create two splits of 30 videos each: one featuring objects and backgrounds seen during training, and the other featuring unseen objects and backgrounds.

We save each video in the following format:

├── rgba_00000.jpg
├── rgba_00001.jpg
...
├── movie.gif
├── mask.npz
└── clip_info.json
  • mask.npz contains segmentation masks for all objects with shape [V, N, H, W], where V is the number of video frames, N is the number of objects, H is the height, and W is the width.
  • clip_info.json is a JSON file that contains annotations and text descriptions for each video.

Simulated test videos can be found at: pisabench/sim.zip.
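Similarly, a simulated clip's masks can be read like this (again a sketch; only the mask.npz shape convention stated above is assumed, and the clip path is a placeholder):

```python
import json
from pathlib import Path

import numpy as np


def load_sim_masks(clip_dir):
    """Load per-object segmentation masks for one simulated PisaBench clip."""
    clip_dir = Path(clip_dir)

    # mask.npz stores masks with shape [V, N, H, W]:
    # V video frames, N objects, H height, W width.
    data = np.load(clip_dir / "mask.npz")
    masks = data[data.files[0]]  # first (and typically only) array in the archive
    V, N, H, W = masks.shape

    with open(clip_dir / "clip_info.json") as f:
        clip_info = json.load(f)

    return masks, clip_info


masks, clip_info = load_sim_masks("pisabench/sim/<clip_name>")  # hypothetical path
print(masks.shape)  # (V, N, H, W)
```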

Evaluation

When evaluating a new model, please convert our videos to the corresponding resolution. Our evaluation framework currently supports a 1:1 aspect ratio. We provide example scripts to convert the resolution:

# Real world videos.
bash scripts/data_processing/convert_real.sh

# Simulated test videos.
bash scripts/data_processing/convert_sim.sh
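If you need to adapt the conversion to your own model, the core operation is just a center crop to a square followed by a resize. Below is a minimal sketch of that idea (not our conversion script; the target size and paths are placeholders):

```python
from pathlib import Path

from PIL import Image


def to_square(in_path, out_path, size=512):
    """Center-crop a frame to a 1:1 aspect ratio and resize it."""
    img = Image.open(in_path).convert("RGB")
    w, h = img.size
    s = min(w, h)
    left, top = (w - s) // 2, (h - s) // 2
    img = img.crop((left, top, left + s, top + s)).resize((size, size), Image.BICUBIC)
    img.save(out_path)


# Example: convert all frames of one clip (paths are placeholders).
out_dir = Path("converted_clip")
out_dir.mkdir(exist_ok=True)
for frame in sorted(Path("pisabench/real/<clip_name>").glob("*.jpg")):
    to_square(frame, out_dir / frame.name)
```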

Metric calculations require segmentation masks, and we provide scripts to generate them using SAM 2:

# Download SAM 2 checkpoint.
cd models && bash download_sam2.sh && cd ..

# Generate masks.
bash scripts/data_processing/generate_mask.sh
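If you prefer to generate masks in your own code rather than via the script, the sketch below follows SAM 2's documented video-predictor API. The checkpoint/config paths and the example prompt are assumptions, so adapt them to your setup and to the point annotations in clip_info.json:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Checkpoint/config paths are assumptions; point them at whatever download_sam2.sh fetched.
checkpoint = "models/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

video_dir = "path/to/generated_frames"             # directory of JPEG frames for one video
points = np.array([[320, 240]], dtype=np.float32)  # e.g. a positive click from clip_info.json
labels = np.array([1], dtype=np.int32)             # 1 = positive, 0 = negative

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path=video_dir)
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1, points=points, labels=labels
    )
    # Propagate the first-frame prompt through the whole clip to get per-frame masks.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```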

After generating masks, you can run the evaluation. We provide an example script to run the evaluation:

# Real world videos.
bash scripts/evaluation/eval_real.sh

# Simulated test videos.
bash scripts/evaluation/eval_sim.sh

The example config files are in configs. You can modify the config files to run the evaluation on your model.

Evaluation Results

We evaluate 4 open models (CogVideoX-5B-I2V, DynamiCrafter, Pyramid-Flow, and Open-Sora-V1.2) as well as 4 closed models (Sora, Kling-V1, Kling-V1.5, and Runway Gen3). We also evaluate Open-Sora post-trained through our Physics Supervised Fine-Tuning (PSFT) and Object Reward Optimization (ORO) stages.

Data Simulation

We use Google's Kubric for generating simulated physics videos. Kubric combines PyBullet and Blender for handling simulation and rendering seamlessly in a unified library.

We use the Google Scanned Objects (GSO) dataset which is already supported in Kubric. The GSO dataset consists of ~1000 high quality 3D objects that come from scans of a variety of everyday objects.
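For intuition, a minimal Kubric worker that drops a single GSO object onto a floor might look like the sketch below. It is adapted from Kubric's public examples rather than taken from our generation scripts in sim_data/, and the asset id, camera placement, and output path are arbitrary illustrations:

```python
import kubric as kb
from kubric.renderer.blender import Blender
from kubric.simulator.pybullet import PyBullet

# A short clip: 48 frames at the scene's default frame rate.
scene = kb.Scene(resolution=(256, 256), frame_start=1, frame_end=48)
simulator = PyBullet(scene)
renderer = Blender(scene)

# Static floor, a light, and a camera looking at the drop zone.
scene += kb.Cube(name="floor", scale=(10, 10, 0.1), position=(0, 0, -0.1), static=True)
scene += kb.DirectionalLight(name="sun", position=(1, -1, 3), look_at=(0, 0, 0), intensity=1.5)
scene.camera = kb.PerspectiveCamera(name="camera", position=(2, -2, 1.5), look_at=(0, 0, 0.5))

# Load a Google Scanned Objects asset and suspend it above the floor.
gso = kb.AssetSource.from_manifest("gs://kubric-public/assets/GSO/GSO.json")
obj = gso.create(asset_id="3D_Dollhouse_Happy_Brother")  # arbitrary example asset id
obj.position = (0, 0, 1.0)
scene += obj

# Let PyBullet simulate the drop, then render the frames with Blender.
simulator.run()
frames = renderer.render()
kb.write_image_dict(frames, "output/")
```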

For generating a single video, please run:

bash sim_data/generate_single_sample.sh

If you would like to generate many examples in parallel, you can run:

bash sim_data/generate_parallel.sh

Post-Training

Our approach to post-training is inspired by the two-stage pipeline commonly used for LLMs: supervised fine-tuning followed by reward modeling. We provide an example script to run inference:

bash scripts/inference/inference.sh

Stage 1: Physics Supervised Fine-Tuning (PSFT)

We fine-tune Open-Sora on simulated videos. We provide an example script to run PSFT:

bash scripts/post_training/base.sh

Stage 2: Object Reward Optimization (ORO)

We propose a Segmentation Reward, an Optical Flow Reward, and a Depth Reward, and implement them in the VADER framework. We provide example scripts to run ORO:

# Download SAM 2 and Depth-Anything-V2 checkpoints.
cd models && bash download_sam2.sh && bash download_depth_anything.sh && cd ..

# ORO(Seg)
bash scripts/post_training/oro_seg.sh

# ORO(Flow)
bash scripts/post_training/oro_flow.sh

# ORO(Depth)
bash scripts/post_training/oro_depth.sh
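To give a sense of what an object-level reward looks like, here is a simplified, illustrative segmentation-reward sketch (mean mask IoU between object masks extracted from generated frames and simulator ground-truth masks). It is not our exact ORO(Seg) implementation, which lives in the VADER-based training code:

```python
import torch


def segmentation_reward(pred_masks: torch.Tensor, gt_masks: torch.Tensor) -> torch.Tensor:
    """Illustrative object-mask reward: mean IoU over frames and objects.

    pred_masks, gt_masks: float tensors in [0, 1] with shape [V, N, H, W]
    (V frames, N objects), e.g. soft masks on generated frames vs. simulator masks.
    """
    eps = 1e-6
    intersection = (pred_masks * gt_masks).sum(dim=(-2, -1))
    union = (pred_masks + gt_masks - pred_masks * gt_masks).sum(dim=(-2, -1))
    iou = (intersection + eps) / (union + eps)  # [V, N]
    return iou.mean()


# Example with soft (differentiable) predicted masks, as used when backpropagating a reward.
pred = torch.rand(48, 1, 64, 64, requires_grad=True)
gt = (torch.rand(48, 1, 64, 64) > 0.5).float()
reward = segmentation_reward(pred, gt)
reward.backward()  # gradients flow back toward the video generator during reward optimization
```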

Acknowledgements

We are grateful to the following GitHub repositories for their valuable code and efforts:

Citation

If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our work 📝.

@article{li2025pisa,
  title={PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop},
  author={Li, Chenyu and Michel, Oscar and Pan, Xichen and Liu, Sainan and Roberts, Mike and Xie, Saining},
  journal={arXiv preprint arXiv:2503.09595},
  year={2025}
}

Contact

If you have any questions or suggestions, please feel free to contact:
