PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop

arXiv | Website | HF Dataset: PisaBench
¹New York University   ²Intel Labs

Our PISA (Physics-Informed Simulation and Alignment) evaluation framework includes a new video dataset, where objects are dropped in a variety of real-world (Left) and synthetic (Right) scenes. For visualization purposes, we depict object motion by overlaying multiple video frames in each image shown above. Our real-world videos enable us to evaluate the physical accuracy of generated video output, and our synthetic videos enable us to improve accuracy through the use of post-training alignment methods.

Release

Contents

- Installation
- PisaBench
- Data Simulation
- Post-Training
- Acknowledgements
- Citation
- Contact

Installation

Clone the repository and submodules:

git clone git@github.com:vision-x-nyu/pisa-experiments.git
cd pisa-experiments
git submodule update --init --recursive

Create conda environment:

conda create --name pisa python=3.10
conda activate pisa

Evaluation

To run the evaluation, please install the SAM 2 dependencies. Installation details can be found in the SAM 2 repository.

Simulation

We have created a conda environment that is able to support Kubric. However, Kubric recommends using a Docker container, as some users have reported difficulties when installing the dependencies directly into an environment. If you are having trouble, you may want to try Docker; the instructions can be found here.
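If you go the Docker route, a rough sketch of a typical Kubric invocation looks like the following (the kubricdockerhub/kubruntu image name comes from Kubric's documentation; the worker-script path is a placeholder for whichever generation script you want to run):

```bash
# Pull the pre-built Kubric image (bundles Blender and PyBullet).
docker pull kubricdockerhub/kubruntu

# Run a Kubric worker script inside the container, mounting this repository.
docker run --rm --interactive \
  --user $(id -u):$(id -g) \
  --volume "$(pwd):/kubric" \
  kubricdockerhub/kubruntu \
  python3 <path/to/worker_script.py>
```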

For conda, please run:

pip install -r sim_data/requirements.txt

Post-Training

Our post-training code is based on Open-Sora, Depth-Anything-V2, SAM 2, and RAFT. To install Open-Sora dependencies:

pip install -r requirements/requirements-cu121.txt
pip install -v -e .

# Optional, recommended for fast speed, especially for training
# install flash attention
# set enable_flash_attn=False in config to disable flash attention
pip install packaging ninja
pip install flash-attn --no-build-isolation

# install apex
# set enable_layernorm_kernel=False in config to disable apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git

Installation details for Depth-Anything-V2, SAM 2, and RAFT can be found in their respective repositories.

PisaBench

Real World Videos

We curate a dataset comprising 361 videos demonstrating the dropping task. Each video begins with an object suspended by an invisible wire in the first frame. We cut the video clips to begin as soon as the wire is released, and we record the videos in slow motion at 120 frames per second (fps) with cellphone cameras mounted on tripods to eliminate camera motion.

We save each video in the following format:

├── 00000.jpg
├── 00001.jpg
...
├── movie.mp4
└── clip_info.json
  • clip_info.json is a JSON file that contains positive/negative point annotations and text descriptions for each video.

Real world videos can be found at: pisabench/real.zip.
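As a quick illustration, a clip in this layout can be loaded as follows (the clip path is a placeholder, and we only assume the file names listed above, not any particular keys inside clip_info.json):

```python
import json
from pathlib import Path

from PIL import Image


def load_real_clip(clip_dir):
    """Load the frames and annotations of one real-world PisaBench clip."""
    clip_dir = Path(clip_dir)

    # Frames are stored as sequentially numbered JPEGs: 00000.jpg, 00001.jpg, ...
    frame_paths = sorted(clip_dir.glob("[0-9]*.jpg"))
    frames = [Image.open(p).convert("RGB") for p in frame_paths]

    # clip_info.json holds the point annotations and text description.
    with open(clip_dir / "clip_info.json") as f:
        clip_info = json.load(f)

    return frames, clip_info


frames, clip_info = load_real_clip("pisabench/real/<clip_name>")  # hypothetical path
print(len(frames), list(clip_info.keys()))
```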

Simulated Test Videos

Since our post-training process uses a dataset of simulated videos, we also create a simulated test set of 60 videos for understanding sim2real transfer. We create two splits of 30 videos each: one featuring objects and backgrounds seen during training, and the other featuring unseen objects and backgrounds.

We save each video in the following format:

├── rgba_00000.jpg
├── rgba_00001.jpg
...
├── movie.gif
├── mask.npz
└── clip_info.json
  • mask.npz contains segmentation masks for all objects with shape [V, N, H, W], where V is the number of video frames, N is the number of objects, H is the height, and W is the width.
  • clip_info.json is a JSON file that contains annotations and text descriptions for each video.

Simulated test videos can be found at: pisabench/sim.zip.
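Similarly, a simulated clip's masks can be read like this (again a sketch; only the mask.npz shape convention stated above is assumed, and the clip path is a placeholder):

```python
import json
from pathlib import Path

import numpy as np


def load_sim_masks(clip_dir):
    """Load per-object segmentation masks for one simulated PisaBench clip."""
    clip_dir = Path(clip_dir)

    # mask.npz stores masks with shape [V, N, H, W]:
    # V video frames, N objects, H height, W width.
    data = np.load(clip_dir / "mask.npz")
    masks = data[data.files[0]]  # first (and typically only) array in the archive
    V, N, H, W = masks.shape

    with open(clip_dir / "clip_info.json") as f:
        clip_info = json.load(f)

    return masks, clip_info


masks, clip_info = load_sim_masks("pisabench/sim/<clip_name>")  # hypothetical path
print(masks.shape)  # (V, N, H, W)
```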

Evaluation

When evaluating a new model, please convert our videos to the corresponding resolution. Our evaluation framework currently supports a 1:1 aspect ratio. We provide example scripts to convert the resolution:

# Real world videos.
bash scripts/data_processing/convert_real.sh

# Simulated test videos.
bash scripts/data_processing/convert_sim.sh
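If you need to adapt the conversion to your own model, the core operation is just a center crop to a square followed by a resize. Below is a minimal sketch of that idea (not our conversion script; the target size and paths are placeholders):

```python
from pathlib import Path

from PIL import Image


def to_square(in_path, out_path, size=512):
    """Center-crop a frame to a 1:1 aspect ratio and resize it."""
    img = Image.open(in_path).convert("RGB")
    w, h = img.size
    s = min(w, h)
    left, top = (w - s) // 2, (h - s) // 2
    img = img.crop((left, top, left + s, top + s)).resize((size, size), Image.BICUBIC)
    img.save(out_path)


# Example: convert all frames of one clip (paths are placeholders).
out_dir = Path("converted_clip")
out_dir.mkdir(exist_ok=True)
for frame in sorted(Path("pisabench/real/<clip_name>").glob("*.jpg")):
    to_square(frame, out_dir / frame.name)
```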

Metric calculations require segmentation masks, and we provide scripts to generate them using SAM 2:

# Download SAM 2 checkpoint.
cd models && bash download_sam2.sh && cd ..

# Generate masks.
bash scripts/data_processing/generate_mask.sh
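If you prefer to generate masks in your own code rather than via the script, the sketch below follows SAM 2's documented video-predictor API. The checkpoint/config paths and the example prompt are assumptions, so adapt them to your setup and to the point annotations in clip_info.json:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Checkpoint/config paths are assumptions; point them at whatever download_sam2.sh fetched.
checkpoint = "models/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

video_dir = "path/to/generated_frames"             # directory of JPEG frames for one video
points = np.array([[320, 240]], dtype=np.float32)  # e.g. a positive click from clip_info.json
labels = np.array([1], dtype=np.int32)             # 1 = positive, 0 = negative

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path=video_dir)
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1, points=points, labels=labels
    )
    # Propagate the first-frame prompt through the whole clip to get per-frame masks.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```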

After generating masks, you can run the evaluation. We provide an example script to run the evaluation:

# Real world videos.
bash scripts/evaluation/eval_real.sh

# Simulated test videos.
bash scripts/evaluation/eval_sim.sh

The example config files are in configs. You can modify the config files to run the evaluation on your model.

Evaluation Results

We evaluate 4 open models (CogVideoX-5B-I2V, DynamiCrafter, Pyramid-Flow, and Open-Sora-V1.2) as well as 4 closed models (Sora, Kling-V1, Kling-V1.5, and Runway Gen3). We also evaluate Open-Sora post-trained through our Physics Supervised Fine-Tuning (PSFT) and Object Reward Optimization (ORO) stages.

Data Simulation

We use Google's Kubric for generating simulated physics videos. Kubric combines PyBullet and Blender for handling simulation and rendering seamlessly in a unified library.

We use the Google Scanned Objects (GSO) dataset which is already supported in Kubric. The GSO dataset consists of ~1000 high quality 3D objects that come from scans of a variety of everyday objects.
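For intuition, a minimal Kubric worker that drops a single GSO object onto a floor might look like the sketch below. It is adapted from Kubric's public examples rather than taken from our generation scripts in sim_data/, and the asset id, camera placement, and output path are arbitrary illustrations:

```python
import kubric as kb
from kubric.renderer.blender import Blender
from kubric.simulator.pybullet import PyBullet

# A short clip: 48 frames at the scene's default frame rate.
scene = kb.Scene(resolution=(256, 256), frame_start=1, frame_end=48)
simulator = PyBullet(scene)
renderer = Blender(scene)

# Static floor, a light, and a camera looking at the drop zone.
scene += kb.Cube(name="floor", scale=(10, 10, 0.1), position=(0, 0, -0.1), static=True)
scene += kb.DirectionalLight(name="sun", position=(1, -1, 3), look_at=(0, 0, 0), intensity=1.5)
scene.camera = kb.PerspectiveCamera(name="camera", position=(2, -2, 1.5), look_at=(0, 0, 0.5))

# Load a Google Scanned Objects asset and suspend it above the floor.
gso = kb.AssetSource.from_manifest("gs://kubric-public/assets/GSO/GSO.json")
obj = gso.create(asset_id="3D_Dollhouse_Happy_Brother")  # arbitrary example asset id
obj.position = (0, 0, 1.0)
scene += obj

# Let PyBullet simulate the drop, then render the frames with Blender.
simulator.run()
frames = renderer.render()
kb.write_image_dict(frames, "output/")
```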

For generating a single video, please run:

bash sim_data/generate_single_sample.sh

If you would like to generate many examples in parallel, you can run:

bash sim_data/generate_parallel.sh

Post-Training

Our approach to post-training is inspired by the two-stage pipeline commonly used for LLMs: supervised fine-tuning followed by reward modeling. We provide an example script to run inference:

bash scripts/inference/inference.sh

Stage 1: Physics Supervised Fine-Tuning (PSFT)

We fine-tune Open-Sora on simulated videos. We provide an example script to run PSFT:

bash scripts/post_training/base.sh

Stage 2: Object Reward Optimization (ORO)

We propose a Segmentation Reward, an Optical Flow Reward, and a Depth Reward, and implement them in the VADER framework. We provide example scripts to run ORO:

# Download SAM 2 and Depth-Anything-V2 checkpoints.
cd models && bash download_sam2.sh && bash download_depth_anything.sh && cd ..

# ORO(Seg)
bash scripts/post_training/oro_seg.sh

# ORO(Flow)
bash scripts/post_training/oro_flow.sh

# ORO(Depth)
bash scripts/post_training/oro_depth.sh
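To give a sense of what an object-level reward looks like, here is a simplified, illustrative segmentation-reward sketch (mean mask IoU between object masks extracted from generated frames and simulator ground-truth masks). It is not our exact ORO(Seg) implementation, which lives in the VADER-based training code:

```python
import torch


def segmentation_reward(pred_masks: torch.Tensor, gt_masks: torch.Tensor) -> torch.Tensor:
    """Illustrative object-mask reward: mean IoU over frames and objects.

    pred_masks, gt_masks: float tensors in [0, 1] with shape [V, N, H, W]
    (V frames, N objects), e.g. soft masks on generated frames vs. simulator masks.
    """
    eps = 1e-6
    intersection = (pred_masks * gt_masks).sum(dim=(-2, -1))
    union = (pred_masks + gt_masks - pred_masks * gt_masks).sum(dim=(-2, -1))
    iou = (intersection + eps) / (union + eps)  # [V, N]
    return iou.mean()


# Example with soft (differentiable) predicted masks, as used when backpropagating a reward.
pred = torch.rand(48, 1, 64, 64, requires_grad=True)
gt = (torch.rand(48, 1, 64, 64) > 0.5).float()
reward = segmentation_reward(pred, gt)
reward.backward()  # gradients flow back toward the video generator during reward optimization
```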

Acknowledgements

We are grateful to the following GitHub repositories for their valuable code and efforts:

Citation

If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our work 📝.

@article{li2025pisa,
  title={PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop},
  author={Li, Chenyu and Michel, Oscar and Pan, Xichen and Liu, Sainan and Roberts, Mike and Xie, Saining},
  journal={arXiv preprint arXiv:2503.09595},
  year={2025}
}

Contact

If you have any questions or suggestions, please feel free to contact:
