Train an inverse dynamics model that predicts per-frame actions from screen-recording sequences.
Install dependencies:

uv sync
# optional test deps
uv sync --extra dev
uv run pre-commit install

For screen-recording data, we use crowd-cast. The data is expected in the following form:
crowd_cast_root/
.../<user_id>/recordings/recording_<session-id>_seg####.mp4
.../<user_id>/keylogs/input_<session-id>_seg####.msgpack
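To illustrate how this layout is consumed, here is a minimal pairing sketch (not part of the repo; the helper name, glob pattern, and regex are our assumptions) that matches each recording segment to its keylog segment by session id and segment index:

import re
from pathlib import Path

# Hypothetical helper: pair recording_<session>_seg####.mp4 with input_<session>_seg####.msgpack.
SEG_RE = re.compile(r"recording_(?P<session>.+)_seg(?P<seg>\d+)\.mp4$")

def pair_segments(crowd_cast_root):
    root = Path(crowd_cast_root)
    pairs = []
    for video in sorted(root.glob("**/recordings/recording_*_seg*.mp4")):
        m = SEG_RE.search(video.name)
        if m is None:
            continue
        keylog = video.parent.parent / "keylogs" / f"input_{m['session']}_seg{m['seg']}.msgpack"
        if keylog.exists():
            pairs.append((video, keylog))
    return pairs

Segments without a matching keylog are simply skipped here; the actual preprocessor may handle them differently.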
Preprocess the crowd-cast data into ArrayRecord format for IDM training:
uv run python -m data.video_to_array_records \
--input-path /path/to/crowd_cast_root \
--output-path /path/to/idm_data \
--target-width 160 \
--target-height 90 \
--target-fps 10 \
--top-bar-fraction 0.15 \
--black-ratio 0.95 \
--chunk-size 160 \
--chunks-per-file 100 \
--num-workers 16

The generated data directory looks like:
idm_data/
metadata.json
train/*.array_record
val/*.array_record
test/*.array_record
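If you want to sanity-check the output outside the training pipeline, here is a small inspection sketch (assuming the array_record Python bindings and that metadata.json is plain JSON; the exact metadata schema is repo-specific):

import json
from pathlib import Path

from array_record.python.array_record_module import ArrayRecordReader

data_root = Path("/path/to/idm_data")

# metadata.json describes the dataset; its exact fields depend on the preprocessor.
print(json.loads((data_root / "metadata.json").read_text()))

# Count records in the first training shard.
shard = next((data_root / "train").glob("*.array_record"))
reader = ArrayRecordReader(str(shard))
print(shard.name, reader.num_records(), "records")
reader.close()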
Lumine-inspired action format (per frame, see Lumine):
NO_OP
MOUSE:dx,dy,dz
MOUSE:dx,dy,dz ; <pressed_keys>
dx,dy,dz are per-frame relative mouse deltas (quantized/clipped during preprocessing), and pressed keys are appended as a sorted space-separated list when present.
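To make the format concrete, here is a hedged encode/parse sketch; the function names are illustrative, and the repo's actual quantization, clipping bounds, and key naming may differ:

def encode_action(dx, dy, dz, keys=()):
    """Encode one frame's action; NO_OP when there is no mouse delta and no pressed key."""
    if (dx, dy, dz) == (0, 0, 0) and not keys:
        return "NO_OP"
    action = f"MOUSE:{dx},{dy},{dz}"
    if keys:
        # Pressed keys: sorted, space-separated, appended after " ; ".
        action += " ; " + " ".join(sorted(keys))
    return action

def parse_action(action):
    """Inverse of encode_action: returns (dx, dy, dz, [pressed_keys])."""
    if action == "NO_OP":
        return 0, 0, 0, []
    mouse, _, keys = action.partition(" ; ")
    dx, dy, dz = (int(v) for v in mouse.removeprefix("MOUSE:").split(","))
    return dx, dy, dz, keys.split()

assert parse_action(encode_action(3, -1, 0, {"w", "shift"})) == (3, -1, 0, ["shift", "w"])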
Single GPU (baseline):
torchrun --nproc_per_node=1 -m idm.train \
--model-id Qwen/Qwen3-VL-2B-Instruct \
--data-root /path/to/idm_data \
--image-h 90 --image-w 160 --image-c 3 \
--seq-len 32 \
--global-batch-size 8 \
--grad-accum 1 \
--max-steps 3000 \
--lr 2e-5 \
--lr-schedule wsd \
--warmup-steps 200 \
--wsd-decay-steps 600 \
--precision bf16 \
--use-lora True \
--wandb-enable True \
--wandb-project idm \
--wandb-run-name idm_qwen2b_baseline \
--out-dir ./runs/idm_qwen2b

If you are not using wandb, set --wandb-enable False.
Multi-GPU (example: 8 GPUs):
torchrun --nproc_per_node=8 -m idm.train \
--data-root /path/to/idm_data \
--global-batch-size 64 \
--out-dir ./runs/idm_8gpu \
--wandb-enable False

Resume:
torchrun --nproc_per_node=8 -m idm.train --data-root /path/to/idm_data --resume-from latest

Checkpoints are written under out_dir/checkpoints/.
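If you need to locate the newest checkpoint outside the trainer (for example, to export it), here is a small sketch; --resume-from latest is handled by idm.train itself, and the mtime-based selection below is our assumption, not the trainer's documented behavior:

from pathlib import Path

def newest_checkpoint(out_dir):
    # Hypothetical helper: pick the most recently modified entry under out_dir/checkpoints/.
    candidates = sorted((Path(out_dir) / "checkpoints").glob("*"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None

print(newest_checkpoint("./runs/idm_8gpu"))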
Run the tests:

uv run pytest tests/