Release of the evaluation datasets for the world state model

Hi authors, wonderful work!

I notice you have validated the world state model's performance on three datasets:

- agentrewardbench: already open-sourced
- OSworld-full trajectories
- Prof/Office trajectories

Since we are trying to make a fair comparison, could you release the last two evaluation datasets? That would help us a lot.