
Conversation

ollmer (Collaborator) commented Oct 6, 2025

Ray-based implementation of ActorLoop that replaces multiprocessing and in-memory queues.

Task Execution

  • Uses ray.remote() instead of multiprocessing.Process
  • Initializes Ray with configurable worker count and dashboard
  • Tasks execute the rollout policy in separate processes, one process per CPU. Each Ray task handles async_batch_size problems concurrently in an async loop (see the sketch after this list).
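
A minimal sketch of that task shape, assuming Ray and asyncio; rollout_one here is a placeholder for the actual rollout policy call, not the PR's code:

```python
import asyncio
import ray

@ray.remote
def rollout_batch(cfg, llm, problem_batch):
    """One Ray task: roll out a whole batch of problems concurrently in an async loop."""

    async def rollout_one(problem):
        # Placeholder for the real rollout policy call against the given LLM.
        await asyncio.sleep(0)
        return {"problem": problem, "llm_url": llm}

    async def run_all():
        # All problems in this batch run concurrently inside the single Ray task.
        return await asyncio.gather(*(rollout_one(p) for p in problem_batch))

    return asyncio.run(run_all())

# Usage sketch: one Ray task per batch of async_batch_size problems.
# refs = [rollout_batch.remote(cfg, llm, batch) for batch in batches]
```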

Load Balancing

  • Tracks number of tasks assigned per LLM URL
  • Submits tasks to the least busy LLM (see the sketch after this list)
  • Checks capacity constraints per LLM before submission
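
A sketch of the least-busy selection, assuming a simple per-URL counter of in-flight tasks (the names tasks_per_llm and max_tasks_per_llm are illustrative):

```python
from typing import Optional

def pick_llm_url(tasks_per_llm: dict[str, int], max_tasks_per_llm: int) -> Optional[str]:
    """Return the LLM URL with the fewest in-flight tasks, or None if all are at capacity."""
    if not tasks_per_llm:
        return None
    url = min(tasks_per_llm, key=tasks_per_llm.get)
    if tasks_per_llm[url] >= max_tasks_per_llm:
        return None  # every LLM has hit its per-LLM limit
    return url

# On submission: tasks_per_llm[url] += 1; when a task finishes: tasks_per_llm[url] -= 1.
```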

Queue Management

  • Replaces SharedMemoryQueue with in-memory lists, as Ray handles passing results between processes on its own
  • Uses ray.wait() to poll for finished tasks (up to 100 at a time)
  • Groups results by problem ID before returning (see the polling sketch after this list)
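
A rough sketch of the polling step, assuming each Ray task returns a list of results tagged with a problem_id (the exact result structure is an assumption):

```python
from collections import defaultdict
import ray

def poll_finished(pending_refs, timeout_s=0.1):
    """Poll Ray for finished tasks and group their results by problem ID."""
    if not pending_refs:
        return {}, []
    done, pending = ray.wait(
        pending_refs, num_returns=min(100, len(pending_refs)), timeout=timeout_s
    )
    grouped = defaultdict(list)
    for ref in done:
        for result in ray.get(ref):  # each finished task returns a list of rollout results
            grouped[result["problem_id"]].append(result)
    return grouped, pending
```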

Monitoring

  • Logs task latencies, Ray overhead, token throughput, and the number of failed rollouts
  • Reports per-LLM utilization (see the sketch after this list)
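
For illustration, per-LLM utilization could be derived from the same counters used for load balancing; the function and metric names here are assumptions:

```python
def llm_utilization(tasks_per_llm: dict[str, int], max_tasks_per_llm: int) -> dict[str, float]:
    """Fraction of per-LLM capacity currently in use, keyed by LLM URL."""
    return {url: count / max_tasks_per_llm for url, count in tasks_per_llm.items()}

# e.g. {"http://llm-0:8000/v1": 0.8, "http://llm-1:8000/v1": 0.4}  (URLs are illustrative)
```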

Method Overrides

  • start_backend(): Initialize Ray runtime
  • have_capacity(): Check task count + per-LLM limits
  • submit_problem(): Create Ray tasks for each attempt
  • get_new_results(): Poll Ray and return completed groups
  • stop_tasks(): Shutdown Ray
  • Queue-size methods adapted for in-memory list tracking (see the skeleton sketched after this list)
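
Putting the overrides together, a skeleton of the subclass might look roughly like this; the class name RayActorLoop, the stubbed base class, and the cfg fields are assumptions, not the PR's actual code:

```python
import ray

class ActorLoop:  # stub standing in for the existing base class
    def __init__(self, cfg):
        self.cfg = cfg

class RayActorLoop(ActorLoop):
    def start_backend(self):
        # Initialize the Ray runtime with the configured worker count and dashboard.
        ray.init(num_cpus=self.cfg.num_workers, include_dashboard=True)

    def have_capacity(self) -> bool:
        # Total in-flight task count plus per-LLM limits (see load balancing above).
        raise NotImplementedError

    def submit_problem(self, problem):
        # Create one Ray task per attempt for this problem.
        raise NotImplementedError

    def get_new_results(self):
        # Poll Ray with ray.wait() and return completed, problem-grouped results.
        raise NotImplementedError

    def stop_tasks(self):
        ray.shutdown()
```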

Configuration

Enabled via cfg.use_ray=true in the config. Selected automatically in run_actor_loop(), as sketched below.
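
A hypothetical sketch of that selection; the non-Ray class name MultiprocessActorLoop and the run()/stop flow are assumptions, and RayActorLoop refers to the skeleton above:

```python
def run_actor_loop(cfg):
    # Pick the Ray-based loop when cfg.use_ray is set, otherwise the original implementation.
    loop = RayActorLoop(cfg) if cfg.use_ray else MultiprocessActorLoop(cfg)
    loop.start_backend()
    try:
        loop.run()
    finally:
        loop.stop_tasks()
```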

MCP Server config

  • The server startup command was replaced with a shorter one that expects the mcp-run-python module to already be installed. Skipping the installation at startup speeds up the actor loop significantly, since this startup happens once per task.

ollmer commented Oct 14, 2025

We need to investigate these; probably deno instances are dying. This is a multi-node full training run.

Should not be a huge issue, as we spawn a whole new deno instance for every new task, so a single failure will result in only one failed trace. But we should definitely monitor the number of such cases when running the actor.

ollmer commented Oct 14, 2025

I've added a rollout_errors field to the metrics sent to wandb to monitor the number of such errors.
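
A minimal sketch of that metric, assuming the metrics are logged to wandb as a flat dict (names other than rollout_errors are illustrative):

```python
import wandb

def log_actor_metrics(step: int, num_finished: int, rollout_errors: int) -> None:
    # rollout_errors counts traces that failed, e.g. because a deno/MCP subprocess died.
    wandb.log(
        {
            "actor/finished_rollouts": num_finished,
            "actor/rollout_errors": rollout_errors,
        },
        step=step,
    )
```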

)
llm = self.llms_by_url[llm_url]
task_ref = self.ray_remote.remote(self.cfg_dict, llm, problem_batch, self.problem_id)
time.sleep(1.0) # TODO: remove this
Collaborator:

Let's not forget to remove this. It's capping us at 1 batch/sec per LLM I guess?

ollmer (Author):

No, it's to spread out task submissions to the workers over time. We submit 1 task per second, up to 255 with the current configuration, and then they all run in parallel. Assuming an average task latency of ~100 sec, we are effectively running around 100 tasks in parallel even with this slowdown.
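
For intuition, the back-of-the-envelope version of that argument, using the numbers from the comment above:

```python
submission_rate = 1.0      # tasks submitted per second (one per time.sleep(1.0))
avg_task_latency = 100.0   # rough average task latency in seconds
max_in_flight = 255        # cap under the current configuration

# Little's law: steady-state concurrency ≈ arrival rate × latency, capped by the limit.
effective_parallelism = min(submission_rate * avg_task_latency, max_in_flight)
print(effective_parallelism)  # -> 100.0
```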
