Ray actor loop with local envs #81
base: main
Conversation
Should not be a huge issue, since we spawn a whole new deno instance for every new task, so a single failure will result in only one failed trace. But we should definitely monitor the number of such cases when running the actor.
I've added field
)
llm = self.llms_by_url[llm_url]
task_ref = self.ray_remote.remote(self.cfg_dict, llm, problem_batch, self.problem_id)
time.sleep(1.0)  # TODO: remove this
Let's not forget to remove this. It's capping us at 1 batch/sec per LLM I guess?
No, it's there to spread task submissions to workers out over time. With the current configuration we submit one task per second, up to 255 tasks, and then they all run in parallel. Assuming an average task latency of ~100 seconds, we are effectively running around 100 tasks in parallel even with this slowdown.
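A minimal sketch of the submission pattern described above, using a placeholder `run_task` function; the task body, counts, and timings are illustrative assumptions, not the PR's actual code.

```python
import time

import ray

ray.init(ignore_reinit_error=True)


@ray.remote
def run_task(problem_id: int) -> int:
    # Stand-in for a real rollout; in practice a task takes ~100 s.
    time.sleep(2)
    return problem_id


# .remote() returns immediately, so the sleep only staggers submission:
# one task per second (up to 255 with the current configuration), after
# which all submitted tasks run in parallel.
refs = []
for i in range(10):  # 255 in the actual configuration
    refs.append(run_task.remote(i))
    time.sleep(1.0)  # spread submissions out in time

print(ray.get(refs))
```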
Ray-based implementation of `ActorLoop` that replaces multiprocessing and in-memory queues.

Task Execution

- `ray.remote()` instead of `multiprocessing.Process`
- Processes `async_batch_size` problems in an async loop simultaneously (see the sketch below)
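A minimal sketch of this execution model, assuming hypothetical `solve_problem` and `run_batch` names; the PR's actual worker code and signatures will differ.

```python
import asyncio

import ray


async def solve_problem(cfg_dict: dict, llm_url: str, problem: dict) -> dict:
    # Stand-in for a single rollout against one LLM endpoint.
    await asyncio.sleep(0.1)
    return {"problem": problem, "llm": llm_url}


@ray.remote
def run_batch(cfg_dict: dict, llm_url: str, problem_batch: list) -> list:
    # One Ray task takes the place of one multiprocessing.Process;
    # inside it, up to async_batch_size problems are awaited concurrently.
    async def _run() -> list:
        return await asyncio.gather(
            *(solve_problem(cfg_dict, llm_url, p) for p in problem_batch)
        )

    return asyncio.run(_run())
```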
Load Balancing

Queue Management
- Replaces `SharedMemoryQueue` with in-memory lists, as Ray handles passing results between processes on its own
- `ray.wait()` to poll for finished tasks (up to 100 at a time); see the sketch below
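A small sketch of the polling pattern described above; the helper name and list bookkeeping are illustrative, not the PR's exact code.

```python
import ray


def poll_finished(pending: list) -> tuple:
    # Plain in-memory lists replace SharedMemoryQueue: Ray already moves
    # results between processes, so we only keep object refs locally and
    # ask ray.wait() which of them (up to 100 per call) have completed.
    if not pending:
        return [], pending
    done, still_pending = ray.wait(
        pending, num_returns=min(100, len(pending)), timeout=0
    )
    return [ray.get(ref) for ref in done], still_pending
```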
Monitoring

Method Overrides
- `start_backend()`: Initialize Ray runtime
- `have_capacity()`: Check task count + per-LLM limits
- `submit_problem()`: Create Ray tasks for each attempt
- `get_new_results()`: Poll Ray and return completed groups
- `stop_tasks()`: Shutdown Ray
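A rough sketch of how these overrides could fit together, assuming a simplified base interface; the class name, attributes, and signatures are assumptions, not the PR's actual implementation, and the real version groups results per problem.

```python
import ray


class RayActorLoop:
    """Illustrative sketch of the Ray-backed loop; assumed to override ActorLoop."""

    def __init__(self, cfg_dict: dict, llms_by_url: dict, max_tasks_per_llm: int = 255):
        self.cfg_dict = cfg_dict
        self.llms_by_url = llms_by_url
        self.max_tasks_per_llm = max_tasks_per_llm
        self.pending = []        # in-flight Ray object refs
        self.llm_by_ref = {}     # object ref -> LLM URL, for capacity accounting
        self.tasks_per_llm = {url: 0 for url in llms_by_url}

    def start_backend(self) -> None:
        # Initialize the Ray runtime instead of forking worker processes.
        ray.init(ignore_reinit_error=True)

    def have_capacity(self, llm_url: str) -> bool:
        # Check the overall in-flight count plus the per-LLM limit.
        total_limit = self.max_tasks_per_llm * len(self.llms_by_url)
        return (len(self.pending) < total_limit
                and self.tasks_per_llm[llm_url] < self.max_tasks_per_llm)

    def submit_problem(self, remote_fn, llm_url: str, problem_batch: list, problem_id) -> None:
        # Create a Ray task for the attempt and remember its object ref.
        llm = self.llms_by_url[llm_url]
        ref = remote_fn.remote(self.cfg_dict, llm, problem_batch, problem_id)
        self.pending.append(ref)
        self.llm_by_ref[ref] = llm_url
        self.tasks_per_llm[llm_url] += 1

    def get_new_results(self) -> list:
        # Poll Ray for completed tasks and return their results.
        if not self.pending:
            return []
        done, self.pending = ray.wait(
            self.pending, num_returns=min(100, len(self.pending)), timeout=0
        )
        for ref in done:
            self.tasks_per_llm[self.llm_by_ref.pop(ref)] -= 1
        return [ray.get(ref) for ref in done]

    def stop_tasks(self) -> None:
        # Cancel anything still in flight and shut down Ray.
        for ref in self.pending:
            ray.cancel(ref, force=True)
        self.pending.clear()
        ray.shutdown()
```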
Configuration

Enabled via `cfg.use_ray=true` in config. Selected automatically in `run_actor_loop()`.

MCP Server config