Hello Team,
I am struggling to understand the complete RL pipeline in the personal-agent setting in OpenClaw.
As I understand it, OpenClaw is generally powered by a single backend model, typically a frontier model such as Claude Opus, GPT-4, or DeepSeek, to achieve high performance.
In the paper, you train a policy model (~Qwen8b) to adapt to user tasks. Are you actually optimizing the backend LLM that powers OpenClaw, or is this an additional model that sits alongside the backend frontier model (i.e., two models)? I believe it is the former.
Could you please clarify how many LLMs are involved during OpenClaw interactions? Is it a single policy LLM (the one being trained) powering OpenClaw, with an additional model used only for the model-based reward?
Please correct me if I am wrong.
Thanks!