Hello Team,
I am struggling to understand the complete RL pipeline in the personal-agent setting in OpenClaw.
As I understand it, OpenClaw is generally powered by a single backend model, typically a frontier model such as Claude Opus, GPT-4, or DeepSeek, to achieve high performance.
In the paper, you train a policy model (~Qwen8b) to adapt to user tasks. Are you actually optimizing the backend LLM that powers OpenClaw, or is this an additional model that sits alongside the backend frontier model (i.e., two models)? I believe it is the former.
Could you please clarify how many LLMs are involved during OpenClaw interactions? Is it a single policy LLM (the one being trained) powering OpenClaw, with an additional model used only for the model-based reward?
Please correct me if I am wrong.
Thanks!