Clarification about the pipeline #107

@kknakka2026-cmd

Description

Hello Team,

I am struggling to understand the complete RL pipeline for the personal-agent setting in OpenClaw.

Generally, OpenClaw is powered by a single backend model; in my understanding, this is typically a frontier model such as Claude Opus, GPT-4, or DeepSeek, chosen for high performance.

In the paper, you train a policy model (~Qwen8b) to adapt to user tasks. Are you optimizing the backend LLM that powers OpenClaw itself, or is this an additional model that sits alongside a backend frontier model (i.e., two models in total)? My reading is that it is the former.

Could you please clarify how many LLMs are involved during OpenClaw interactions? Is it a single policy LLM (the one being trained) that powers OpenClaw, with an additional model used only as a model-based reward?
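To make my mental model concrete, here is a toy Python sketch of the setup I have in mind (all class names, the fixed reward value, and the update rule are my own invention, not from the paper): a single policy LLM both serves the agent and is the model being updated, while a separate reward model only scores outputs during training and is never deployed.

```python
class PolicyLLM:
    """Stand-in for the policy model (~Qwen8b) that powers the agent."""

    def __init__(self):
        self.weights = 0.0  # toy stand-in for the model parameters

    def act(self, task):
        # In the real system this would be a full agent rollout in OpenClaw.
        return f"response to {task!r}"


class RewardModel:
    """Stand-in for the model-based reward; used only during training."""

    def score(self, task, response):
        return 0.5  # fixed toy reward; a real reward model would judge the response


def rl_step(policy, reward_model, task, lr=0.1):
    """One toy training step: roll out, score, update the policy."""
    response = policy.act(task)
    reward = reward_model.score(task, response)
    policy.weights += lr * reward  # toy "policy update" driven by the reward
    return reward


policy = PolicyLLM()
rm = RewardModel()
rewards = [rl_step(policy, rm, "organize my inbox") for _ in range(3)]
```

Under this reading there are two LLMs at training time (policy + reward) but only the policy LLM at interaction time.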

Please correct me if I am wrong.

Thanks!
