In this guide, we'll focus on **how to integrate OpenEnv with TRL**, but feel free to explore the [OpenEnv repository](https://github.com/meta-pytorch/OpenEnv) to learn more about the framework itself.
To use OpenEnv with TRL, install the framework:

```bash
pip install git+https://github.com/meta-pytorch/OpenEnv.git
```

## Using `rollout_func` with OpenEnv environments
By using OpenEnv in this loop, you can:
* Plug in custom simulators, web APIs, or evaluators as environments.
* Pass structured reward signals back into RL training seamlessly, as sketched below.
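
To make the second point concrete, here is a schematic sketch of how environment rewards can flow back into training: each completion is sent to the environment with `step()`, and a small reward function hands the collected scores to the trainer. This is only an illustration, not the library's exact API surface: the `EchoAction` import path, its `message` field, and the `env_reward` column name are assumptions; the complete, working version is the `echo.py` example discussed below.

```python
# Schematic sketch: collect rewards from an OpenEnv environment and expose them
# to the trainer through a reward function. `EchoAction`, its `message` field,
# and the "env_reward" column name are illustrative assumptions.
from envs.echo_env import EchoAction  # assumed import path

def score_with_env(client, completions_text):
    """Send each completion to the environment and collect its reward."""
    client.reset()  # start a fresh episode before scoring
    rewards = []
    for text in completions_text:
        result = client.step(EchoAction(message=text))  # one environment step per completion
        rewards.append(result.reward)
    return rewards

def reward_from_env(completions, **kwargs):
    # Extra columns returned by a custom rollout (e.g. "env_reward") are
    # forwarded to reward functions as keyword arguments, so the trainer
    # can reuse the environment's scores directly.
    return kwargs["env_reward"]
```

In the full example below, the environment interaction happens inside a custom `rollout_func`, and a reward function along these lines is passed to `GRPOTrainer`.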

## Running the Environments

You can run OpenEnv environments in three different ways:

1. **Local Docker container** *(recommended)*

To start a Docker container:
* Open the environment on the Hugging Face Hub.
* Click the **⋮ (three dots)** menu.
* Select **“Run locally.”**
* Copy and execute the provided command in your terminal.

Example:
```bash
docker run -d -p 8001:8001 registry.hf.space/openenv-echo-env:latest
```
![open_env_launch_docker](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/open_env_launch_docker.png)
2. **Local Python process**: Launch the environment directly using Uvicorn.
You can start the server manually as a local process. Note that this command must be run from inside a clone of the OpenEnv repository so that the `envs` package can be found. For more details about the available environments, refer to the [OpenEnv repository](https://github.com/meta-pytorch/OpenEnv/tree/main/src/envs).
```bash
python -m uvicorn envs.echo_env.server.app:app --host 0.0.0.0 --port 8001
```
3. **Hugging Face Spaces**: Connect to a hosted environment running on the Hugging Face Hub.
To find the connection URL, open the Space page, click the **⋮ (three dots)** menu, and select **“Embed this Space.”**
You can then use that URL to connect directly from your client, as in the sketch below.
Keep in mind that public Spaces may have rate limits or temporarily go offline if inactive.
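
However you launch the environment (Docker container, local Uvicorn process, or a hosted Space), the client connects the same way: point it at the server's URL. Below is a minimal connection sketch using the Echo environment; the `from envs.echo_env import EchoEnv, EchoAction` import path, the placeholder Space URL, and the exact response fields are assumptions, so substitute the environment and URL you are actually using.

```python
# Minimal connection sketch (import path, URLs, and response fields are assumptions).
from envs.echo_env import EchoEnv, EchoAction

# Options 1 and 2: a local Docker container or Uvicorn process listening on port 8001
client = EchoEnv(base_url="http://0.0.0.0:8001")

# Option 3: a hosted Hugging Face Space; use the URL from "Embed this Space", e.g.
# client = EchoEnv(base_url="https://<user>-<space-name>.hf.space")

result = client.reset()                             # start an episode
result = client.step(EchoAction(message="hello"))   # one environment step
print(result.reward)                                # reward returned by the environment
```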

## A simple example

The [echo.py](https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/echo.py) script demonstrates a minimal, end-to-end integration between TRL and OpenEnv. In this example, the Echo environment rewards completions based on their text length, encouraging the model to generate longer outputs. This pattern can be extended to any custom environment that provides structured feedback or task-based rewards:
```python
from trl import GRPOConfig, GRPOTrainer

# Create HTTP client for Echo Environment
client = EchoEnv.from_docker_image("echo-env:latest")
"""
Alternatively, you can start the environment manually with Docker and connect to it:

# Step 1: Start the Echo environment
docker run -d -p 8001:8001 registry.hf.space/openenv-echo-env:latest

# Step 2: Connect the client to the running container
client = EchoEnv(base_url="http://0.0.0.0:8001")
"""

def rollout_func(prompts, args, processing_class):
# 1. Generate completions via vLLM inference server (running on port 8000)
    ...
```

### Running the Example

The example requires two GPUs:

```bash
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct --host 0.0.0.0 --port 8000
CUDA_VISIBLE_DEVICES=1 python examples/scripts/openenv/echo.py
```

Alternatively, you can manually start the Echo environment in a Docker container before running the training:

```bash
# Launch the Echo environment
docker run -d -p 8001:8001 registry.hf.space/openenv-echo-env:latest
```

Then, initialize the client using:

`client = EchoEnv(base_url="http://0.0.0.0:8001")`

instead of:

`client = EchoEnv.from_docker_image("echo-env:latest")`.

Below is the reward curve from training:

<iframe src="https://trl-lib-trackio.hf.space?project=openenv&metrics=train/rewards/reward_from_env/mean&runs=qgallouedec-1761202871&sidebar=hidden&navbar=hidden" style="width:600px; height:500px; border:0;"></iframe>
```python
trainer = GRPOTrainer(
    ...
)
trainer.train()
```

### Running the Advanced Example

The example requires two GPUs:

```bash
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen3-1.7B --host 0.0.0.0 --port 8000
CUDA_VISIBLE_DEVICES=1 python examples/scripts/openenv/wordle.py
```

Again, you can manually start the TextArena environment in a Docker container before running the training:

```bash
# Launch the TextArena environment
docker run -d -p 8001:8001 registry.hf.space/burtenshaw-textarena:latest
```

In this case, initialize the client with `client = TextArenaEnv(base_url="http://0.0.0.0:8001")` instead of `client = TextArenaEnv.from_docker_image("registry.hf.space/burtenshaw-textarena:latest")`.

### Results

The resulting model improves its performance on the game, both by reducing the number of repetitions and by increasing the number of correct guesses. However, the Qwen3-1.7B model we trained is not able to consistently win the game. The following reward curve shows the coverage of the model's guesses and the coverage of correct Y and G letters.