---
title: Api Tool Planner Environment Server
emoji: 🧠
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
---
Submission by Aman Agrawal | Meta PyTorch x Scaler Hackathon — Round 1
Every day, SREs get paged at 3 AM because a service is on fire. Data engineers discover their pipeline dumped garbage into production. Support agents fumble through five different tools trying to find a customer's ticket. Security analysts race to contain a breach before credentials leak.
In all these scenarios, the human knows what needs to happen — but the execution requires calling the right APIs, in the right order, with the right parameters. Miss one step and the incident escalates. Call the wrong endpoint and you make things worse.
What if an LLM agent could handle this?
That's the premise of the API Tool Planner. It's an RL environment that trains and evaluates LLM agents on the single most important skill they need in production: choosing the correct API tool and parameters from a catalogue of plausible alternatives, chaining multiple calls together, and extracting information from one response to feed into the next.
This isn't a toy environment. Every task is modelled after a real operational scenario that costs companies real money when it goes wrong.
An OpenEnv-compliant RL environment with 15 tasks across 5 real-world operational domains, each with dense per-step rewards, distractor tools, and progressively harder challenges.
| Domain | What the agent does | Why it matters |
|---|---|---|
| Incident Response | Diagnose service outages, scale infrastructure, rollback bad deploys | A 10-minute delay in incident response can cost $10K+ |
| Data Pipeline Ops | Investigate failed ETL runs, retry from failed stages, pause pipelines | Bad data flowing downstream is worse than no data |
| Customer Support | Look up accounts, apply refunds, escalate tickets | Customer churn from bad support costs 5x acquisition |
| Security Ops | Triage alerts, quarantine compromised hosts, revoke credentials | Every minute a breach goes uncontained, exposure grows |
| Cloud Infrastructure | Provision VMs, set up DNS, configure load balancers | Misconfigurations cause ~70% of cloud outages |
| Difficulty | Steps | What makes it hard |
|---|---|---|
| Easy (5 tasks) | 1 call | Pick the right tool from 8 options including distractors |
| Medium (5 tasks) | 2 calls | Correct ordering matters — scale then open incident, not the reverse |
| Hard (5 tasks) | 3 calls | ID chaining — extract IDs from API responses to use in subsequent calls. Plus distractor traps. |
"A customer with email jane.doe@example.com has called back three times about an unresolved issue. Find their account, get their open tickets, and escalate the most recent one to Tier 2."
The agent must:
- Call `lookup_customer(email="jane.doe@example.com")` → response contains `customer_id: "CUST-5512"`
- Extract `CUST-5512` and call `get_customer_tickets(customer_id="CUST-5512", status="open")` → response contains `ticket_id: "TKT-8801"`
- Extract `TKT-8801` and call `escalate_ticket(ticket_id="TKT-8801", target_tier="Tier 2", reason="...")`
The agent can't know these IDs in advance. It must read API responses and chain them. Meanwhile, 5 other tools in the catalogue (like `merge_customers` or `close_ticket`) look tempting but are wrong.
Most environments give binary rewards: 1 for correct, 0 for wrong. That's terrible for learning. My environment gives granular partial credit at every step:
| Component | Weight | What it rewards |
|---|---|---|
| Correct tool name | 35% | Picked rollback_service instead of restart_service |
| Required params present | 35% | Included service_name and target_version |
| Parameter values accurate | 30% | Values match what the task specified |
| Efficiency bonus | +5% | Completed the task without wasting steps |
An agent that picks the right tool but forgets a parameter gets ~0.70. One that picks the right tool with right params but wrong values gets ~0.85. This gradient is what makes RL training possible.
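To make that gradient concrete, here is a minimal sketch of per-step grading under the 35/35/30 weights. It is illustrative only: the helper names (`grade_step`, `value_score`) are mine, and the real logic lives in `server/api_tool_planner_environment.py`.

```python
# Illustrative sketch of the per-step partial-credit scheme described above.
# Helper names are hypothetical; the environment's actual implementation may differ.

def value_score(expected, actual) -> float:
    # Placeholder exact-match check; the real grader is fuzzier
    # (token overlap, set comparison, numeric tolerance -- described below).
    return 1.0 if expected == actual else 0.0

def grade_step(expected: dict, actual: dict) -> float:
    tool_match = 1.0 if actual.get("tool_name") == expected["tool_name"] else 0.0

    exp_params = expected.get("params", {})
    act_params = actual.get("params", {})

    # Fraction of required parameters the agent supplied at all.
    present = [p for p in exp_params if p in act_params]
    params_present = len(present) / len(exp_params) if exp_params else 1.0

    # Fraction of required parameters whose values look right.
    if exp_params:
        param_values = sum(value_score(exp_params[p], act_params[p]) for p in present) / len(exp_params)
    else:
        param_values = 1.0

    return 0.35 * tool_match + 0.35 * params_present + 0.30 * param_values
```

With these weights, a correct tool call that supplies only one of two required parameters scores roughly 0.35 + 0.175 + 0.15 ≈ 0.70, matching the figures above.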
Three stochastic features prevent agents from gaming the environment:
- Tool catalogue shuffling — The 8 tools appear in random order every episode. No position-based shortcuts.
- Response jitter — Numeric fields (CPU %, latency, etc.) vary ±5% each episode. Same task, slightly different numbers.
- ID preservation — IDs needed for chaining (`CUST-5512`, `TKT-8801`) are never jittered. The agent must still read them from responses.
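A rough sketch of the jitter rule, assuming simulated responses are plain dicts. The function name and recursion are illustrative; the real implementation is in `server/api_tool_planner_environment.py`.

```python
import random

def jitter_response(response: dict, pct: float = 0.05) -> dict:
    """Illustrative: vary numeric fields by +/-pct, leave strings (including IDs) untouched."""
    out = {}
    for key, value in response.items():
        if isinstance(value, bool):
            out[key] = value                      # booleans are not treated as numbers
        elif isinstance(value, (int, float)):
            out[key] = type(value)(value * (1.0 + random.uniform(-pct, pct)))
        elif isinstance(value, dict):
            out[key] = jitter_response(value, pct)
        else:
            out[key] = value                      # "CUST-5512" and other strings preserved verbatim
    return out

# Catalogue shuffling: a new tool order every episode (tool names here are examples).
catalogue = ["lookup_customer", "get_customer_tickets", "merge_customers", "close_ticket"]
shuffled = random.sample(catalogue, k=len(catalogue))
```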
For free-form text parameters like `reason` or `description`, exact string matching is too strict. My grader uses Jaccard token overlap:
Expected: "checkout-service degraded, scaled to 5 replicas"
Agent said: "scaled checkout-service to 5 replicas due to degradation"
Token overlap: 0.72 → Score: 0.86 (not zero!)
For array parameters (like load balancer targets), the grader uses set comparison — order doesn't matter:
Expected: ["staging-api-01", "prod-api-01"]
Agent said: ["prod-api-01", "staging-api-01"]
Score: 1.0 internally (perfect — same set)
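In miniature, the two comparisons look like this. The tokenisation is my own illustration: with this simple tokeniser the Jaccard value for the example above comes out near 0.62, while the grader itself reports 0.72, so the exact tokeniser and score mapping differ from this sketch.

```python
# The two fuzzy comparisons above, in miniature (tokenisation is illustrative).
expected = {"checkout-service", "degraded", "scaled", "to", "5", "replicas"}
actual   = {"scaled", "checkout-service", "to", "5", "replicas", "due", "degradation"}
overlap = len(expected & actual) / len(expected | actual)   # ~0.62 here; the grader reports 0.72

# Array parameters: order-independent set equality
assert set(["staging-api-01", "prod-api-01"]) == set(["prod-api-01", "staging-api-01"])
```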
Each domain has carefully designed distractors:
| Domain | Distractor | Why it's tempting | Why it's wrong |
|---|---|---|---|
| Incident | `restart_service` | Sounds like it fixes things | Restarts the same broken version |
| Pipeline | `delete_pipeline_run` | Removes the failed run | Destructive and irreversible |
| Security | `block_ip` | Blocks the attacker's IP | Doesn't contain the compromised host |
| Cloud | `resize_vm` | Changes a VM's resources | Resizes the existing VM rather than creating a new one |
The environment gracefully handles malformed inputs with specific, helpful feedback:
| Input | Response |
|---|---|
| Empty tool name `""` | "Empty tool name provided. Choose a tool from the catalogue." |
| Unknown tool `"foobar"` | "Unknown tool 'foobar'. Not in the available catalogue." |
| Wrong tool (but valid) | "Wrong tool 'restart_service'. Expected a different API call." |
| Step without reset | "No active episode. Call reset() first." |
inference.py (Agent script — Oracle or LLM)
│
├── models.py (Pydantic: Action + Observation contracts)
├── server/tasks.py (15 tasks, 5 tool catalogues, simulated responses)
└── server/api_tool_planner_environment.py (Core RL logic: reset, step, grade, jitter)
│
└── server/app.py (FastAPI: /reset /step /tasks /grader /baseline)
│
└── Dockerfile + openenv.yaml (Deployment)
| File | Lines | Purpose |
|---|---|---|
| `server/tasks.py` | 288 | 15 task definitions across 5 domains with tool catalogues and simulated responses |
| `server/api_tool_planner_environment.py` | 335 | Core environment: reset, step, reward calculation, value comparison, response jitter |
| `inference.py` | 276 | Oracle + LLM agents, strict [START]/[STEP]/[END] stdout format |
| `server/app.py` | 134 | FastAPI server with custom /tasks, /grader, /baseline, /health endpoints |
| `models.py` | 50 | Typed Pydantic models for Action (tool_name + params) and Observation (11 fields) |
| `client.py` | 79 | WebSocket client for remote environment interaction |
| `tests/test_environment.py` | 415 | 46 unit tests — all passing |
$ uv run python -m pytest tests/test_environment.py -v
============================= 46 passed in 1.66s ==============================
Tests cover: all 15 tasks (perfect scores), reward mechanics (partial credit, efficiency bonus, token overlap, list comparison), stochastic features (shuffling, jitter, ID preservation), edge cases (empty tools, unknown tools, no-reset errors), and episode lifecycle.
$ uv run python inference.py --oracle
[START] task=incident_easy env=api_tool_planner model=oracle
[STEP] step=1 action=call(get_service_metrics,...) reward=0.99 done=true error=null
[END] success=true steps=1 score=0.99 rewards=0.99
...
============================================================
Agent: oracle
Tasks: 15 | Average score: 0.9900
============================================================
$ uv run openenv validate
[OK] api_tool_planner: Ready for multi-mode deployment
# Install dependencies (include dev for tests, inference for OpenAI client)
cd api_tool_planner
uv sync --extra dev --extra inference
# Run tests
uv run python -m pytest tests/test_environment.py -v
# Run oracle baseline (no API key needed)
uv run python inference.py --oracle
# Run LLM agent
export HF_TOKEN=hf_...
uv run python inference.py
# Start the server
uv run server --host 0.0.0.0 --port 8000
# Then open http://localhost:8000/docs for Swagger UI
# Validate
uv run openenv validate
# Docker
docker build -t api-tool-planner .
docker run -p 8000:8000 api-tool-planner

| Endpoint | Method | Description |
|---|---|---|
| `/reset` | POST | Start a new episode |
| `/step` | POST | Execute a tool-call action |
| `/state` | GET | Current episode metadata |
| `/tasks` | GET | List all 15 tasks |
| `/grader` | POST | Score a single action (stateless) |
| `/baseline` | GET | Run the oracle across all tasks |
| `/health` | GET | Health check |
| `/ws` | WS | WebSocket for stateful sessions |
| `/docs` | GET | Auto-generated Swagger UI |
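Outside of `client.py`, the two core endpoints can also be driven directly over HTTP. The JSON shapes below are assumptions for illustration; the authoritative request and response schemas are on the `/docs` Swagger page.

```python
# Illustrative only: drive /reset and /step over HTTP with `requests`.
# The exact JSON payload shapes are assumptions; check /docs for the real schemas.
import requests

BASE = "http://localhost:8000"

obs = requests.post(f"{BASE}/reset", json={}).json()     # start an episode
print(obs)                                                # task prompt, tool catalogue, etc.

action = {"tool_name": "lookup_customer",
          "params": {"email": "jane.doe@example.com"}}
obs = requests.post(f"{BASE}/step", json=action).json()  # execute one tool call
print(obs)                                                # reward, done flag, simulated API response
```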
step_reward = (tool_match × 0.35 + params_present × 0.35 + param_values × 0.30) / num_expected_calls
- Total episode reward: strictly within (0, 1) — never exactly 0.0 or 1.0
- Efficiency bonus: +0.05 if completed within `num_calls + 1` steps
- Cumulative reward: capped at 0.99
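Put together, the episode-level bookkeeping might look roughly like this (a hedged sketch; the function and variable names are mine, and the floor that keeps rewards strictly above 0.0 is not shown):

```python
# Illustrative aggregation of per-step scores into an episode reward.
def episode_reward(step_scores: list[float], num_expected_calls: int, steps_taken: int) -> float:
    # Each step's weighted 35/35/30 score is divided by the number of expected calls,
    # so a perfect multi-call episode sums to about 1.0 before the cap.
    total = sum(score / num_expected_calls for score in step_scores)
    if steps_taken <= num_expected_calls + 1:    # efficiency bonus
        total += 0.05
    return min(total, 0.99)                      # cumulative reward capped at 0.99
```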
Parameter value scoring:
- Exact match → 1.0
- Token overlap ≥ 50% → 0.75–1.0
- Substring match → 0.5
- Numeric within ±10% → 0.5
- List/set match (order-independent) → 1.0
- No match → 0.0
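A sketch of how these tiers could combine into a single dispatcher. The thresholds mirror the list above, and the overlap-to-score mapping (0.5 + 0.5 × overlap) is an assumption that happens to land in the stated 0.75–1.0 range; the grader's actual code may be structured differently.

```python
import re

def _tokens(text: str) -> set[str]:
    """Lower-cased word tokens with punctuation stripped (illustrative tokeniser)."""
    return set(re.findall(r"[a-z0-9\-]+", str(text).lower()))

def value_score(expected, actual) -> float:
    """Illustrative tiered comparison for a single parameter value."""
    if expected == actual:
        return 1.0                                            # exact match
    if isinstance(expected, (list, tuple)) and isinstance(actual, (list, tuple)):
        return 1.0 if set(expected) == set(actual) else 0.0   # order-independent lists
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return 0.5 if abs(actual - expected) <= 0.10 * abs(expected) else 0.0  # numeric +/-10%
    if isinstance(expected, str) and isinstance(actual, str):
        a, b = _tokens(expected), _tokens(actual)
        overlap = len(a & b) / len(a | b) if a | b else 0.0
        if overlap >= 0.5:
            return 0.5 + 0.5 * overlap                        # maps 0.5-1.0 overlap onto 0.75-1.0
        if expected.lower() in actual.lower() or actual.lower() in expected.lower():
            return 0.5                                        # substring match
    return 0.0
```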
api_tool_planner/
├── README.md # This file
├── STUDENT_README.md # Detailed technical documentation
├── learn.md # Learning roadmap (prerequisites + codebase walkthrough)
├── openenv.yaml # OpenEnv spec configuration
├── pyproject.toml # Dependencies and entry points
├── Dockerfile # Multi-stage Docker build
├── inference.py # Mandatory inference script (Oracle + LLM)
├── models.py # Pydantic Action/Observation models
├── client.py # WebSocket client
├── __init__.py # Package exports
├── server/
│ ├── __init__.py
│ ├── api_tool_planner_environment.py # Core environment (reset/step/grade)
│ ├── tasks.py # 15 task definitions across 5 domains
│ └── app.py # FastAPI application + custom endpoints
├── tests/
│ ├── __init__.py
│ └── test_environment.py # 46 unit tests
└── scripts/
└── validate-submission.sh # Pre-submission validator
Built with OpenEnv by Meta PyTorch. Submitted for the Scaler School of Technology Hackathon, Round 1.