---
title: Api Tool Planner Environment Server
emoji: 🧠
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
---
Submission by Aman Agrawal | Meta PyTorch x Scaler Hackathon — Round 1
Every day, SREs get paged at 3 AM because a service is on fire. Data engineers discover their pipeline dumped garbage into production. Support agents fumble through five different tools trying to find a customer's ticket. Security analysts race to contain a breach before credentials leak.
In all these scenarios, the human knows what needs to happen — but the execution requires calling the right APIs, in the right order, with the right parameters. Miss one step and the incident escalates. Call the wrong endpoint and you make things worse.
What if an LLM agent could handle this?
That's the premise of the API Tool Planner. It's an RL environment that trains and evaluates LLM agents on the single most important skill they need in production: choosing the correct API tool and parameters from a catalogue of plausible alternatives, chaining multiple calls together, and extracting information from one response to feed into the next.
This isn't a toy environment. Every task is modelled after a real operational scenario that costs companies real money when it goes wrong.
An OpenEnv-compliant RL environment with 15 tasks across 5 real-world operational domains, each with dense per-step rewards, distractor tools, and progressively harder challenges.
| Domain | What the agent does | Why it matters |
|---|---|---|
| Incident Response | Diagnose service outages, scale infrastructure, rollback bad deploys | A 10-minute delay in incident response can cost $10K+ |
| Data Pipeline Ops | Investigate failed ETL runs, retry from failed stages, pause pipelines | Bad data flowing downstream is worse than no data |
| Customer Support | Look up accounts, apply refunds, escalate tickets | Customer churn from bad support costs 5x acquisition |
| Security Ops | Triage alerts, quarantine compromised hosts, revoke credentials | Every minute a breach goes uncontained, exposure grows |
| Cloud Infrastructure | Provision VMs, set up DNS, configure load balancers | Misconfigurations cause ~70% of cloud outages |
| Difficulty | Steps | What makes it hard |
|---|---|---|
| Easy (5 tasks) | 1 call | Pick the right tool from 8 options including distractors |
| Medium (5 tasks) | 2 calls | Correct ordering matters — scale then open incident, not the reverse |
| Hard (5 tasks) | 3 calls | ID chaining — extract IDs from API responses to use in subsequent calls. Plus distractor traps. |
"A customer with email jane.doe@example.com has called back three times about an unresolved issue. Find their account, get their open tickets, and escalate the most recent one to Tier 2."
The agent must:
- Call `lookup_customer(email="jane.doe@example.com")` → response contains `customer_id: "CUST-5512"`
- Extract `CUST-5512` and call `get_customer_tickets(customer_id="CUST-5512", status="open")` → response contains `ticket_id: "TKT-8801"`
- Extract `TKT-8801` and call `escalate_ticket(ticket_id="TKT-8801", target_tier="Tier 2", reason="...")`
The agent can't know these IDs in advance. It must read API responses and chain them. Meanwhile, 5 other tools in the catalogue (like `merge_customers` or `close_ticket`) look tempting but are wrong.
Most environments give binary rewards: 1 for correct, 0 for wrong. That's terrible for learning. My environment gives granular partial credit at every step:
| Component | Weight | What it rewards |
|---|---|---|
| Correct tool name | 35% | Picked rollback_service instead of restart_service |
| Required params present | 35% | Included service_name and target_version |
| Parameter values accurate | 30% | Values match what the task specified |
| Efficiency bonus | +5% | Completed the task without wasting steps |
An agent that picks the right tool but forgets a parameter gets ~0.70. One that picks the right tool with right params but wrong values gets ~0.85. This gradient is what makes RL training possible.
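To make that gradient concrete, here is a minimal sketch of per-step grading under the 35/35/30 weights. It is illustrative only: the helper names (`grade_step`, `value_score`) are mine, and the real logic lives in `server/api_tool_planner_environment.py`.

```python
# Illustrative sketch of the per-step partial-credit scheme described above.
# Helper names are hypothetical; the environment's actual implementation may differ.

def value_score(expected, actual) -> float:
    # Placeholder exact-match check; the real grader is fuzzier
    # (token overlap, set comparison, numeric tolerance -- described below).
    return 1.0 if expected == actual else 0.0

def grade_step(expected: dict, actual: dict) -> float:
    tool_match = 1.0 if actual.get("tool_name") == expected["tool_name"] else 0.0

    exp_params = expected.get("params", {})
    act_params = actual.get("params", {})

    # Fraction of required parameters the agent supplied at all.
    present = [p for p in exp_params if p in act_params]
    params_present = len(present) / len(exp_params) if exp_params else 1.0

    # Fraction of required parameters whose values look right.
    if exp_params:
        param_values = sum(value_score(exp_params[p], act_params[p]) for p in present) / len(exp_params)
    else:
        param_values = 1.0

    return 0.35 * tool_match + 0.35 * params_present + 0.30 * param_values
```

With these weights, a correct tool call that supplies only one of two required parameters scores roughly 0.35 + 0.175 + 0.15 ≈ 0.70, matching the figures above.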
Three stochastic features prevent agents from gaming the environment:
- Tool catalogue shuffling — The 8 tools appear in random order every episode. No position-based shortcuts.
- Response jitter — Numeric fields (CPU %, latency, etc.) vary ±5% each episode. Same task, slightly different numbers.
- ID preservation — IDs needed for chaining (`CUST-5512`, `TKT-8801`) are never jittered. The agent must still read them from responses.
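A rough sketch of the jitter rule, assuming simulated responses are plain dicts. The function name and recursion are illustrative; the real implementation is in `server/api_tool_planner_environment.py`.

```python
import random

def jitter_response(response: dict, pct: float = 0.05) -> dict:
    """Illustrative: vary numeric fields by +/-pct, leave strings (including IDs) untouched."""
    out = {}
    for key, value in response.items():
        if isinstance(value, bool):
            out[key] = value                      # booleans are not treated as numbers
        elif isinstance(value, (int, float)):
            out[key] = type(value)(value * (1.0 + random.uniform(-pct, pct)))
        elif isinstance(value, dict):
            out[key] = jitter_response(value, pct)
        else:
            out[key] = value                      # "CUST-5512" and other strings preserved verbatim
    return out

# Catalogue shuffling: a new tool order every episode (tool names here are examples).
catalogue = ["lookup_customer", "get_customer_tickets", "merge_customers", "close_ticket"]
shuffled = random.sample(catalogue, k=len(catalogue))
```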
For free-form text parameters like `reason` or `description`, exact string matching is too strict. My grader uses Jaccard token overlap:
Expected: "checkout-service degraded, scaled to 5 replicas"
Agent said: "scaled checkout-service to 5 replicas due to degradation"
Token overlap: 0.72 → Score: 0.86 (not zero!)
For array parameters (like load balancer targets), the grader uses set comparison — order doesn't matter:
Expected: ["staging-api-01", "prod-api-01"]
Agent said: ["prod-api-01", "staging-api-01"]
Score: 1.0 internally (perfect — same set)
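In miniature, the two comparisons look like this. The tokenisation is my own illustration: with this simple tokeniser the Jaccard value for the example above comes out near 0.62, while the grader itself reports 0.72, so the exact tokeniser and score mapping differ from this sketch.

```python
# The two fuzzy comparisons above, in miniature (tokenisation is illustrative).
expected = {"checkout-service", "degraded", "scaled", "to", "5", "replicas"}
actual   = {"scaled", "checkout-service", "to", "5", "replicas", "due", "degradation"}
overlap = len(expected & actual) / len(expected | actual)   # ~0.62 here; the grader reports 0.72

# Array parameters: order-independent set equality
assert set(["staging-api-01", "prod-api-01"]) == set(["prod-api-01", "staging-api-01"])
```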
Each domain has carefully designed distractors:
| Domain | Distractor | Why it's tempting | Why it's wrong |
|---|---|---|---|
| Incident | `restart_service` | Sounds like it fixes things | Restarts the same broken version |
| Pipeline | `delete_pipeline_run` | Removes the failed run | Destructive and irreversible |
| Security | `block_ip` | Blocks the attacker's IP | Doesn't contain the compromised host |
| Cloud | `resize_vm` | Changes a VM's resources | Resizes the existing VM rather than creating a new one |
The environment gracefully handles malformed inputs with specific, helpful feedback:
| Input | Response |
|---|---|
| Empty tool name `""` | "Empty tool name provided. Choose a tool from the catalogue." |
| Unknown tool `"foobar"` | "Unknown tool 'foobar'. Not in the available catalogue." |
| Wrong tool (but valid) | "Wrong tool 'restart_service'. Expected a different API call." |
| Step without reset | "No active episode. Call reset() first." |
inference.py (Agent script — Oracle or LLM)
│
├── models.py (Pydantic: Action + Observation contracts)
├── server/tasks.py (15 tasks, 5 tool catalogues, simulated responses)
└── server/api_tool_planner_environment.py (Core RL logic: reset, step, grade, jitter)
│
└── server/app.py (FastAPI: /reset /step /tasks /grader /baseline)
│
└── Dockerfile + openenv.yaml (Deployment)
| File | Lines | Purpose |
|---|---|---|
| `server/tasks.py` | 288 | 15 task definitions across 5 domains with tool catalogues and simulated responses |
| `server/api_tool_planner_environment.py` | 335 | Core environment: reset, step, reward calculation, value comparison, response jitter |
| `inference.py` | 276 | Oracle + LLM agents, strict [START]/[STEP]/[END] stdout format |
| `server/app.py` | 134 | FastAPI server with custom /tasks, /grader, /baseline, /health endpoints |
| `models.py` | 50 | Typed Pydantic models for Action (tool_name + params) and Observation (11 fields) |
| `client.py` | 79 | WebSocket client for remote environment interaction |
| `tests/test_environment.py` | 415 | 46 unit tests — all passing |
$ uv run python -m pytest tests/test_environment.py -v
============================= 46 passed in 1.66s ==============================
Tests cover: all 15 tasks (perfect scores), reward mechanics (partial credit, efficiency bonus, token overlap, list comparison), stochastic features (shuffling, jitter, ID preservation), edge cases (empty tools, unknown tools, no-reset errors), and episode lifecycle.
$ uv run python inference.py --oracle
[START] task=incident_easy env=api_tool_planner model=oracle
[STEP] step=1 action=call(get_service_metrics,...) reward=0.99 done=true error=null
[END] success=true steps=1 score=0.99 rewards=0.99
...
============================================================
Agent: oracle
Tasks: 15 | Average score: 0.9900
============================================================
$ uv run openenv validate
[OK] api_tool_planner: Ready for multi-mode deployment
# Install dependencies (include dev for tests, inference for OpenAI client)
cd api_tool_planner
uv sync --extra dev --extra inference
# Run tests
uv run python -m pytest tests/test_environment.py -v
# Run oracle baseline (no API key needed)
uv run python inference.py --oracle
# Run LLM agent
export HF_TOKEN=hf_...
uv run python inference.py
# Start the server
uv run server --host 0.0.0.0 --port 8000
# Then open http://localhost:8000/docs for Swagger UI
# Validate
uv run openenv validate
# Docker
docker build -t api-tool-planner .
docker run -p 8000:8000 api-tool-planner

| Endpoint | Method | Description |
|---|---|---|
| `/reset` | POST | Start a new episode |
| `/step` | POST | Execute a tool-call action |
| `/state` | GET | Current episode metadata |
| `/tasks` | GET | List all 15 tasks |
| `/grader` | POST | Score a single action (stateless) |
| `/baseline` | GET | Run the oracle across all tasks |
| `/health` | GET | Health check |
| `/ws` | WS | WebSocket for stateful sessions |
| `/docs` | GET | Auto-generated Swagger UI |
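Outside of `client.py`, the two core endpoints can also be driven directly over HTTP. The JSON shapes below are assumptions for illustration; the authoritative request and response schemas are on the `/docs` Swagger page.

```python
# Illustrative only: drive /reset and /step over HTTP with `requests`.
# The exact JSON payload shapes are assumptions; check /docs for the real schemas.
import requests

BASE = "http://localhost:8000"

obs = requests.post(f"{BASE}/reset", json={}).json()     # start an episode
print(obs)                                                # task prompt, tool catalogue, etc.

action = {"tool_name": "lookup_customer",
          "params": {"email": "jane.doe@example.com"}}
obs = requests.post(f"{BASE}/step", json=action).json()  # execute one tool call
print(obs)                                                # reward, done flag, simulated API response
```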
step_reward = (tool_match × 0.35 + params_present × 0.35 + param_values × 0.30) / num_expected_calls
- Total episode reward: strictly within (0, 1) — never exactly 0.0 or 1.0
- Efficiency bonus: +0.05 if completed within `num_calls + 1` steps
- Cumulative reward: capped at 0.99
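Put together, the episode-level bookkeeping might look roughly like this (a hedged sketch; the function and variable names are mine, and the floor that keeps rewards strictly above 0.0 is not shown):

```python
# Illustrative aggregation of per-step scores into an episode reward.
def episode_reward(step_scores: list[float], num_expected_calls: int, steps_taken: int) -> float:
    # Each step's weighted 35/35/30 score is divided by the number of expected calls,
    # so a perfect multi-call episode sums to about 1.0 before the cap.
    total = sum(score / num_expected_calls for score in step_scores)
    if steps_taken <= num_expected_calls + 1:    # efficiency bonus
        total += 0.05
    return min(total, 0.99)                      # cumulative reward capped at 0.99
```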
Parameter value scoring:
- Exact match → 1.0
- Token overlap ≥ 50% → 0.75–1.0
- Substring match → 0.5
- Numeric within ±10% → 0.5
- List/set match (order-independent) → 1.0
- No match → 0.0
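A sketch of how these tiers could combine into a single dispatcher. The thresholds mirror the list above, and the overlap-to-score mapping (0.5 + 0.5 × overlap) is an assumption that happens to land in the stated 0.75–1.0 range; the grader's actual code may be structured differently.

```python
import re

def _tokens(text: str) -> set[str]:
    """Lower-cased word tokens with punctuation stripped (illustrative tokeniser)."""
    return set(re.findall(r"[a-z0-9\-]+", str(text).lower()))

def value_score(expected, actual) -> float:
    """Illustrative tiered comparison for a single parameter value."""
    if expected == actual:
        return 1.0                                            # exact match
    if isinstance(expected, (list, tuple)) and isinstance(actual, (list, tuple)):
        return 1.0 if set(expected) == set(actual) else 0.0   # order-independent lists
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return 0.5 if abs(actual - expected) <= 0.10 * abs(expected) else 0.0  # numeric +/-10%
    if isinstance(expected, str) and isinstance(actual, str):
        a, b = _tokens(expected), _tokens(actual)
        overlap = len(a & b) / len(a | b) if a | b else 0.0
        if overlap >= 0.5:
            return 0.5 + 0.5 * overlap                        # maps 0.5-1.0 overlap onto 0.75-1.0
        if expected.lower() in actual.lower() or actual.lower() in expected.lower():
            return 0.5                                        # substring match
    return 0.0
```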
api_tool_planner/
├── README.md # This file
├── STUDENT_README.md # Detailed technical documentation
├── learn.md # Learning roadmap (prerequisites + codebase walkthrough)
├── openenv.yaml # OpenEnv spec configuration
├── pyproject.toml # Dependencies and entry points
├── Dockerfile # Multi-stage Docker build
├── inference.py # Mandatory inference script (Oracle + LLM)
├── models.py # Pydantic Action/Observation models
├── client.py # WebSocket client
├── __init__.py # Package exports
├── server/
│ ├── __init__.py
│ ├── api_tool_planner_environment.py # Core environment (reset/step/grade)
│ ├── tasks.py # 15 task definitions across 5 domains
│ └── app.py # FastAPI application + custom endpoints
├── tests/
│ ├── __init__.py
│ └── test_environment.py # 46 unit tests
└── scripts/
└── validate-submission.sh # Pre-submission validator
Built with OpenEnv by Meta PyTorch. Submitted for the Scaler School of Technology Hackathon, Round 1.