
---
title: Api Tool Planner Environment Server
emoji: 🧠
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# API Tool Planner: Teaching LLMs to Think Before They Call

*Submission by Aman Agrawal | Meta PyTorch x Scaler Hackathon — Round 1*


## The Problem I Set Out to Solve

Every day, SREs get paged at 3 AM because a service is on fire. Data engineers discover their pipeline dumped garbage into production. Support agents fumble through five different tools trying to find a customer's ticket. Security analysts race to contain a breach before credentials leak.

In all these scenarios, the human knows what needs to happen — but the execution requires calling the right APIs, in the right order, with the right parameters. Miss one step and the incident escalates. Call the wrong endpoint and you make things worse.

What if an LLM agent could handle this?

That's the premise of the API Tool Planner. It's an RL environment that trains and evaluates LLM agents on the single most important skill they need in production: choosing the correct API tool and parameters from a catalogue of plausible alternatives, chaining multiple calls together, and extracting information from one response to feed into the next.

This isn't a toy environment. Every task is modelled after a real operational scenario that costs companies real money when it goes wrong.


## What I Built

An OpenEnv-compliant RL environment with 15 tasks across 5 real-world operational domains, each with dense per-step rewards, distractor tools, and progressively harder challenges.

### The 5 Domains

| Domain | What the agent does | Why it matters |
| --- | --- | --- |
| Incident Response | Diagnose service outages, scale infrastructure, roll back bad deploys | A 10-minute delay in incident response can cost $10K+ |
| Data Pipeline Ops | Investigate failed ETL runs, retry from failed stages, pause pipelines | Bad data flowing downstream is worse than no data |
| Customer Support | Look up accounts, apply refunds, escalate tickets | Customer churn from bad support costs 5x acquisition |
| Security Ops | Triage alerts, quarantine compromised hosts, revoke credentials | Every minute a breach goes uncontained, exposure grows |
| Cloud Infrastructure | Provision VMs, set up DNS, configure load balancers | Misconfigurations cause ~70% of cloud outages |

### Progressive Difficulty (Easy → Medium → Hard)

| Difficulty | Steps | What makes it hard |
| --- | --- | --- |
| Easy (5 tasks) | 1 call | Pick the right tool from 8 options including distractors |
| Medium (5 tasks) | 2 calls | Correct ordering matters — scale then open incident, not the reverse |
| Hard (5 tasks) | 3 calls | ID chaining — extract IDs from API responses to use in subsequent calls, plus distractor traps |

### Example: The Hardest Task (`support_hard`)

"A customer with email jane.doe@example.com has called back three times about an unresolved issue. Find their account, get their open tickets, and escalate the most recent one to Tier 2."

The agent must:

1. Call `lookup_customer(email="jane.doe@example.com")` → response contains `customer_id: "CUST-5512"`
2. Extract `CUST-5512` and call `get_customer_tickets(customer_id="CUST-5512", status="open")` → response contains `ticket_id: "TKT-8801"`
3. Extract `TKT-8801` and call `escalate_ticket(ticket_id="TKT-8801", target_tier="Tier 2", reason="...")`

The agent can't know these IDs in advance. It must read API responses and chain them. Meanwhile, 5 other tools in the catalogue (like `merge_customers` or `close_ticket`) look tempting but are wrong.
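To make the chaining concrete, here is a minimal sketch of that loop against the reset/step interface described under Technical Architecture below. The `env` handle, the `task_id` keyword, and the observation field names (`last_response`, the tickets list) are illustrative assumptions, not the environment's actual API:

```python
# Sketch of ID chaining for support_hard (field names are assumptions).
obs = env.reset(task_id="support_hard")

# Step 1: look up the customer; the response carries the ID we need next.
obs = env.step({"tool_name": "lookup_customer",
                "params": {"email": "jane.doe@example.com"}})
customer_id = obs["last_response"]["customer_id"]            # "CUST-5512"

# Step 2: feed the extracted ID into the tickets query.
obs = env.step({"tool_name": "get_customer_tickets",
                "params": {"customer_id": customer_id, "status": "open"}})
ticket_id = obs["last_response"]["tickets"][0]["ticket_id"]  # "TKT-8801"

# Step 3: escalate using the second extracted ID.
obs = env.step({"tool_name": "escalate_ticket",
                "params": {"ticket_id": ticket_id,
                           "target_tier": "Tier 2",
                           "reason": "customer called back three times"}})
```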


## What Makes This Submission Different

### 1. Dense Reward Shaping — Not Just Right or Wrong

Most environments give binary rewards: 1 for correct, 0 for wrong. That's terrible for learning. My environment gives granular partial credit at every step:

| Component | Weight | What it rewards |
| --- | --- | --- |
| Correct tool name | 35% | Picked `rollback_service` instead of `restart_service` |
| Required params present | 35% | Included `service_name` and `target_version` |
| Parameter values accurate | 30% | Values match what the task specified |
| Efficiency bonus | +5% | Completed the task without wasting steps |

An agent that picks the right tool but forgets a parameter gets ~0.70. One that picks the right tool and names the right params but gets the values only partly right gets ~0.85. This gradient is what makes RL training possible.
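To make those figures concrete under the weights above: right tool plus all required params present, but every value wrong, already earns 0.35 + 0.35 = 0.70, and values that land in the 0.5 partial-credit band (substring or near-numeric matches; see the scoring ladder under The Reward Function) lift that to 0.35 + 0.35 + 0.30 × 0.5 = 0.85. This is one consistent reading of the ~0.70 and ~0.85 figures.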

### 2. Anti-Memorisation by Design

Three stochastic features prevent agents from gaming the environment (a sketch follows the list):

- **Tool catalogue shuffling** — The 8 tools appear in random order every episode. No position-based shortcuts.
- **Response jitter** — Numeric fields (CPU %, latency, etc.) vary ±5% each episode. Same task, slightly different numbers.
- **ID preservation** — IDs needed for chaining (`CUST-5512`, `TKT-8801`) are never jittered. The agent must still read them from responses.
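As a rough illustration, jitter with ID preservation can be as simple as recursing over the response and perturbing only numeric leaves; the sketch below is an assumption about the mechanism, not the environment's actual code:

```python
import random

def jitter_response(value, pct: float = 0.05):
    """Recursively perturb numeric fields by up to ±pct; everything else is untouched."""
    if isinstance(value, dict):
        return {k: jitter_response(v, pct) for k, v in value.items()}
    if isinstance(value, list):
        return [jitter_response(v, pct) for v in value]
    if isinstance(value, bool):   # bool is a subclass of int; don't jitter flags
        return value
    if isinstance(value, (int, float)):
        return round(value * random.uniform(1 - pct, 1 + pct), 2)
    return value  # strings, including IDs like "CUST-5512", pass through verbatim

print(jitter_response({"cpu_percent": 91.0, "customer_id": "CUST-5512"}))
```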

### 3. Intelligent Grading — Token Overlap, Not Just Exact Match

For free-form text parameters like `reason` or `description`, exact string matching is too strict. My grader uses Jaccard token overlap:

```
Expected:   "checkout-service degraded, scaled to 5 replicas"
Agent said: "scaled checkout-service to 5 replicas due to degradation"
Token overlap: 0.72 → Score: 0.86 (not zero!)
```

For array parameters (like load balancer targets), the grader uses set comparison — order doesn't matter:

```
Expected:   ["staging-api-01", "prod-api-01"]
Agent said: ["prod-api-01", "staging-api-01"]
Score: 1.0 internally (perfect — same set)
```
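Here is a sketch of how the overlap score can be computed and mapped, assuming (consistently with the 0.72 → 0.86 example above and the ladder under The Reward Function) that overlaps of at least 0.5 map linearly onto [0.75, 1.0]. The exact tokenisation (punctuation stripping, stemming) isn't specified here, so the raw overlap number will vary slightly:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased, whitespace-split token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def overlap_score(expected: str, actual: str) -> float:
    """Map a token overlap of at least 0.5 linearly onto [0.75, 1.0]."""
    j = jaccard(expected, actual)
    return 0.75 + 0.5 * (j - 0.5) if j >= 0.5 else 0.0

# With this mapping, an overlap of 0.72 lands at 0.75 + 0.5 * 0.22 = 0.86.
print(overlap_score("checkout-service degraded, scaled to 5 replicas",
                    "scaled checkout-service to 5 replicas due to degradation"))
```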

### 4. Distractor Traps That Test Real Understanding

Each domain has carefully designed distractors:

| Domain | Distractor | Why it's tempting | Why it's wrong |
| --- | --- | --- | --- |
| Incident | `restart_service` | Sounds like it fixes things | Restarts the same broken version |
| Pipeline | `delete_pipeline_run` | Removes the failed run | Destructive and irreversible |
| Security | `block_ip` | Blocks the attacker's IP | Doesn't contain the compromised host |
| Cloud | `resize_vm` | Changes a VM's resources | Resizes an existing VM, doesn't create a new one |

### 5. Robust Edge-Case Handling

The environment gracefully handles malformed inputs with specific, helpful feedback (a validation sketch follows the table):

| Input | Response |
| --- | --- |
| Empty tool name `""` | "Empty tool name provided. Choose a tool from the catalogue." |
| Unknown tool `"foobar"` | "Unknown tool 'foobar'. Not in the available catalogue." |
| Wrong tool (but valid) | "Wrong tool 'restart_service'. Expected a different API call." |
| Step without `reset()` | "No active episode. Call reset() first." |
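A sketch of the validation ladder that would produce those messages; the real environment's control flow may differ:

```python
def validate_call(tool_name: str, catalogue: set[str],
                  expected_tool: str, episode_active: bool) -> str | None:
    """Return a feedback message for a malformed or wrong call, or None if OK."""
    if not episode_active:
        return "No active episode. Call reset() first."
    if not tool_name:
        return "Empty tool name provided. Choose a tool from the catalogue."
    if tool_name not in catalogue:
        return f"Unknown tool '{tool_name}'. Not in the available catalogue."
    if tool_name != expected_tool:
        return f"Wrong tool '{tool_name}'. Expected a different API call."
    return None
```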

## Technical Architecture

```
inference.py                    (Agent script — Oracle or LLM)
    │
    ├── models.py                               (Pydantic: Action + Observation contracts)
    ├── server/tasks.py                         (15 tasks, 5 tool catalogues, simulated responses)
    └── server/api_tool_planner_environment.py  (Core RL logic: reset, step, grade, jitter)
            │
            └── server/app.py    (FastAPI: /reset /step /tasks /grader /baseline)
                    │
                    └── Dockerfile + openenv.yaml  (Deployment)
```

### Files at a Glance

| File | Lines | Purpose |
| --- | --- | --- |
| `server/tasks.py` | 288 | 15 task definitions across 5 domains with tool catalogues and simulated responses |
| `server/api_tool_planner_environment.py` | 335 | Core environment: reset, step, reward calculation, value comparison, response jitter |
| `inference.py` | 276 | Oracle + LLM agents, strict [START]/[STEP]/[END] stdout format |
| `server/app.py` | 134 | FastAPI server with custom /tasks, /grader, /baseline, /health endpoints |
| `models.py` | 50 | Typed Pydantic models for Action (tool_name + params) and Observation (11 fields); sketched below |
| `client.py` | 79 | WebSocket client for remote environment interaction |
| `tests/test_environment.py` | 415 | 46 unit tests — all passing |
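Based only on the fields named in the table (tool_name plus params for Action; the Observation's 11 fields aren't enumerated here), the Action contract in `models.py` looks roughly like this sketch:

```python
from typing import Any
from pydantic import BaseModel, Field

class Action(BaseModel):
    """One tool call: the chosen tool and its parameters (illustrative sketch)."""
    tool_name: str
    params: dict[str, Any] = Field(default_factory=dict)

# Usage: the first call of the support_hard chain.
action = Action(tool_name="lookup_customer",
                params={"email": "jane.doe@example.com"})
```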

## Verification — Everything Works

### Tests: 46/46 passing

```
$ uv run python -m pytest tests/test_environment.py -v
============================= 46 passed in 1.66s ==============================
```

Tests cover: all 15 tasks (perfect scores), reward mechanics (partial credit, efficiency bonus, token overlap, list comparison), stochastic features (shuffling, jitter, ID preservation), edge cases (empty tools, unknown tools, no-reset errors), and episode lifecycle.

### Oracle Baseline: 15/15 tasks at 0.99

```
$ uv run python inference.py --oracle
[START] task=incident_easy env=api_tool_planner model=oracle
[STEP] step=1 action=call(get_service_metrics,...) reward=0.99 done=true error=null
[END] success=true steps=1 score=0.99 rewards=0.99
...
============================================================
  Agent: oracle
  Tasks: 15  |  Average score: 0.9900
============================================================
```

### OpenEnv Validate: Passing

```
$ uv run openenv validate
[OK] api_tool_planner: Ready for multi-mode deployment
```

## How to Run

```bash
# Install dependencies (dev extras for tests, inference extras for the OpenAI client)
cd api_tool_planner
uv sync --extra dev --extra inference

# Run tests
uv run python -m pytest tests/test_environment.py -v

# Run the oracle baseline (no API key needed)
uv run python inference.py --oracle

# Run the LLM agent
export HF_TOKEN=hf_...
uv run python inference.py

# Start the server
uv run server --host 0.0.0.0 --port 8000
# Then open http://localhost:8000/docs for the Swagger UI

# Validate
uv run openenv validate

# Docker
docker build -t api-tool-planner .
docker run -p 8000:8000 api-tool-planner
```

## API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /reset | POST | Start a new episode |
| /step | POST | Execute a tool-call action |
| /state | GET | Current episode metadata |
| /tasks | GET | List all 15 tasks |
| /grader | POST | Score a single action (stateless) |
| /baseline | GET | Run the oracle across all tasks |
| /health | GET | Health check |
| /ws | WS | WebSocket for stateful sessions |
| /docs | GET | Auto-generated Swagger UI |
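As an example, here is a minimal Python client for the /reset and /step endpoints. The request payloads and response fields shown are assumptions; check the Swagger UI at /docs for the actual schemas:

```python
import requests

BASE = "http://localhost:8000"

# Start an episode (the task-selection key is an assumed name).
obs = requests.post(f"{BASE}/reset", json={"task_id": "support_hard"}).json()

# Execute one tool-call action: tool_name plus params, per models.py.
result = requests.post(f"{BASE}/step", json={
    "tool_name": "lookup_customer",
    "params": {"email": "jane.doe@example.com"},
}).json()

print(result)  # expect a per-step reward, a done flag, and the simulated response
```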

## The Reward Function

```
step_reward = (tool_match × 0.35 + params_present × 0.35 + param_values × 0.30) / num_expected_calls
```

- Total episode reward: strictly within (0, 1) — never exactly 0.0 or 1.0
- Efficiency bonus: +0.05 if completed within num_calls + 1 steps
- Cumulative reward: capped at 0.99
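Worked through for a hard task with num_expected_calls = 3: a perfect step contributes (1.0 × 0.35 + 1.0 × 0.35 + 1.0 × 0.30) / 3 ≈ 0.333, so three perfect steps sum to 1.0; the +0.05 efficiency bonus would push past 1.0, and the 0.99 cap brings it back down, which is exactly the 0.99 per task that the oracle baseline reports.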

Parameter value scoring (a dispatch sketch follows the list):

- Exact match → 1.0
- Token overlap ≥ 50% → 0.75–1.0
- Substring match → 0.5
- Numeric within ±10% → 0.5
- List/set match (order-independent) → 1.0
- No match → 0.0
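Putting the ladder together, a sketch of how a single parameter value might be scored; the ordering of the checks and the tokenisation are assumptions:

```python
def score_value(expected, actual) -> float:
    """Score one parameter value against its expected value (illustrative sketch)."""
    if expected == actual:
        return 1.0                                    # exact match
    if isinstance(expected, list) and isinstance(actual, list):
        return 1.0 if set(expected) == set(actual) else 0.0  # order-independent
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return 0.5 if abs(actual - expected) <= 0.1 * abs(expected) else 0.0
    if isinstance(expected, str) and isinstance(actual, str):
        ta, tb = set(expected.lower().split()), set(actual.lower().split())
        overlap = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
        if overlap >= 0.5:
            return 0.75 + 0.5 * (overlap - 0.5)       # 50%+ overlap scores 0.75–1.0
        if expected.lower() in actual.lower() or actual.lower() in expected.lower():
            return 0.5                                # substring match
    return 0.0                                        # no match
```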

## Project Structure

```
api_tool_planner/
├── README.md                          # This file
├── STUDENT_README.md                  # Detailed technical documentation
├── learn.md                           # Learning roadmap (prerequisites + codebase walkthrough)
├── openenv.yaml                       # OpenEnv spec configuration
├── pyproject.toml                     # Dependencies and entry points
├── Dockerfile                         # Multi-stage Docker build
├── inference.py                       # Mandatory inference script (Oracle + LLM)
├── models.py                          # Pydantic Action/Observation models
├── client.py                          # WebSocket client
├── __init__.py                        # Package exports
├── server/
│   ├── __init__.py
│   ├── api_tool_planner_environment.py  # Core environment (reset/step/grade)
│   ├── tasks.py                         # 15 task definitions across 5 domains
│   └── app.py                           # FastAPI application + custom endpoints
├── tests/
│   ├── __init__.py
│   └── test_environment.py              # 46 unit tests
└── scripts/
    └── validate-submission.sh           # Pre-submission validator
```

Built with OpenEnv by Meta PyTorch. Submitted for the Scaler School of Technology Hackathon, Round 1.
