OpenAgent

Can Agents Generalize to the Open World?

Unveiling the Fragility of Static Training in Tool Use

Weiming Wu* · Song-Lin Lv* · Rui Zhu · Zi-Jian Cheng · Lan-Zhe Guo

Nanjing University · National Key Laboratory for Novel Software Technology
* Equal contribution

Official implementation of OpenAgent, a controlled sandbox benchmark for testing whether tool-use agents trained in a static environment can generalize when user queries, tool schemas, observations, and task domains change at inference time.

🚀 TL;DR

Task: OpenAgent studies tool-use agents under open-world environment shifts, rather than only under matched training and test conditions.
Benchmark: We build controlled sandbox tasks where query wording, tool definitions, observations, and domains can be changed independently.
Diagnosis: The evaluation is organized into four tiers: Perception, Interaction, Reasoning, and Internalization.
Code: This repository includes data generation, sandbox tools, scenario configs, evaluation runners, smoke tests, and reusable scripts.

🔄 Open Environment Shifts

Shift	What changes at inference time	What it probes
Query shift	Rewritten, ambiguous, underspecified, or redundant user requests	Intent abstraction beyond surface phrasing
Action shift	Renamed tools, noisy schemas, distractor tools, and changed verification codes	Tool grounding beyond memorized names
Observation shift	Missing values, non-standard feedback, runtime errors, and tool redirection	Feedback parsing and recovery
Domain shift	Transfer from the POI sandbox to an unseen medical-domain tool environment	Structural transfer across domains
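
As a toy illustration of the action shift in the table above, the sketch below renames a tool in a tool list and leaves everything else unchanged. The schema layout (a list of dicts with `name` and `description` keys), the `rename_tools` helper, the description text, and the new name `get_poi_coords` are all hypothetical. This is not OpenAgent's perturbation code.

```python
import copy


def rename_tools(tools, mapping):
    """Return a copy of `tools` with names replaced according to `mapping`.

    `tools` is assumed to be a list of dicts with a "name" key
    (a hypothetical layout, not the repository's schema).
    """
    shifted = copy.deepcopy(tools)  # leave the training-time list untouched
    for tool in shifted:
        tool["name"] = mapping.get(tool["name"], tool["name"])
    return shifted


# Hypothetical training-time tool list.
train_tools = [
    {"name": "search_map_coordinates", "description": "Look up the coordinates of a place."},
]

# Hypothetical inference-time rename: same behavior, different surface name.
test_tools = rename_tools(train_tools, {"search_map_coordinates": "get_poi_coords"})
print(test_tools[0]["name"])
```

An agent that has only memorized the training-time name would fail to find the tool here, even though its description is unchanged.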

🧭 Four-Tier Sandbox

OpenAgent groups the shifts into a reusable four-tier diagnostic hierarchy:

Tier	Focus	Representative scenarios
Perception	Parse user intent and tool semantics	query rewrite, tool-name noise, semantic renaming, distractor tools
Interaction	React to environment feedback	missing information, mode changes, tool errors, redirection
Reasoning	Adapt the execution graph and calculation rules	unavailable tools, altered tool logic
Internalization	Transfer the learned task structure	domain transfer to medical tools

🧩 Repository Contents

Module	Purpose
`datagen/`	Build POI tool-use training and test trajectories
`eval/`	Run configurable four-tier open-world evaluation scenarios
`tools/`	Execute POI and medical-domain sandbox tools
`examples/`	Provide a tiny local CSV for smoke checks
`scripts/`	Provide simple command-line entry points
`tests/`	Run model-free checks for the data and tool pipeline

⚙️ Installation

git clone https://github.com/LAMDA-NeSy/OpenAgent.git
cd OpenAgent

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

On Windows PowerShell, activate the environment with:

.\.venv\Scripts\Activate.ps1

If you want to use LLM-based query rewriting during data generation:

cp .env.example .env
# Fill in OPENAI_API_KEY, OPENAI_BASE_URL, and REWRITE_MODELS if needed.

⚡ Quick Start

Run the smoke test first. It generates a small dataset from examples/sample_database.csv and checks that the sandbox tools can execute the generated calls.

python tests/smoke_test.py

Generate a demo test set:

bash scripts/generate_sample_data.sh

The output is written to:

outputs/demo/demo_test.json

🧪 Dataset Generation

Generate test data from a POI CSV:

python datagen/generate.py \
  --split test \
  --csv data/database_test_en.csv \
  --output data/testset/baseline.json \
  --samples 50 \
  --single-samples 20 \
  --seed 7

Generate training data with optional LLM rewriting:

python datagen/generate.py \
  --split train \
  --csv data/database_en.csv \
  --output data/train.json \
  --samples 800 \
  --rewrite

Expected CSV columns:

name,location,type,adname,tel,cost1,cost2

The location column should use longitude,latitude format, for example 116.300100,39.900100. Telephone numbers are normalized to eight digits.
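
Before running the generator on your own CSV, it may help to check the header and the location format. The sketch below is one way to do that with the standard library. The helper names `parse_location` and `check_rows` and the sample row are made up for illustration, and the sketch assumes the location field is quoted because it contains a comma. It does not reproduce the telephone normalization, since the exact rule is not described here.

```python
import csv
import io

EXPECTED_COLUMNS = ["name", "location", "type", "adname", "tel", "cost1", "cost2"]


def parse_location(value):
    """Split a "longitude,latitude" string into two floats."""
    lon_text, lat_text = value.split(",")
    return float(lon_text), float(lat_text)


def check_rows(handle):
    """Validate the header, then yield (name, lon, lat) for every row."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        lon, lat = parse_location(row["location"])
        yield row["name"], lon, lat


# Made-up sample row; the location field is quoted because it contains a comma.
sample = io.StringIO(
    "name,location,type,adname,tel,cost1,cost2\n"
    'alpha_cafe,"116.300100,39.900100",cafe,demo_district,12345678,10,20\n'
)
rows = list(check_rows(sample))
print(rows)
```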

🛠️ Sandbox Evaluation

Run one scenario:

python eval/run_eval.py \
  --model_path /path/to/model \
  --scenario layer1_name_noise \
  --variant 1 \
  --data_root data \
  --output_root outputs

Run multiple scenarios across GPUs:

python eval/run_all.py \
  --models /path/to/model1 /path/to/model2 \
  --scenarios layer1_name_noise layer2_tool_error layer3_error_tool layer4_domain_medical \
  --variants 1 2 3 \
  --data_root data \
  --output_root outputs \
  --gpu_ids 0 1 2 3 \
  --min_memory_mb 50000
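
To see how many runs a multi-scenario sweep implies, the sketch below enumerates one `eval/run_eval.py` command per (model, scenario, variant) combination, using only the flags shown in the single-scenario command. It assumes the sweep is a full cross product, which is a guess about `eval/run_all.py` rather than a description of it, and it does not model GPU selection or memory checks. `build_commands` is a hypothetical helper.

```python
import itertools
import shlex


def build_commands(models, scenarios, variants, data_root="data", output_root="outputs"):
    """Build one run_eval.py argument list per (model, scenario, variant)."""
    commands = []
    for model, scenario, variant in itertools.product(models, scenarios, variants):
        commands.append([
            "python", "eval/run_eval.py",
            "--model_path", model,
            "--scenario", scenario,
            "--variant", str(variant),
            "--data_root", data_root,
            "--output_root", output_root,
        ])
    return commands


commands = build_commands(
    models=["/path/to/model1", "/path/to/model2"],
    scenarios=["layer1_name_noise", "layer2_tool_error"],
    variants=[1, 2, 3],
)
print(len(commands))             # number of individual runs in the grid
print(shlex.join(commands[0]))   # first command as a shell-ready string
```

Each argument list can be passed to `subprocess.run` if you prefer to drive the runs yourself.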

Available scenario configs live in eval/configs/:

Config	Tier	Shift
`layer1_name_noise`	Perception	Syntactic tool-name noise
`layer1_name_semantic`	Perception	Semantic tool-name paraphrase
`layer1_name_wo_semantic`	Perception	Non-semantic tool rename
`layer1_query_rewrite`	Perception	User-query rewrite
`layer1_add_related`	Perception	Related tool distractors
`layer1_mode_change`	Perception	Verification-code change
`layer2_information_loss`	Interaction	Missing values
`layer2_mode_change`	Interaction	Parameter-mode change
`layer2_query_condition`	Interaction	Query-condition change
`layer2_tool_error`	Interaction	Tool execution error
`layer3_error_tool`	Reasoning	Required tool unavailable
`layer4_domain_medical`	Internalization	Domain transfer
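
The `layerN_` prefix of each config name matches its tier in the table above. The sketch below relies on that naming pattern to group configs by tier. The `tier_of` helper is hypothetical and is not part of the repository.

```python
from collections import defaultdict

# Tier names as listed in the table above, keyed by the layer number.
TIERS = {1: "Perception", 2: "Interaction", 3: "Reasoning", 4: "Internalization"}


def tier_of(config_name):
    """Map a config name such as "layer2_tool_error" to its tier name."""
    prefix = config_name.partition("_")[0]
    number = prefix[len("layer"):]
    if not prefix.startswith("layer") or not number.isdigit() or int(number) not in TIERS:
        raise ValueError(f"unrecognized config name: {config_name!r}")
    return TIERS[int(number)]


configs = [
    "layer1_name_noise",
    "layer2_tool_error",
    "layer3_error_tool",
    "layer4_domain_medical",
]

by_tier = defaultdict(list)
for name in configs:
    by_tier[tier_of(name)].append(name)

for tier, names in by_tier.items():
    print(tier, names)
```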

Outputs are saved as:

outputs/
  results/<scenario>/<model>_<scenario>_<variant>.json
  metrics/<scenario>/<model>_<scenario>_<variant>.json
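
The layout above can be turned into concrete paths with a small helper. In this sketch, `output_paths` is a hypothetical function, and taking `<model>` to be the last component of the model path is an assumption. Check the evaluation runner if your model names differ.

```python
from pathlib import PurePosixPath


def output_paths(output_root, model_path, scenario, variant):
    """Return the (results, metrics) paths for one evaluation run."""
    # Assumption: <model> is the final component of --model_path.
    model = PurePosixPath(model_path).name
    filename = f"{model}_{scenario}_{variant}.json"
    root = PurePosixPath(output_root)
    return root / "results" / scenario / filename, root / "metrics" / scenario / filename


results_path, metrics_path = output_paths("outputs", "/path/to/model", "layer1_name_noise", 1)
print(results_path)
print(metrics_path)
```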

📦 Data Format

Each generated example has this structure:

{
  "id": 0,
  "source": "distance",
  "conversations": [
    {"from": "human", "value": "Please calculate the distance ..."},
    {"from": "gpt", "value": "<tool_call>{\"tool\": \"search_map_coordinates\", \"args\": {\"name\": \"alpha_cafe\"}}</tool_call>"},
    {"from": "observation", "value": "{\"location\": \"116.300100,39.900100\"}"},
    {"from": "gpt", "value": "<answer>123.4</answer>"}
  ],
  "tools": "[...]",
  "gt_tools": ["search_map_coordinates"],
  "ground_truth": "123.4",
  "meta": {"total_steps": 2}
}
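
A generated example can be read back with the standard library alone. The sketch below extracts the tool names and the final answer from the `gpt` turns and compares them with `gt_tools` and `ground_truth`. The `summarize` helper is hypothetical, and the regular expressions assume the `<tool_call>` and `<answer>` tags appear exactly as in the structure above.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def summarize(example):
    """Return (called_tool_names, final_answer) from the gpt turns."""
    called, answer = [], None
    for turn in example["conversations"]:
        if turn["from"] != "gpt":
            continue
        # Each tool call carries a JSON payload with "tool" and "args" keys.
        for payload in TOOL_CALL_RE.findall(turn["value"]):
            called.append(json.loads(payload)["tool"])
        match = ANSWER_RE.search(turn["value"])
        if match:
            answer = match.group(1).strip()
    return called, answer


# Same example as above, written as a Python dict.
example = {
    "id": 0,
    "source": "distance",
    "conversations": [
        {"from": "human", "value": "Please calculate the distance ..."},
        {"from": "gpt", "value": '<tool_call>{"tool": "search_map_coordinates", "args": {"name": "alpha_cafe"}}</tool_call>'},
        {"from": "observation", "value": '{"location": "116.300100,39.900100"}'},
        {"from": "gpt", "value": "<answer>123.4</answer>"},
    ],
    "gt_tools": ["search_map_coordinates"],
    "ground_truth": "123.4",
}

called, answer = summarize(example)
print(called == example["gt_tools"], answer == example["ground_truth"])
```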

📝 Citation

@inproceedings{wu2026openagent,
  title     = {Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use},
  author    = {Wu, Weiming and Lv, Song-Lin and Zhu, Rui and Cheng, Zi-Jian and Guo, Lan-Zhe},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}
