# NLP_MCP_AGENTS

> How does agent architecture affect performance when the tool surface is held constant via MCP?
This repository contains the code, benchmarks, and analysis for our research on comparing single-agent vs. multi-agent architectures for LLM tool use, evaluated on standardized Model Context Protocol (MCP) benchmarks.
## Table of Contents

- Key Findings
- Research Questions
- Project Structure
- Installation
- Quick Start
- Benchmarks
- Architectures
- Results
- Citation
- Team
- Acknowledgments
## Key Findings

| Finding | Description |
|---|---|
| Multi-agent matches frontier models | GPT-OSS 120B multi-agent achieves 42.9% success (same as Gemini 2.5 Pro) with 31× fewer LLM calls |
| Architecture compensates for model size | Multi-agent orchestration enables smaller models to match larger single-agent performance |
| MCP enables fair comparison | 92.3% tool success rate confirms performance differences are architectural, not tool-related |
| Exceptional cost efficiency | CrewAI + Gemma 2 9B achieves 68.2% success at 0.04% of GPT-4.1's cost |
## Research Questions

- Which agent architecture demonstrates superior performance across multi-domain MCP tasks?
- Does MCP standardization enable fair architecture comparison?
- Which task types benefit most from multi-agent coordination?
- What is the cost-performance tradeoff for local vs. cloud models?
## Project Structure

```
NLP_MCP_AGENTS/
├── MCP-Universe/                        # Forked MCP-Universe benchmark
│   └── mcpuniverse/
│       ├── benchmark/
│       │   ├── configs/
│       │   │   └── test/
│       │   │       ├── web_search.yaml
│       │   │       ├── location_navigation.yaml
│       │   │       └── web_search/      # 55 task definitions
│       │   ├── runner.py
│       │   └── report.py
│       ├── agents/
│       │   ├── web_search_react.py             # Web search orchestrator
│       │   ├── query_formulation_agent.py      # Query optimization
│       │   ├── search_execution_agent.py       # Search execution
│       │   ├── content_fetch_agent.py          # Content retrieval
│       │   └── fact_verification_agent.py      # Fact verification
│       └── mcp/
│           └── servers/
│               ├── google_search/       # SerpAPI integration
│               └── fetch/               # Content fetching
├── crewai_navigation/                   # CrewAI location navigation
│   ├── agents/
│   │   ├── orchestrator.py
│   │   ├── route_planning_agent.py
│   │   ├── distance_optimization_agent.py
│   │   ├── time_optimization_agent.py
│   │   └── place_finding_agent.py
│   ├── tools/
│   │   └── google_maps_tools.py         # 8 MCP-wrapped tools
│   └── llm_config.py                    # Ollama/Gemma configuration
├── results/                             # Benchmark outputs
│   ├── web_search/
│   │   ├── GPTOSS120B_multiagent.md
│   │   ├── GPTOSS20B_single.md
│   │   └── Gemini2.5_Pro.md
│   └── location_navigation/
│       └── crewai_gemma2_9b.md
├── docs/
│   ├── WEB_SEARCH_ARCHITECTURE.md
│   ├── LOCATION_NAVIGATION_ARCHITECTURE.md
│   └── figures/
└── paper/
    ├── acl_paper.tex
    └── acl_paper.pdf
```
## Installation

### Prerequisites

- Python 3.11+
- Ollama (for local model inference)
- Google Cloud API key (for Maps API)
- SerpAPI key (for web search)
### Setup

```bash
# Clone the repository
git clone https://github.com/prasad-yashdeep/NLP_MCP_AGENTS.git
cd NLP_MCP_AGENTS

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install MCP-Universe
cd MCP-Universe
pip install -e .
cd ..

# Install CrewAI components
pip install crewai==0.41.1 crewai-tools

# Pull Ollama models (for local inference)
ollama pull gemma2:9b
ollama pull gemma3:27b
```

### Configuration

Create a `.env` file in the project root:
```
# API Keys
SERPAPI_KEY=your_serpapi_key
GOOGLE_MAPS_API_KEY=your_google_maps_key
OPENROUTER_API_KEY=your_openrouter_key  # For GPT-OSS models

# Ollama Configuration
OLLAMA_HOST=http://localhost:11434

# MCP Configuration
MCP_SERVER_TIMEOUT=30
```

## Quick Start

### Web Search Benchmark (MCP-Universe)

```bash
# Single-agent baseline (GPT-OSS 20B)
python -m mcpuniverse.benchmark.runner \
    --config benchmark/configs/test/web_search.yaml \
    --agent single \
    --model gpt-oss-20b

# Multi-agent (GPT-OSS 120B orchestrator)
python -m mcpuniverse.benchmark.runner \
    --config benchmark/configs/test/web_search.yaml \
    --agent multi \
    --model gpt-oss-120b

# Gemini 2.5 Pro baseline
python -m mcpuniverse.benchmark.runner \
    --config benchmark/configs/test/web_search.yaml \
    --agent single \
    --model gemini-2.5-pro
```

### Location Navigation Benchmark (CrewAI)

```bash
# CrewAI with Gemma 2 9B
cd crewai_navigation
python run_benchmark.py --tasks all --model gemma2:9b

# Run specific task categories
python run_benchmark.py --tasks route_planning --model gemma2:9b
python run_benchmark.py --tasks place_finding --model gemma2:9b
```

### Generate Reports

```bash
python -m mcpuniverse.benchmark.report \
    --input results/web_search/ \
    --output reports/web_search_summary.md
```

## Benchmarks

### Web Search (55 tasks)

| Task Category | # Tasks | Description |
|---|---|---|
| Factual Information | 17 | Single/multi-fact lookup, statistics |
| Comparison & Analysis | 14 | Entity comparison, rankings, trends |
| Research & Synthesis | 14 | Deep research, topic summarization |
| Current Events | 8 | News retrieval, live data |
| Specialized Search | 2 | Local search, product search |
Example task:

```json
{
  "category": "web_search",
  "question": "What is the current population of Tokyo and its GDP?",
  "mcp_servers": ["google-search", "fetch"],
  "evaluators": [
    {"func": "json -> get(population)", "op": "in_range", "value": [13000000, 14500000]}
  ]
}
```

### Location Navigation (45 tasks)

| Task Category | # Tasks | Avg Success | Description |
|---|---|---|---|
| Place Finding | 11 | 72.8% | Location discovery, coordinate search |
| Time Optimization | 9 | 68.4% | Travel time minimization |
| Route Planning | 10 | 67.5% | Multi-city itineraries |
| Distance Optimization | 10 | 67.1% | Midpoint calculation |
| Multi-modal | 5 | 0.0% | Complex real-time constraints |
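To make the evaluator spec in the example web-search task above concrete, here is a rough sketch of how an `in_range` check could be applied to an agent's JSON answer. The `get_path` and `in_range` helpers are our own simplification for illustration, not the MCP-Universe evaluator API:

```python
import json

def get_path(payload: dict, key: str):
    """Simplified stand-in for the 'json -> get(...)' extractor."""
    return payload.get(key)

def in_range(value, bounds) -> bool:
    """The 'in_range' op: value must fall within [low, high]."""
    low, high = bounds
    return low <= value <= high

# A hypothetical agent answer for the Tokyo task
answer = json.loads('{"population": 14100000, "gdp": "2.0 trillion USD"}')

spec = {"func": "get(population)", "op": "in_range", "value": [13000000, 14500000]}
passed = in_range(get_path(answer, "population"), spec["value"])
print(passed)  # True for this sample answer
```

Range-based evaluators like this let tasks tolerate answer drift (e.g., population figures that change between benchmark runs) while still scoring correctness automatically.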
## Architectures

### Web Search Multi-Agent System

```
┌────────────────────────────────────────────────────────┐
│          WebSearchOrchestrator (GPT-OSS 120B)          │
│                ReAct Loop (max 12 iter)                │
└────────────────────────────┬───────────────────────────┘
                             │
          ┌──────────────────┼──────────────────┐
          ▼                  ▼                  ▼
     ┌─────────┐        ┌─────────┐        ┌─────────┐
     │  Query  │        │ Search  │        │ Content │
     │  Agent  │        │  Agent  │        │  Fetch  │
     │  (20B)  │        │  (20B)  │        │  (20B)  │
     └────┬────┘        └────┬────┘        └────┬────┘
          │                  │                  │
          └──────────────────┼──────────────────┘
                             ▼
                   ┌──────────────────┐
                   │   MCP Servers    │
                   │ • Google Search  │
                   │ • Fetch          │
                   └──────────────────┘
```
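The hierarchical pattern above can be sketched as a bounded ReAct-style loop in which the orchestrator delegates each step to a worker agent. The worker functions and the fixed delegation plan below are illustrative stubs standing in for 20B-model calls, not the repository's actual `web_search_react.py`:

```python
from typing import Callable

# Illustrative worker stubs; real sub-agents call a 20B model plus MCP tools.
def query_agent(task: str) -> str:
    return f"optimized query for: {task}"

def search_agent(query: str) -> str:
    return f"top results for [{query}]"

def content_fetch_agent(results: str) -> str:
    return f"page content from {results}"

WORKERS: dict[str, Callable[[str], str]] = {
    "formulate": query_agent,
    "search": search_agent,
    "fetch": content_fetch_agent,
}

def orchestrate(task: str, max_iter: int = 12) -> str:
    """Bounded ReAct-style loop: pick a worker, act, feed the observation forward."""
    plan = ["formulate", "search", "fetch"]  # a fixed plan stands in for LLM reasoning
    observation = task
    for _, action in zip(range(max_iter), plan):
        observation = WORKERS[action](observation)  # act + observe
    return observation  # the real orchestrator synthesizes a final answer here

print(orchestrate("current population of Tokyo"))
```

Bounding the loop (`max 12 iter` in the diagram) is what keeps the orchestrator's LLM-call count low relative to a single agent that re-reasons about raw tool output at every step.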
### Location Navigation Multi-Agent System (CrewAI)

```
┌────────────────────────────────────────────────────────┐
│            CrewAI Orchestrator (Gemma 2 9B)            │
│  Task Analysis → Agent Selection → Response Synthesis  │
└───────────────────────────┬────────────────────────────┘
                            │
    ┌──────────────┬────────┴─────┬──────────────┐
    ▼              ▼              ▼              ▼
┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Route  │     │Distance│     │  Time  │     │ Place  │
│Planning│     │  Opt   │     │  Opt   │     │Finding │
└───┬────┘     └───┬────┘     └───┬────┘     └───┬────┘
    └──────────────┴───────┬──────┴──────────────┘
                           ▼
                ┌─────────────────────┐
                │   Google Maps MCP   │
                │      (8 tools)      │
                └─────────────────────┘
```
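The "Agent Selection" step in the diagram can be approximated as routing a task to one of the four specialists. In the real system Gemma 2 9B makes this decision inside CrewAI; the keyword table below is purely an assumption for illustration:

```python
# Keyword-based stand-in for the LLM-driven "Agent Selection" step.
SPECIALISTS = {
    "route": "route_planning_agent",
    "distance": "distance_optimization_agent",
    "time": "time_optimization_agent",
    "place": "place_finding_agent",
}

def select_agent(task: str) -> str:
    """Route a navigation task to the first matching specialist agent."""
    task_lower = task.lower()
    for keyword, agent in SPECIALISTS.items():
        if keyword in task_lower:
            return agent
    return "place_finding_agent"  # fallback specialist

print(select_agent("Find the midpoint distance between NYC and Boston"))
# -> distance_optimization_agent
```

Delegating to narrow specialists is what lets a 9B orchestrator stay competitive: each sub-task needs only a small slice of the Google Maps tool surface.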
## Results

### Web Search (MCP-Universe)

| Model (+ Architecture) | Tasks | Success Rate | Avg LLM Calls | Efficiency |
|---|---|---|---|---|
| GPT-OSS 20B (Single) | 7 | 0.0% | 168.9 | Very Low |
| +Multi-Agent (120B orch.) | 7 | 42.9% | 5.4 | High |
| Gemini 2.5 Pro (Single) | 7 | 42.9% | 8.3 | High |
**Key insight:** Multi-agent achieves the same success rate as Gemini 2.5 Pro with 31× fewer LLM calls than the single-agent baseline (5.4 vs. 168.9).
### Location Navigation

| Model (+ Framework) | Tasks | Route Plan | Time Opt | Dist Opt | Place Find | Overall |
|---|---|---|---|---|---|---|
| Gemma 2 9B + CrewAI | 45 | 67.5% | 68.4% | 67.1% | 72.8% | 68.2% |
| GPT-4.1 (baseline) | 45 | 62.5% | 81.1% | 65.2% | 88.8% | 86.7% |
**Key insight:** CrewAI + Gemma 2 9B achieves 68.2% success at 0.04% of GPT-4.1's cost ($0.05 vs. $127.50 per 1K queries).
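As a sanity check, the 0.04% figure follows directly from the per-1K-query prices quoted above:

```python
crewai_cost = 0.05   # USD per 1K queries, CrewAI + Gemma 2 9B (local inference)
gpt41_cost = 127.50  # USD per 1K queries, GPT-4.1

ratio = crewai_cost / gpt41_cost * 100  # cost as a percentage of GPT-4.1
print(f"{ratio:.2f}% of GPT-4.1's cost")  # 0.04% of GPT-4.1's cost
```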
### MCP Tool Reliability

| Tool | Calls | Success Rate |
|---|---|---|
| maps_geocode | 234 | 95.7% |
| maps_search_places | 189 | 91.2% |
| maps_directions | 167 | 89.8% |
| maps_distance_matrix | 143 | 93.4% |
| **Overall** | 733 | 92.3% |
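The overall row combines 733 calls across the four tools; a call-weighted average of the per-tool rows above lands close to the reported 92.3% (any small gap is presumably rounding in the per-tool rates):

```python
# Per-tool (call count, success rate) from the table above
tools = {
    "maps_geocode": (234, 0.957),
    "maps_search_places": (189, 0.912),
    "maps_directions": (167, 0.898),
    "maps_distance_matrix": (143, 0.934),
}

total_calls = sum(calls for calls, _ in tools.values())
weighted = sum(calls * rate for calls, rate in tools.values()) / total_calls
print(total_calls, f"{weighted:.1%}")
```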
## Citation

If you use this work, please cite:

```bibtex
@article{prasad2025dual,
  title={Dual Perspectives on LLM Agent Performance: Evaluating Agentic Architectures with MCP},
  author={Prasad, Yashdeep and Shetty, Bhumika Dinesh and Gehani, Ronit and Jagtap, Vedant},
  journal={arXiv preprint},
  year={2025}
}
```

## Team

| Name | Contribution | Contact |
|---|---|---|
| Yashdeep Prasad | MCP-Universe setup, evaluation pipeline, analysis | yp2693@nyu.edu |
| Bhumika Dinesh Shetty | AutoGen integration, location/navigation experiments | bds9746@nyu.edu |
| Ronit Gehani | Playwright/browser-automation tests, error analysis | rg4881@nyu.edu |
| Vedant Jagtap | CrewAI setup, OpenRouter routing, metric collation | vsj7589@nyu.edu |
## Acknowledgments

- MCP-Universe by Salesforce AI Research
- CrewAI framework
- NYU HPC for computing resources
- NYU NLP research group for feedback
## References

- Schick et al. (2023). Toolformer: Language Models Can Learn to Use Tools
- Yao et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models
- Hong et al. (2023). MetaGPT: Meta Programming for Multi-Agent Collaborative Framework
- Liu et al. (2023). AgentBench: Evaluating LLMs as Agents
## License

This project is licensed under the MIT License; see the LICENSE file for details.
Built with ❤️ at New York University
