
Dual Perspectives on LLM Agent Performance: Evaluating Agentic Architectures with MCP

Python 3.11+ | License: MIT | MCP-Universe

> How does agent architecture affect performance when the tool surface is held constant via MCP?

This repository contains the code, benchmarks, and analysis for our research on comparing single-agent vs. multi-agent architectures for LLM tool use, evaluated on standardized Model Context Protocol (MCP) benchmarks.

Architecture Overview


🎯 Key Findings

| Finding | Description |
|---|---|
| Multi-agent matches frontier models | GPT-OSS 120B multi-agent achieves 42.9% success (same as Gemini 2.5 Pro) with 31× fewer LLM calls |
| Architecture compensates for model size | Multi-agent orchestration enables smaller models to match larger single-agent performance |
| MCP enables fair comparison | 92.3% tool success rate confirms performance differences are architectural, not tool-related |
| Exceptional cost efficiency | CrewAI + Gemma 2 9B achieves 68.2% success at 0.04% of GPT-4.1's cost |

🔬 Research Questions

  1. Which agent architecture demonstrates superior performance across multi-domain MCP tasks?
  2. Does MCP standardization enable fair architecture comparison?
  3. Which task types benefit most from multi-agent coordination?
  4. What is the cost-performance tradeoff for local vs. cloud models?

πŸ“ Project Structure

```
NLP_MCP_AGENTS/
├── MCP-Universe/                    # Forked MCP-Universe benchmark
│   └── mcpuniverse/
│       ├── benchmark/
│       │   ├── configs/
│       │   │   └── test/
│       │   │       ├── web_search.yaml
│       │   │       ├── location_navigation.yaml
│       │   │       └── web_search/  # 55 task definitions
│       │   ├── runner.py
│       │   └── report.py
│       ├── agents/
│       │   ├── web_search_react.py          # Web search orchestrator
│       │   ├── query_formulation_agent.py   # Query optimization
│       │   ├── search_execution_agent.py    # Search execution
│       │   ├── content_fetch_agent.py       # Content retrieval
│       │   └── fact_verification_agent.py   # Fact verification
│       └── mcp/
│           └── servers/
│               ├── google_search/   # SerpAPI integration
│               └── fetch/           # Content fetching
├── crewai_navigation/               # CrewAI location navigation
│   ├── agents/
│   │   ├── orchestrator.py
│   │   ├── route_planning_agent.py
│   │   ├── distance_optimization_agent.py
│   │   ├── time_optimization_agent.py
│   │   └── place_finding_agent.py
│   ├── tools/
│   │   └── google_maps_tools.py     # 8 MCP-wrapped tools
│   └── llm_config.py                # Ollama/Gemma configuration
├── results/                         # Benchmark outputs
│   ├── web_search/
│   │   ├── GPTOSS120B_multiagent.md
│   │   ├── GPTOSS20B_single.md
│   │   └── Gemini2.5_Pro.md
│   └── location_navigation/
│       └── crewai_gemma2_9b.md
├── docs/
│   ├── WEB_SEARCH_ARCHITECTURE.md
│   ├── LOCATION_NAVIGATION_ARCHITECTURE.md
│   └── figures/
└── paper/
    ├── acl_paper.tex
    └── acl_paper.pdf
```

🛠 Installation

Prerequisites

  • Python 3.11+
  • Ollama (for local model inference)
  • Google Cloud API key (for Maps API)
  • SerpAPI key (for web search)

Setup

```bash
# Clone the repository
git clone https://github.com/prasad-yashdeep/NLP_MCP_AGENTS.git
cd NLP_MCP_AGENTS

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install MCP-Universe
cd MCP-Universe
pip install -e .
cd ..

# Install CrewAI components
pip install crewai==0.41.1 crewai-tools

# Pull Ollama models (for local inference)
ollama pull gemma2:9b
ollama pull gemma3:27b
```

Environment Variables

Create a .env file in the project root:

```bash
# API Keys
SERPAPI_KEY=your_serpapi_key
GOOGLE_MAPS_API_KEY=your_google_maps_key
OPENROUTER_API_KEY=your_openrouter_key  # For GPT-OSS models

# Ollama Configuration
OLLAMA_HOST=http://localhost:11434

# MCP Configuration
MCP_SERVER_TIMEOUT=30
```
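If you prefer not to add `python-dotenv` as a dependency, the simple `KEY=VALUE` format above can be read with a few lines of standard-library Python. The `load_env` helper below is a hypothetical sketch, not part of this repository:

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file (hypothetical helper).

    Skips blank lines and full-line comments, and strips inline '#' comments.
    """
    values = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.split("#", 1)[0].strip()
    return values

def apply_env(values):
    """Export parsed values without clobbering variables already set in the shell."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```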

🚀 Quick Start

Run Web Search Benchmark

```bash
# Single-agent baseline (GPT-OSS 20B)
python -m mcpuniverse.benchmark.runner \
    --config benchmark/configs/test/web_search.yaml \
    --agent single \
    --model gpt-oss-20b

# Multi-agent (GPT-OSS 120B orchestrator)
python -m mcpuniverse.benchmark.runner \
    --config benchmark/configs/test/web_search.yaml \
    --agent multi \
    --model gpt-oss-120b

# Gemini 2.5 Pro baseline
python -m mcpuniverse.benchmark.runner \
    --config benchmark/configs/test/web_search.yaml \
    --agent single \
    --model gemini-2.5-pro
```

Run Location Navigation Benchmark

```bash
# CrewAI with Gemma 2 9B
cd crewai_navigation
python run_benchmark.py --tasks all --model gemma2:9b

# Run specific task categories
python run_benchmark.py --tasks route_planning --model gemma2:9b
python run_benchmark.py --tasks place_finding --model gemma2:9b
```

Generate Benchmark Report

```bash
python -m mcpuniverse.benchmark.report \
    --input results/web_search/ \
    --output reports/web_search_summary.md
```

📊 Benchmarks

Web Search (MCP-Universe)

| Task Category | # Tasks | Description |
|---|---|---|
| Factual Information | 17 | Single/multi-fact lookup, statistics |
| Comparison & Analysis | 14 | Entity comparison, rankings, trends |
| Research & Synthesis | 14 | Deep research, topic summarization |
| Current Events | 8 | News retrieval, live data |
| Specialized Search | 2 | Local search, product search |

Example Task:

```json
{
  "category": "web_search",
  "question": "What is the current population of Tokyo and its GDP?",
  "mcp_servers": ["google-search", "fetch"],
  "evaluators": [
    {"func": "json -> get(population)", "op": "in_range", "value": [13000000, 14500000]}
  ]
}
```
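The evaluator entry above reads: parse the agent's final answer as JSON, extract `population`, and check that it falls in the given range. A minimal sketch of that check, with illustrative names rather than MCP-Universe's actual evaluator API:

```python
import json

def eval_in_range(answer: str, key: str, bounds) -> bool:
    """Sketch of an 'in_range' evaluator: pull `key` from a JSON answer
    and test it against [lo, hi]. Illustrative only."""
    lo, hi = bounds
    try:
        value = json.loads(answer)[key]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # unparseable answer or missing field counts as failure
    return lo <= value <= hi
```

For the Tokyo task, `eval_in_range('{"population": 13960000}', "population", [13000000, 14500000])` passes, while a missing or malformed answer fails rather than raising.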

Location Navigation (MCP-Universe)

| Task Category | # Tasks | Avg Success | Description |
|---|---|---|---|
| Place Finding | 11 | 72.8% | Location discovery, coordinate search |
| Time Optimization | 9 | 68.4% | Travel time minimization |
| Route Planning | 10 | 67.5% | Multi-city itineraries |
| Distance Optimization | 10 | 67.1% | Midpoint calculation |
| Multi-modal | 5 | 0% | Complex real-time constraints |

πŸ— Architectures

Web Search Multi-Agent (OpenAI Agents SDK)

```
┌─────────────────────────────────────────────┐
│    WebSearchOrchestrator (GPT-OSS 120B)     │
│          ReAct Loop (max 12 iter)           │
└──────────────────────┬──────────────────────┘
                       │
     ┌─────────────────┼─────────────────┐
     ▼                 ▼                 ▼
┌─────────┐       ┌─────────┐       ┌─────────┐
│ Query   │       │ Search  │       │ Content │
│ Agent   │       │ Agent   │       │ Fetch   │
│ (20B)   │       │ (20B)   │       │ (20B)   │
└────┬────┘       └────┬────┘       └────┬────┘
     │                 │                 │
     └─────────────────┼─────────────────┘
                       ▼
            ┌──────────────────┐
            │   MCP Servers    │
            │ • Google Search  │
            │ • Fetch          │
            └──────────────────┘
```
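The orchestrator's control flow is a standard ReAct loop capped at 12 iterations: at each step the 120B model either delegates to one of the 20B sub-agents or emits a final answer. A simplified sketch of that loop, where the dict-based decision protocol and function names are assumptions for illustration, not the repo's actual interfaces:

```python
def orchestrate(question, decide, subagents, max_iter=12):
    """Simplified ReAct loop: `decide` stands in for the GPT-OSS 120B
    orchestrator; `subagents` maps action names to the 20B worker agents."""
    scratchpad = []  # (decision, observation) history fed back each turn
    for _ in range(max_iter):
        step = decide(question, scratchpad)      # Thought -> Action
        if step["action"] == "final_answer":
            return step["answer"]
        observation = subagents[step["action"]](step["input"])
        scratchpad.append((step, observation))
    return None  # iteration budget exhausted without an answer
```

The `max_iter=12` cap mirrors the "max 12 iter" bound in the diagram; exceeding it counts as task failure.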

Location Navigation Multi-Agent (CrewAI)

```
┌─────────────────────────────────────────────────────────┐
│           CrewAI Orchestrator (Gemma 2 9B)              │
│  Task Analysis → Agent Selection → Response Synthesis   │
└────────────────────────────┬────────────────────────────┘
                             │
    ┌──────────────┬─────────┴────┬──────────────┐
    ▼              ▼              ▼              ▼
┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Route  │     │Distance│     │  Time  │     │ Place  │
│Planning│     │  Opt   │     │  Opt   │     │Finding │
└───┬────┘     └───┬────┘     └───┬────┘     └───┬────┘
    └──────────────┴─────────┬────┴──────────────┘
                             │
                  ┌──────────▼──────────┐
                  │  Google Maps MCP    │
                  │  (8 tools)          │
                  └─────────────────────┘
```
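In the real pipeline the "Task Analysis → Agent Selection" step is LLM-driven (Gemma 2 9B decides which specialist handles the task). Purely to illustrate the shape of that routing decision, here is a keyword-table stand-in; the table and `select_agent` helper are illustrative, not the repository's implementation:

```python
# Illustrative only: the orchestrator actually uses Gemma 2 9B for task
# analysis; this keyword table just sketches the analysis -> selection step.
ROUTES = [
    ("itinerary", "route_planning"),
    ("route", "route_planning"),
    ("fastest", "time_optimization"),
    ("travel time", "time_optimization"),
    ("midpoint", "distance_optimization"),
    ("distance", "distance_optimization"),
    ("near", "place_finding"),
    ("find", "place_finding"),
]

def select_agent(task: str, default: str = "place_finding") -> str:
    """Return the first specialist whose keyword appears in the task text."""
    text = task.lower()
    for keyword, agent in ROUTES:
        if keyword in text:
            return agent
    return default
```

The ordered list means more specific cues ("itinerary") win over generic ones ("find"), which is roughly the behavior one wants from the LLM selector as well.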

📈 Results

Web Search Benchmark

| Model (+ Architecture) | Tasks | Success Rate | Avg LLM Calls | Efficiency |
|---|---|---|---|---|
| GPT-OSS 20B (Single) | 7 | 0.0% | 168.9 | Very Low |
| + Multi-Agent (120B orch.) | 7 | 42.9% | 5.4 | High |
| Gemini 2.5 Pro (Single) | 7 | 42.9% | 8.3 | High |

**Key Insight:** The multi-agent configuration matches Gemini 2.5 Pro's 42.9% success rate while making 31× fewer LLM calls than the single-agent baseline (5.4 vs. 168.9 per task).

Location Navigation Benchmark

| Model (+ Framework) | Tasks | Route Plan | Time Opt | Dist Opt | Place Find | Overall |
|---|---|---|---|---|---|---|
| Gemma 2 9B + CrewAI | 45 | 67.5% | 68.4% | 67.1% | 72.8% | 68.2% |
| GPT-4.1 (baseline) | 45 | 62.5% | 81.1% | 65.2% | 88.8% | 86.7% |

**Key Insight:** CrewAI + Gemma 2 9B reaches 68.2% success at roughly 0.04% of GPT-4.1's cost ($0.05 vs. $127.50 per 1K queries).
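The 0.04% figure follows directly from the per-1K-query costs quoted above:

```python
crewai_cost = 0.05    # USD per 1K queries, CrewAI + Gemma 2 9B (local inference)
gpt41_cost = 127.50   # USD per 1K queries, GPT-4.1

ratio = crewai_cost / gpt41_cost
print(f"{ratio:.2%}")  # -> 0.04%
```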

MCP Tool Success Rates

| Tool | Calls | Success Rate |
|---|---|---|
| maps_geocode | 234 | 95.7% |
| maps_search_places | 189 | 91.2% |
| maps_directions | 167 | 89.8% |
| maps_distance_matrix | 143 | 93.4% |
| **Overall** | **733** | **92.3%** |

📄 Citation

If you use this work, please cite:

```bibtex
@article{prasad2025dual,
  title={Dual Perspectives on LLM Agent Performance: Evaluating Agentic Architectures with MCP},
  author={Prasad, Yashdeep and Shetty, Bhumika Dinesh and Gehani, Ronit and Jagtap, Vedant},
  journal={arXiv preprint},
  year={2025}
}
```

👥 Team

| Name | Contribution | Contact |
|---|---|---|
| Yashdeep Prasad | MCP-Universe setup, evaluation pipeline, analysis | yp2693@nyu.edu |
| Bhumika Dinesh Shetty | AutoGen integration, location/navigation experiments | bds9746@nyu.edu |
| Ronit Gehani | Playwright/browser-automation tests, error analysis | rg4881@nyu.edu |
| Vedant Jagtap | CrewAI setup, OpenRouter routing, metric collation | vsj7589@nyu.edu |

πŸ™ Acknowledgments

  • MCP-Universe by Salesforce AI Research
  • CrewAI framework
  • NYU HPC for computing resources
  • NYU NLP research group for feedback

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


Built with ❤️ at New York University
