Python Is All You Need? Introducing Dria-Agent-α

Community Article Published January 10, 2025

Introduction

Currently, the main way for large language models (LLMs) to interact with tools, an approach that goes by many names (tool use, function calling, etc.), is to give the LLM a specification of the available tools (including their arguments) and have it output a JSON object naming the tool(s) to call and the argument(s) to pass to them. This approach, while straightforward and reliable, limits the expressive capabilities of LLMs, which can elicit more complex reasoning and solutions through programming languages like Python. To this end, we define a framework for LLMs to use tools through Python called Pythonic Function Calling, which prompts the LLM to output its actions as Python code.

The motivations behind using Python to interact with tools are the following:

  1. Reasoning in LLMs is primarily driven by procedural knowledge in pretraining, particularly code documents. [1]
  2. LLMs equipped with the ability to use Python perform better in agentic scenarios compared to those using JSON-based function calling. [2]
  3. Python is a very popular programming language [3], likely abundant in the pretraining data of many LLMs, and its pseudocode-like syntax is close to natural language.

Example

Let's start with a simple example. Take the user query: "Can you check if I have tomorrow 10:00-12:00 available and make an appointment for a meeting with my thesis supervisor if so? If you made the appointment, please add it to my reminders." and the available functions:

def check_availability(day: str, start_time: str, end_time: str) -> bool:
    """ 
    Check if a time slot is available on a given day.

    Args:
    - day: The day to check in YYYY-MM-DD format
    - start_time: Start time in HH:MM format
    - end_time: End time in HH:MM format

    Returns:
    - True if slot is available, False otherwise
    """
    pass

def make_appointment(day: str, start_time: str, end_time: str, title: str) -> dict:
    """ 
    Make an appointment for a given time slot.

    Args:
    - day: The day to make appointment in YYYY-MM-DD format
    - start_time: Start time in HH:MM format
    - end_time: End time in HH:MM format
    - title: The title of the appointment

    Returns:
    - A dictionary with the appointment details and whether the appointment was made.
        dict keys:
            - day (str): The day the appointment is on, in YYYY-MM-DD format
            - start_time (str): Start time in HH:MM format
            - end_time (str): End time in HH:MM format
            - appointment_made (bool): Whether the appointment is successfully made or not. 
    """
    pass

def add_to_reminders(reminder_text: str) -> bool:
    """ 
    Add a text to reminders.

    Args: 
    - reminder_text: The text to add to reminders

    Returns:
    - Whether the reminder was successfully created or not.
    """
    pass

With JSON-based function calling, this would take multiple chat turns to process, as it involves two conditionals (checking whether the user has the given time slot available and checking whether the appointment with the thesis supervisor was successfully made), which would require the LLM to receive each function call's result before moving on to the next step. In Pythonic function calling, this becomes a trivial task that the LLM can perform in a single chat turn, like so:

from datetime import datetime, timedelta

# Calculate tomorrow's date
tomorrow = (datetime.now() + timedelta(days=1)).strftime("%Y-%m-%d")

# Check availability for 10:00-12:00
available = check_availability(tomorrow, "10:00", "12:00")

# If the slot is available, make the appointment
if available:
    appointment_details = make_appointment(
        day=tomorrow,
        start_time="10:00",
        end_time="12:00",
        title="Meeting with thesis supervisor"
    )

    # Add the appointment to reminders if it was successfully made
    if appointment_details["appointment_made"]:
        reminder_text = f"Meeting with thesis supervisor on {tomorrow} from 10:00-12:00"
        add_to_reminders(reminder_text)

Code Execution

Since the model generates Python code for function calling, we utilize exec-python to parse the model output and execute the code along with the provided functions. This enables straightforward integration of the Pythonic function calling approach of Dria-Agent-α, handling the execution of the generated code in a safe and controlled manner.

The execution environment provides structured output that tracks function calls, variable states, and any errors that occur during execution. For example:

x = [1, 2]
y = [2, 3]
z = pair_sum(x, y)
k = pair_sum(z, z)

produces:

{
  "function_results": {
    "pair_sum": ["z","k"]
  },
  "variables": {
    "x": [1,2],
    "y": [2,3],
    "z": [3,5],
    "k": [6,10]
  },
  "errors": []
}

This structured output is particularly valuable for multi-turn agentic conversations, as it allows the model to maintain awareness of previous computations and their results, enabling more complex reasoning chains and state-dependent decision making in subsequent interactions.
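To make this concrete, here is a minimal sketch of such an execution harness. It is not the exec-python implementation: the run_agent_code helper and its AST-based assignment tracking are illustrative simplifications that only handle plain `name = tool(...)` assignments, but they reproduce the structured output shown above for the pair_sum example.

import ast

def run_agent_code(code: str, tools: dict) -> dict:
    """Execute model-generated code next to the provided tools and return a
    structured trace of tool-call assignments, final variables, and errors."""
    # Map each known tool to the variable names its results were assigned to,
    # via a simple AST pass (only handles `name = tool(...)` assignments).
    function_results = {name: [] for name in tools}
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            func = node.value.func
            if isinstance(func, ast.Name) and func.id in function_results:
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        function_results[func.id].append(target.id)

    # Run the code with the tools in scope, capturing runtime failures.
    env = dict(tools)
    errors = []
    try:
        exec(code, env)
    except Exception as exc:
        errors.append(repr(exc))

    # Keep only the user-defined variables (drop the tools and builtins).
    variables = {
        name: value for name, value in env.items()
        if name not in tools and not name.startswith("__")
    }
    return {"function_results": function_results, "variables": variables, "errors": errors}

# Reproducing the pair_sum example from above:
tools = {"pair_sum": lambda a, b: [i + j for i, j in zip(a, b)]}
code = "x = [1, 2]\ny = [2, 3]\nz = pair_sum(x, y)\nk = pair_sum(z, z)"
print(run_agent_code(code, tools))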

Methodology

Dria-Agent-α was developed using synthetic data generated through Dria. Dria is a network of LLMs operating in a distributed system, providing high-throughput and powerful pipeline tools for data generation across diverse models.

We designed a framework that creates realistic scenarios requiring complex problem-solving skills, challenging the model to break down problems into manageable steps and use the provided functions effectively, mimicking real-world tool use. The synthetic data generation pipeline, which we plan to release before February 2025 after code cleanup, consists of the following steps (a sketch of how they chain together follows the list):

  1. Manually define categories and subcategories that represent different domains of tool usage (as shown in the repository structure with domain/subdomain pairs)
  2. Generate synthetic scenarios using a multi-stage pipeline
  3. Mock function generation
  4. User query generation
  5. Validation of mock functions
  6. Validation by code execution
  7. Final dataset compilation
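
As referenced above, the following is a hypothetical sketch of how these stages could chain together for each domain/subdomain pair. The stage callables are placeholders standing in for LLM-backed pipeline steps, not the actual Dria pipeline API.

from typing import Callable

def build_entries(
    categories: list[tuple[str, str]],                   # step 1: (domain, subdomain) pairs
    gen_scenario: Callable[[str, str], dict],            # step 2: scenario generation
    gen_mock_functions: Callable[[dict], str],           # step 3: mock function generation
    gen_user_query: Callable[[dict, str], str],          # step 4: user query generation
    validate_functions: Callable[[str], bool],           # step 5: mock function validation
    validate_by_execution: Callable[[str, str], bool],   # step 6: execution-based validation
) -> list[dict]:
    """Chain the pipeline stages and compile the surviving entries (step 7)."""
    dataset = []
    for domain, subdomain in categories:
        scenario = gen_scenario(domain, subdomain)
        mock_functions = gen_mock_functions(scenario)
        user_query = gen_user_query(scenario, mock_functions)
        if not validate_functions(mock_functions):
            continue
        if not validate_by_execution(user_query, mock_functions):
            continue
        dataset.append({"user_query": user_query, "mock_functions": mock_functions})
    return dataset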

Data Anatomy

Our training data consists of several components:

  • User queries
  • Python functions with docstrings and parameters (along with their JSON equivalents)
  • Mock function implementations that produce expected outputs for correct parameters and different outputs for incorrect ones
  • Checklists validating required function calls and their expected outputs for each query

Sample entry:

{
    "difficulty": "hard",
    "function_schema_python": "def get_database_metrics(server_id: str, time_range: str) -> dict:\n    \"\"\"Retrieves performance metrics for a specified database server.\n\n    :param server_id: The unique identifier of the database server.\n    :param time_range: Time range for metrics (e.g., '1h', '24h', '7d').\n    :return: Dictionary containing performance metrics.\n    :raises ValueError: If server_id is invalid or time_range is unsupported.\"\"\"\n    pass\ndef analyze_query_performance(query_logs: list) -> dict:\n    \"\"\"Analyzes database query performance from provided logs.\n\n    :param query_logs: List of query log entries.\n    :return: Dictionary containing query analysis results.\n    :raises ValueError: If query_logs is empty or invalid.\"\"\"\n    pass\ndef get_scaling_recommendations(metrics: dict) -> dict:\n    \"\"\"Provides scaling recommendations based on performance metrics.\n\n    :param metrics: Dictionary containing performance metrics.\n    :return: Dictionary containing scaling recommendations.\n    :raises ValueError: If metrics are invalid or incomplete.\"\"\"\n    pass\n",
    "function_schema_json": [
        {
            "name": "get_database_metrics",
            "description": "Retrieves performance metrics for a specified database server.",
            "parameters": {
                "type": "object",
                "properties": {
                    "server_id": {
                        "type": "string",
                        "description": "The unique identifier of the database server."
                    },
                    "time_range": {
                        "type": "string",
                        "description": "Time range for metrics (e.g., '1h', '24h', '7d')."
                    }
                },
                "required": [
                    "server_id",
                    "time_range"
                ],
                "additionalProperties": false
            }
        },
        {
            "name": "analyze_query_performance",
            "description": "Analyzes database query performance from provided logs.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query_logs": {
                        "type": "array",
                        "description": "List of query log entries."
                    }
                },
                "required": [
                    "query_logs"
                ],
                "additionalProperties": false
            }
        },
        {
            "name": "get_scaling_recommendations",
            "description": "Provides scaling recommendations based on performance metrics.",
            "parameters": {
                "type": "object",
                "properties": {
                    "metrics": {
                        "type": "object",
                        "description": "Dictionary containing performance metrics."
                    }
                },
                "required": [
                    "metrics"
                ],
                "additionalProperties": false
            }
        }
    ],
    "mock_functions": "def get_database_metrics(server_id: str, time_range: str) -> dict:\n    \"\"\"\n    Retrieves performance metrics for a specified database server.\n    \n    :param server_id: The unique identifier of the database server.\n    :param time_range: Time range for metrics (e.g., '1h', '24h', '7d').\n    :return: Dictionary containing performance metrics.\n    :raises ValueError: If server_id is invalid or time_range is unsupported.\n    \"\"\"\n    if not server_id or not time_range:\n        raise ValueError(\"Server ID and time range must be provided.\")\n    \n    if server_id == \"db-prod-01\" and time_range in ['1h', '24h', '7d']:\n        return {\n            \"latency\": 250.5,  # milliseconds\n            \"cpu_utilization\": 85.5,  # percentage\n            \"memory_usage\": 78.3,  # percentage\n            \"disk_io\": 750.2,  # IOPS\n            \"active_connections\": 1250,\n            \"slow_queries\": 45\n        }\n    raise ValueError(\"Invalid server ID or time range.\")\ndef analyze_query_performance(query_logs: list) -> dict:\n    \"\"\"\n    Analyzes database query performance from provided logs.\n    \n    :param query_logs: List of query log entries.\n    :return: Dictionary containing query analysis results.\n    :raises ValueError: If query_logs is empty or invalid.\n    \"\"\"\n    if not query_logs or not isinstance(query_logs, list):\n        raise ValueError(\"Valid query logs must be provided.\")\n    \n    sample_logs = [\n        \"SELECT * FROM users WHERE id = 1\",\n        \"SELECT * FROM orders WHERE created_at > '2023-01-01'\"\n    ]\n    \n    if any(log in sample_logs for log in query_logs):\n        return {\n            \"slow_queries\": [\n                {\n                    \"query\": \"SELECT * FROM orders WHERE created_at > '2023-01-01'\",\n                    \"avg_execution_time\": 2.5,  # seconds\n                    \"frequency\": 120,  # per hour\n                    \"optimization_suggestions\": [\"Add index on created_at column\"]\n                }\n            ],\n            \"total_analyzed\": len(query_logs),\n            \"performance_impact\": \"high\"\n        }\n    return {}\ndef get_scaling_recommendations(metrics: dict) -> dict:\n    \"\"\"\n    Provides scaling recommendations based on performance metrics.\n    \n    :param metrics: Dictionary containing performance metrics.\n    :return: Dictionary containing scaling recommendations.\n    :raises ValueError: If metrics are invalid or incomplete.\n    \"\"\"\n    required_keys = [\"latency\", \"cpu_utilization\", \"memory_usage\"]\n    if not all(key in metrics for key in required_keys):\n        raise ValueError(\"Metrics must contain latency, CPU, and memory data.\")\n    \n    if metrics[\"cpu_utilization\"] > 80 and metrics[\"latency\"] > 200:\n        return {\n            \"recommendation\": \"scale_up\",\n            \"current_instance\": \"db.r5.2xlarge\",\n            \"suggested_instance\": \"db.r5.4xlarge\",\n            \"estimated_cost_increase\": 250.0,  # USD per month\n            \"expected_performance_gain\": 40,  # percentage\n            \"implementation_priority\": \"high\"\n        }\n    return {\n        \"recommendation\": \"optimize\",\n        \"suggested_actions\": [\"Query optimization\", \"Index tuning\"]\n    }",
    "user_query": "This is Emily from the startup. Please fetch the last 24 hours of performance metrics for 'db-prod-01' and provide scaling recommendations based on the CPU and latency information.",
    "checklist": {
        "functions": [
            "get_database_metrics",
            "get_scaling_recommendations"
        ],
        "values": [
            {
                "latency": 250.5,
                "cpu_utilization": 85.5,
                "memory_usage": 78.3,
                "disk_io": 750.2,
                "active_connections": 1250,
                "slow_queries": 45
            },
            {
                "recommendation": "scale_up",
                "current_instance": "db.r5.2xlarge",
                "suggested_instance": "db.r5.4xlarge",
                "estimated_cost_increase": 250,
                "expected_performance_gain": 40,
                "implementation_priority": "high"
            }
        ]
    }
}

We generated our training data synthetically with two primary objectives:

  • Robust performance on out-of-distribution (OOD) queries
  • Ability to solve complex, multi-tool problems in a single shot

A major challenge was creating a comprehensive curriculum for better generalization. We focused on real-world use cases, particularly emphasizing developer-centric scenarios since they constitute a significant portion of agentic requests.

Traditional approaches used curriculum elements as seeds for generating user queries. However, this method led to several issues in our context:

  • Incorrect function implementations
  • Infeasible mock logic
  • Inaccurate checklist items
  • User queries lacking sufficient information for proper function parameter selection

To address these challenges, we developed a scenario-first approach. We generate detailed scenarios based on curriculum items, including:

  • User background information
  • Context-specific details
  • Relevant supplementary information

This approach enables us to create both mock functions and user queries with sufficient context, avoiding:

  • Infeasible logic in mock functions
  • Information gaps in user queries

Data Validations

Our most significant improvements came from implementing fine-tuned validators and execution feedback-based validation similar to RLEF [4].

While O1 excelled at validating mock functions and scenario feasibility, it wasn't economically viable to validate the entire training dataset with it. Instead, we:

  1. Generated a validation dataset using O1
  2. Experimented with different models and validation methods
  3. Compared their performance against O1's validations

Beam Search with Process Reward Models

Our first approach involved scaling Test-Time Compute (TTC) using smaller models (a sketch of the search procedure follows the list):

  • Llama3.1 8B achieved 38% agreement with O1's validation set
  • Using TTC with a beam size of 16 and Qwen2.5-Coder-32B-Instruct as the process reward model improved performance to ~65%
  • Beam sizes up to 64 showed improvements but came with significant computational overhead
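
The exact search configuration isn't spelled out above, but the following is a minimal sketch of step-level beam search guided by a process reward model. Here, extend stands in for the small generator (e.g. Llama3.1 8B proposing next reasoning steps) and score for the PRM (e.g. Qwen2.5-Coder-32B-Instruct); both are placeholder callables.

import heapq
from typing import Callable

def prm_beam_search(
    prompt: str,
    extend: Callable[[str], list[str]],   # generator: propose next-step continuations of a partial solution
    score: Callable[[str], float],        # process reward model: score a partial solution
    is_complete: Callable[[str], bool],   # stop condition for a candidate
    beam_size: int = 16,
    max_steps: int = 8,
) -> str:
    """Step-level beam search: expand each beam, keep the PRM's top `beam_size` candidates."""
    beams = [prompt]
    for _ in range(max_steps):
        candidates = []
        for beam in beams:
            if is_complete(beam):
                candidates.append(beam)        # finished candidates survive unchanged
            else:
                candidates.extend(extend(beam))
        beams = heapq.nlargest(beam_size, candidates, key=score)
        if all(is_complete(beam) for beam in beams):
            break
    return max(beams, key=score)

Raising beam_size toward 64 trades more PRM calls per step for the accuracy gains noted above.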

In-Context Learning

The second approach was to use in-context learning (ICL) to bootstrap reasoning capabilities in more cost-effective models. We created a DSPy-optimized few-shot prompt using O1 outputs and established a model pool consisting of Qwen2.5-Coder-32B-Instruct and Claude Sonnet. The system generated validation outputs with detailed rationales, routing simpler examples to smaller models while directing complex cases to larger models. This hierarchical approach achieved approximately 80% agreement with O1's validations while maintaining computational efficiency.
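
As a rough illustration of the routing logic only (the callables below are placeholders for the few-shot prompted models and the difficulty heuristic; the actual setup used a DSPy-optimized prompt):

from typing import Callable

def route_validation(
    entry: dict,
    is_complex: Callable[[dict], bool],        # heuristic or classifier for example difficulty
    small_validator: Callable[[dict], dict],   # cheaper model with the few-shot validation prompt
    large_validator: Callable[[dict], dict],   # larger model reserved for complex cases
) -> dict:
    """Send complex entries to the larger model, everything else to the cheaper one."""
    use_large = is_complex(entry)
    verdict = (large_validator if use_large else small_validator)(entry)
    verdict["validated_by"] = "large" if use_large else "small"
    return verdict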

Validation by Code Execution

The final validation step involved implementing an execution feedback loop. Using Qwen2.5-Coder-32B-Instruct, we executed the generated solutions, collecting both stack traces and checklist scores as feedback. This process was iterated up to three times per problem. We retained only the entries that achieved a checklist output score above 0.75, ensuring high-quality solutions in our final dataset. This execution-based validation helped eliminate solutions that were syntactically correct but failed to meet the functional requirements of the tasks.
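
A minimal sketch of this retention filter is shown below. The generate and execute callables are placeholders (the generator model and an execution harness returning the structured trace shown earlier), and the checklist scoring is a simplified guess at how required function calls and expected values could be checked.

from typing import Callable, Optional

def checklist_score(trace: dict, checklist: dict) -> float:
    """Fraction of checklist items satisfied by an execution trace of the
    {function_results, variables, errors} shape shown earlier."""
    called = set(trace["function_results"])
    function_hits = sum(name in called for name in checklist["functions"])
    value_hits = sum(
        any(value == expected for value in trace["variables"].values())
        for expected in checklist["values"]
    )
    total = len(checklist["functions"]) + len(checklist["values"])
    return (function_hits + value_hits) / total if total else 0.0

def validate_entry(
    entry: dict,
    generate: Callable[[str, str, str], str],   # (user_query, mock_functions, feedback) -> solution code
    execute: Callable[[str, str], dict],        # (solution code, mock_functions) -> execution trace
    max_iterations: int = 3,
    threshold: float = 0.75,
) -> Optional[str]:
    """Iterate with execution feedback; keep the entry only if it clears the threshold."""
    feedback = ""
    for _ in range(max_iterations):
        solution = generate(entry["user_query"], entry["mock_functions"], feedback)
        trace = execute(solution, entry["mock_functions"])
        score = checklist_score(trace, entry["checklist"])
        if score > threshold:
            return solution
        # Feed stack traces (or the score) back into the next generation attempt.
        feedback = "\n".join(trace["errors"]) or f"checklist score: {score:.2f}"
    return None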

Models

We've trained two models so far, based on Qwen2.5-Coder-3B-Instruct and Qwen2.5-Coder-7B-Instruct. Together they form Dria-Agent-α, the first generation of agentic models released by Dria. The models, Dria-Agent-α-3B and Dria-Agent-α-7B, are available on Hugging Face.
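
For reference, the models can be loaded with the standard transformers APIs. The repository id below is an assumption based on the model naming; check the model cards for the exact ids and for the expected prompt format, which includes the available function definitions.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; see the Hugging Face model cards for the exact ids.
model_id = "driaforall/Dria-Agent-a-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# The available functions (e.g. the scheduling tools from the example above)
# would normally be included in the prompt per the model card's chat template.
messages = [
    {"role": "user", "content": "Check if tomorrow 10:00-12:00 is free and book a meeting with my thesis supervisor if so."},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))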

Future Work

This is the first iteration of our framework, and we're already working on the next one, which will incorporate methods from RLEF [4] and rStar-Math [5]. These models are released to showcase the capabilities of Pythonic function calling and to lead the way for future generations of Dria-Agent models.

References