Open-Source AI Cookbook documentation

Data analyst agent: get your data’s insights in the blink of an eye ✨

Authored by: Aymeric Roucher

This tutorial is advanced. You should be familiar with the concepts from this other cookbook first!

In this notebook we will build a data analyst agent: a Code agent armed with data analysis libraries that can load and transform dataframes to extract insights from your data, and even plot the results!

Let’s say I want to analyze the data from the Kaggle Titanic challenge in order to predict the survival of individual passengers. But before digging into this myself, I want an autonomous agent to prepare the analysis for me by extracting trends and plotting some figures to find insights.

Let’s set up this system.

Run the line below to install the required dependencies:

!pip install seaborn smolagents transformers -q -U

We first create the agent. We use a CodeAgent (read the documentation to learn more about types of agents), so we do not even need to give it any tools: it can directly run its own code.

We simply make sure to let it use data science-related libraries by passing these in additional_authorized_imports: ["numpy", "pandas", "matplotlib.pyplot", "seaborn"].

In general, when passing libraries in additional_authorized_imports, make sure they are installed in your local environment, since the Python interpreter can only use libraries that are installed there.
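As a quick sanity check before creating the agent, you could verify that every authorized library is actually importable. The snippet below is only a minimal sketch: the helper name missing_packages is hypothetical (it is not part of smolagents), and it only checks the top-level package of each entry.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level packages that cannot be found in this environment.

    Hypothetical helper for illustration; not part of smolagents.
    """
    missing = []
    for name in module_names:
        top_level = name.split(".")[0]  # "matplotlib.pyplot" -> "matplotlib"
        if importlib.util.find_spec(top_level) is None and top_level not in missing:
            missing.append(top_level)
    return missing


authorized = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]
print("Missing packages:", missing_packages(authorized))
```

If the printed list is not empty, install the listed packages with pip before running the agent.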

⚙ Our agent will be powered by meta-llama/Llama-3.1-70B-Instruct through the HfApiModel class, which uses HF’s Inference API: the Inference API lets you quickly and easily run any open model, for free!

from smolagents import HfApiModel, CodeAgent
from huggingface_hub import login
import os

login(os.getenv("HUGGINGFACEHUB_API_TOKEN"))

model = HfApiModel("meta-llama/Llama-3.1-70B-Instruct")

agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=["numpy", "pandas", "matplotlib.pyplot", "seaborn"],
    max_iterations=10,
)

Data analysis 📊🤔

When running the agent, we provide it with additional notes taken directly from the competition, and pass them to the run method through the additional_args argument:

import os

os.makedirs("./figures", exist_ok=True)  # does not fail if the folder already exists
>>> additional_notes = """
... ### Variable Notes
... pclass: A proxy for socio-economic status (SES)
... 1st = Upper
... 2nd = Middle
... 3rd = Lower
... age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
... sibsp: The dataset defines family relations in this way...
... Sibling = brother, sister, stepbrother, stepsister
... Spouse = husband, wife (mistresses and fiancés were ignored)
... parch: The dataset defines family relations in this way...
... Parent = mother, father
... Child = daughter, son, stepdaughter, stepson
... Some children travelled only with a nanny, therefore parch=0 for them.
... """

>>> analysis = agent.run(
...     """You are an expert data analyst.
... Please load the source file and analyze its content.
... According to the variables you have, begin by listing 3 interesting questions that could be asked on this data, for instance about specific correlations with survival rate.
... Then answer these questions one by one, by finding the relevant numbers.
... Meanwhile, plot some figures using matplotlib/seaborn and save them to the (already existing) folder './figures/': take care to clear each figure with plt.clf() before doing another plot.

... In your final answer: summarize these correlations and trends
... After each number derive real-world insights, for instance: "Correlation between is_december and boredness is 1.3453, which suggests people are more bored in winter".
... Your final answer should have at least 3 numbered and detailed parts.
... """,
...     additional_args=dict(additional_notes=additional_notes, source_file="titanic/train.csv"),
... )
>>> print(analysis)
The analysis of the Titanic data reveals that socio-economic status and sex are significant factors in determining survival rates. Passengers with lower socio-economic status and males are less likely to survive. The age of a passenger has a minimal impact on their survival rate.

Impressive, isn’t it? You could also provide your agent with a visualizer tool to let it reflect upon its own graphs!
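Since the summary is written by an LLM, it can be worth spot-checking its claims yourself. Here is a minimal sketch of such a check with pandas. The rows below are toy values made up for illustration, not the real Titanic data, and the helper survival_rate_by is hypothetical. The column names (Survived, Pclass, Sex, Age) are those of the Kaggle train.csv file.

```python
import pandas as pd

# Toy rows for illustration only: these are NOT taken from the real dataset.
toy = pd.DataFrame(
    {
        "Survived": [1, 0, 1, 0, 0, 1, 0, 1],
        "Pclass": [1, 3, 1, 3, 2, 2, 3, 1],
        "Sex": ["female", "male", "female", "male", "male", "female", "male", "male"],
        "Age": [29.0, 22.0, 38.0, 35.0, 54.0, 27.0, 2.0, 40.0],
    }
)


def survival_rate_by(df, column):
    """Mean of the 0/1 Survived column per group, i.e. the survival rate."""
    return df.groupby(column)["Survived"].mean()


by_sex = survival_rate_by(toy, "Sex")
by_class = survival_rate_by(toy, "Pclass")
# Pearson correlation between age and survival, always within [-1, 1].
age_corr = toy["Age"].corr(toy["Survived"])

print(by_sex)
print(by_class)
print(f"Correlation between Age and Survived: {age_corr:.3f}")
```

To run the same check on the real data, replace the toy dataframe with pd.read_csv("titanic/train.csv") and compare the numbers with the agent’s summary.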

Data scientist agent: Run predictions 🛠️

👉 Now let’s dig further: we will let our model perform predictions on the data.

To do so, we also add sklearn to the additional_authorized_imports.

agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=[
        "numpy",
        "pandas",
        "matplotlib.pyplot",
        "seaborn",
        "sklearn",
    ],
    max_iterations=12,
)

output = agent.run(
    """You are an expert machine learning engineer.
Please train a ML model on "titanic/train.csv" to predict the survival for rows of "titanic/test.csv".
Output the results under './output.csv'.
Take care to import functions and modules before using them!
""",
    additional_args=dict(additional_notes=additional_notes + "\n" + analysis),
)

Even though the agent got a few errors, it managed to correctly solve the problem in the end!
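The code written by the agent is not reproduced here, and it will differ from run to run. To give an idea of what solving this task involves, here is a minimal sketch of a comparable script. Everything in it is an assumption made for illustration: the toy rows stand in for the real CSV files, and the feature choice, the prepare helper and the random forest model are not necessarily what the agent picked.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for titanic/train.csv and titanic/test.csv (made-up rows).
train = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5, 6],
        "Survived": [0, 1, 1, 0, 0, 1],
        "Pclass": [3, 1, 2, 3, 3, 1],
        "Sex": ["male", "female", "female", "male", "male", "female"],
        "Age": [22.0, 38.0, None, 35.0, 54.0, 27.0],
    }
)
test = pd.DataFrame(
    {
        "PassengerId": [7, 8, 9],
        "Pclass": [3, 1, 2],
        "Sex": ["male", "female", "male"],
        "Age": [None, 47.0, 62.0],
    }
)


def prepare(df, age_fill):
    """Build a small numeric feature table (hypothetical feature choice)."""
    features = pd.DataFrame(index=df.index)
    features["Pclass"] = df["Pclass"]
    features["IsFemale"] = (df["Sex"] == "female").astype(int)
    features["Age"] = df["Age"].fillna(age_fill)  # impute missing ages
    return features


# Compute the imputation value on the training set only.
age_median = train["Age"].median()

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(prepare(train, age_median), train["Survived"])

# Kaggle expects one row per test passenger: PassengerId and Survived.
submission = pd.DataFrame(
    {
        "PassengerId": test["PassengerId"],
        "Survived": clf.predict(prepare(test, age_median)),
    }
)
print(submission)
```

With the real files, you would load both dataframes with pd.read_csv and save the result with submission.to_csv("./output.csv", index=False).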

The test predictions that the agent output above, once submitted to Kaggle, score 0.78229, which is #2824 out of 17,360, and better than what I had painfully achieved when first trying the challenge years ago.
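To put that rank in perspective, the arithmetic below converts it into a share of the leaderboard, using only the two numbers quoted above.

```python
rank = 2824
total_submissions = 17360

# Fraction of the leaderboard at or above this rank.
top_fraction = rank / total_submissions
print(f"Top {top_fraction:.1%} of submissions")  # prints "Top 16.3% of submissions"
```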

Your results will vary, but I find it very impressive that an agent can achieve this in a few seconds.

🚀 The above is just a naive attempt at a data analyst agent: it can certainly be improved a lot to fit your use case better!
