dabstep blog #2641
Conversation
Looks great, with some small suggestions
dabstep.md
Outdated
At companies like Adyen, analysts tackle a spectrum of problems, from routine queries to complex workflows requiring creativity, precision, and iterative reasoning. Access to a capable data analysis agent that can automate simple, repetitive tasks and assist with complex ones would allow analysts to work faster, reduce mental strain, and focus on solving more impactful problems. That would be a pivotal moment for many industries that rely on data analysis and insights, such as finance.
Recent advancements in *agentic workflows* — where LLMs equipped with tools independently execute multi-step tasks — have shown tremendous promise across domains like coding, open QA, software engineering, and even Kaggle competitions. These systems aren’t just theoretical; they've been driving real-world productivity gains.
If time permits, you could add links to SWE-Bench, the automated Kaggle paper, and possibly the AI Scientist from Sakana.
The benchmark consists of two difficulty levels (a short snippet for loading the tasks follows the list):
- **Easy Level**: These tasks serve as warm-ups, helping to verify setups, integrations, and research direction. They typically require only a single structured dataset and minimal contextual knowledge. On average, humans achieve a 62% baseline on these tasks after 3+ hours of work, while a Llama 70B zero-shot prompt can exceed 90% accuracy.
- **Hard Level**: These tasks demand a more complex approach, involving multiple structured datasets and domain-specific knowledge. Unlike the easy level, they typically cannot be solved with single-shot code generation and require multiple steps of reasoning.
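As a rough sketch of how you could pull the tasks from the Hub and group them by difficulty (the dataset repo id and the `level` field name below are placeholders, not confirmed by this post; check the benchmark's Hub page for the exact values):

```python
from datasets import load_dataset

# Repo id and field names are illustrative placeholders --
# see the benchmark's page on the Hugging Face Hub for the exact ones.
ds = load_dataset("adyen/DABstep")
tasks = next(iter(ds.values()))  # take the first available split

easy_tasks = [row for row in tasks if row.get("level") == "easy"]
hard_tasks = [row for row in tasks if row.get("level") == "hard"]
print(f"{len(easy_tasks)} easy tasks, {len(hard_tasks)} hard tasks")
```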
Is the human baseline for this known?
Don't think so.
dabstep.md
Outdated
A few quick notes on how we hope to encourage generalization with the benchmark.
**Symbolic Reasoning:** In the spirit of GSM-Symbolic, the tasks have been exploded in cardinality by permuting time ranges, merchant names, and other parameters. The rationale is to remove the chance of “lucky guesses” and to validate that the core reasoning is repeatable and generalizes.
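As a rough illustration of this kind of template expansion (the template text, merchant names, and date ranges below are made up for the example, not taken from the benchmark):

```python
from itertools import product

# Hypothetical seed template, exploded into many concrete task variants
# by permuting merchant names and time ranges.
template = "What were the total fees paid by {merchant} between {start} and {end}?"
merchants = ["Merchant_A", "Merchant_B", "Merchant_C"]
time_ranges = [("2023-01-01", "2023-03-31"), ("2023-04-01", "2023-06-30")]

variants = [
    template.format(merchant=m, start=start, end=end)
    for m, (start, end) in product(merchants, time_ranges)
]
print(len(variants))  # 3 merchants x 2 time ranges = 6 variants of one seed task
```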
Add a link to GSM-Symbolic?
## Getting Started and Infra
We are mindful that doing agentic research by interacting with the benchmark requires an execution environment and involves costs. We are lowering the barrier by providing access to Hugging Face's **Inference API** and [**smolagents**](https://huggingface.co/docs/smolagents/en/index). With these tools, researchers get **1k free LLM requests daily** and access to a **secure local code execution environment**.
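For orientation, here is a minimal sketch of what a run could look like with smolagents' `CodeAgent` backed by the Inference API; the model id and the task prompt are illustrative, not prescribed by the benchmark:

```python
from smolagents import CodeAgent, HfApiModel

# Model id is illustrative; any model served via the Inference API should work.
model = HfApiModel("meta-llama/Llama-3.3-70B-Instruct")

# CodeAgent runs the model's generated Python in a local, restricted interpreter.
agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=["pandas", "numpy"],
)

# Example DABstep-style question; real tasks come with the benchmark's data files.
answer = agent.run("Load payments.csv and report the total fees paid in Q1 2023.")
print(answer)
```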
Does this refer to PRO users, or are you offering some special credits for the project? It's not clear to me what I need to do as a user/researcher to get the 1k free requests.
DS-1000 tests Python-based data analysis tasks sourced from Stack Overflow, curated to avoid memorization. However, its tasks are short and single-shot, lacking real datasets and iterative reasoning. This limits its ability to evaluate end-to-end workflows or multimodal capabilities.
**Acknowledgments:** Harm de Vries (Graidd), Arjun Guha (Northeastern University), Hanna van der Vlis (Adyen)
Maybe mention their explicit contributions?
**Tasks:** The current tasks are narrow in scope, covering mostly fraud and payment fees. This is only a subset of the real world, where many other dimensions and variables are at play. In the future, we will expand the benchmark with tasks covering approval rates (issuer refusals), authentication drop-offs, and real-time scenarios over wider time spans with seasonal components. These would test an agent's capacity to balance several variables at once and make trade-offs across multiple dimensions.
**Domains:** The benchmark currently revolves around tasks from the financial sector. However, we invite researchers and practitioners from other fields, such as healthcare, biology, insurance, and telecommunications, to contribute new subsets to the benchmark so we can evaluate performance across many domains.
Do you have a link that points to "How to add a new domain" or similar?
Co-authored-by: lewtun <[email protected]>
You can also specify `guest` or `org` for the authors in the `.md` file.