Commit 63899dd

1 parent a98b1bb commit 63899dd

7 files changed

+154
-0
lines changed

_blog.yml

Lines changed: 11 additions & 0 deletions
@@ -4905,3 +4905,14 @@
49054905
- mlx
49064906
- community
49074907
- open-source
4908+
4909+
- local: intel-deepmath
4910+
date: Nov 20, 2025
4911+
tags:
4912+
- llm
4913+
- reasoning
4914+
- agents
4915+
- math
4916+
- grpo
4917+
4918+

assets/intel-deepmath/banner.png

Binary image files added: 570 KB, 167 KB, 112 KB, 86.5 KB, 30.7 KB

intel-deepmath.md

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
1+
---
2+
title: "DeepMath: A Lightweight Math Reasoning Agent for LLMs"
3+
thumbnail: /blog/assets/intel-deepmath/banner.png
4+
authors:
5+
- user: danf
6+
guest: true
7+
org: Intel
8+
---
9+
10+
<img src="assets/intel-deepmath/deepmath-figure.jpg" style="width:600" alt="An LLM is using a calculator to answer questions." />
11+
12+
# DeepMath: A Lightweight Math Reasoning Agent for LLMs
13+
14+
*By Intel AI — Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat*
15+
16+
17+
Large language models (LLMs) have made impressive strides in reasoning tasks, yet mathematical problem-solving remains a challenge. Traditional "chain-of-thought" reasoning often produces verbose explanations and error-prone arithmetic. **DeepMath** tackles this by combining a small Python executor with a fine-tuned LLM, enabling concise, computation-driven reasoning.
18+
19+
## The Big Idea
20+
21+
DeepMath is built on **Qwen3-4B Thinking** and fine-tuned with **GRPO (Group Relative Policy Optimization)**. Instead of verbose text, the model emits **tiny Python snippets** for intermediate steps, runs them in a secure sandbox, and folds the results back into its reasoning, reducing errors and output length.
22+
23+
✅ No file I/O, no network calls, strict timeouts.
24+
25+
✅ Safe, deterministic, and auditable.
26+
27+
We evaluate DeepMath on four math datasets (**MATH500, AIME, HMMT, and HLE**) and show that:
28+
29+
- The math agent alone improves accuracy and reduces verbosity.
30+
31+
- GRPO training alone biases outputs toward brevity and correctness.
32+
33+
- Combining the agent with GRPO yields the largest gains.
34+
35+
👉 Code and evaluation scripts: <https://github.com/IntelLabs/DeepMath>
36+
👉 Model: <https://huggingface.co/Intel/deepmath-v1>
37+
38+
## Why DeepMath?
39+
40+
LLMs often struggle with numeric precision and produce unnecessarily long reasoning chains. Two opportunities stand out:
41+
42+
1. **Offload deterministic computation** to a safe executor.
43+
44+
2. **Train models to prefer concise, computation-oriented traces** over verbose text.
45+
46+
DeepMath implements both. The model learns to generate short Python snippets, which are executed in a sandbox and whose results are reintegrated into the context. GRPO fine-tuning reinforces this behavior by rewarding correctness and favoring shorter outputs.
47+
48+
## How It Works
49+
50+
- Base model: [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507).
51+
- Executor constraints: sandboxed environment, allow-list of imported modules, per-snippet timeout.
52+
- Inference: we built a math agent on top of [SmolAgents](https://github.com/huggingface/smolagents/), with vLLM as the inference engine.
53+
- Training: based on the GRPO trainer in [TRL](https://github.com/huggingface/trl), we modified TRL's vLLM client and server to generate GRPO completions using our DeepMath agent.
54+
55+
<figure>
56+
<img src="assets/intel-deepmath/trl-grpo-vllm-deepmath.png" style="width:400" alt="Changes to vLLM client and server in TRL library." />
57+
<figcaption><p>Figure 1: The vLLM client and server were modified to use the DeepMath agent in generating the candidates, while using the vLLM backend.</p></figcaption>
58+
</figure>
59+
60+
- **Agent Interface:** During inference, the model can output normal tokens or special agent calls containing Python snippets.
61+
62+
- **Execution:** Snippets run in a sandboxed environment with strict safety constraints (no file I/O, no network, timeouts).
63+
64+
- **Design Goals:**
65+
66+
- **Concision:** Replace multi-line textual calculations with short, focused snippets.
67+
68+
- **Determinism & Safety:** Enforce strict execution limits.
69+
70+
- **Interpretability:** Snippets are readable and auditable.
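
The executor constraints listed above (an allow-list of importable modules and a per-snippet timeout) can be illustrated with a minimal sketch. This is not the DeepMath executor: the `ALLOWED_MODULES` set, the function names, and the choice of a subprocess are all assumptions made for illustration, and a static import check plus a timeout is far from a real sandbox (for example, it does not stop `__import__` calls or file access through builtins).

```python
import ast
import subprocess
import sys

# Hypothetical allow-list; the post does not say which modules DeepMath permits.
ALLOWED_MODULES = {"math", "cmath", "fractions", "itertools"}


def check_imports(code: str) -> None:
    """Raise ValueError if the snippet imports a module outside the allow-list."""
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            continue
        for name in names:
            if name.split(".")[0] not in ALLOWED_MODULES:
                raise ValueError(f"import of '{name}' is not allowed")


def run_snippet(code: str, timeout_s: float = 5.0) -> str:
    """Run a snippet in a separate interpreter and return its stdout or an error string."""
    try:
        check_imports(code)
    except (SyntaxError, ValueError) as exc:
        return f"Error: {exc}"
    try:
        # -I starts Python in isolated mode (ignores user site-packages and env vars).
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "Error: timeout"
    if proc.returncode != 0:
        return f"Error: {proc.stderr.strip()}"
    return proc.stdout


if __name__ == "__main__":
    print(run_snippet("import math\nprint(math.factorial(5))"))
    print(run_snippet("import os\nprint(os.getcwd())"))
```

Returning errors as strings rather than raising them lets the error text be fed back into the model's context in the same way as a normal result.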
71+
72+
<figure>
73+
<img src="assets/intel-deepmath/output-example.png" style="width:700" alt="Output example: it contains a short python snippet as well as its output which is used in the reasoning process."/>
74+
<figcaption><p>Figure 2: Output example in which Python code is generated and evaluated, and the result is inserted into the trace and used as context.</p></figcaption>
75+
</figure>
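
The call/response pattern described in this section can be sketched as a small loop: generate text, look for an agent call, execute the snippet, append the output to the context, and continue. The `<code>` and `<output>` delimiters, the `solve` and `execute` helpers, and the `toy_model` stub are hypothetical; the post does not specify how agent calls are marked, and the real agent is built on SmolAgents with vLLM rather than on this loop. The in-process `exec` used here is a stand-in and is not sandboxed.

```python
import contextlib
import io
import re
from typing import Callable

# Hypothetical delimiters for agent calls and executor outputs.
CALL_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)
OUTPUT_PATTERN = re.compile(r"<output>(.*?)</output>", re.DOTALL)


def execute(snippet: str) -> str:
    """Stand-in executor: run the snippet in-process and capture stdout (NOT sandboxed)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(snippet, {})
    except Exception as exc:
        return f"Error: {exc!r}"
    return buffer.getvalue().strip()


def solve(problem: str, generate: Callable[[str], str], max_calls: int = 4) -> str:
    """Alternate between model generation and snippet execution; return the full trace."""
    context = problem
    for _ in range(max_calls):
        completion = generate(context)
        context += "\n" + completion
        match = CALL_PATTERN.search(completion)
        if match is None:
            return context  # no agent call, so the model has finished
        # Fold the executor output back into the context for the next generation step.
        context += "\n<output>" + execute(match.group(1)) + "</output>"
    return context


def toy_model(context: str) -> str:
    """Stub standing in for the LLM: one agent call, then a final answer."""
    outputs = OUTPUT_PATTERN.findall(context)
    if not outputs:
        return "I need 17 * 23. <code>print(17 * 23)</code>"
    return f"The answer is {outputs[-1]}."


if __name__ == "__main__":
    print(solve("What is 17 * 23?", toy_model))
```

The `max_calls` cap is a design choice of this sketch that bounds the number of executor round trips per problem.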
76+
77+
## Training with GRPO
78+
79+
We fine-tune the model using **GRPO**, a reward-based optimization that balances:
80+
81+
- **Accuracy Reward:** +1 for correct answers.
82+
83+
- **Using code snippets:** +1 for generating code snippets, weighted 10:1 vs. the accuracy reward.
84+
85+
- **Length reduction:** shorter lengths are encouraged by limiting the GRPO completion candidates to 5k tokens.
86+
87+
- **Temperature Scheduling:** We implemented linear temperature scheduling (T=1.2 → T=0.7) to balance exploration and stability during training. The higher initial temperature encourages exploration in the early training phases; the temperature is then lowered as the model becomes proficient at the skill.
88+
89+
- **In-context Learning**: we include 4 solved examples where the trace contains agent calls and executor outputs, so the model learns the syntax and the call/response pattern.
90+
91+
- **Dataset**: we used the tool-usage subset of the [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) dataset. Note that GRPO only uses the <u>problem</u>, not the solution in the data. Choosing this dataset ensures problems benefit from tool use.
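
To make the training signal concrete, here is a minimal sketch of three of the ingredients above: the linear temperature schedule, a weighted sum of the accuracy and code-usage rewards, and the group-relative advantage normalization that gives GRPO its name. The function names are hypothetical, and the reward weights are left as explicit parameters because the post states a 10:1 ratio without spelling out how it is applied; the values in the usage lines are illustrative only. The advantage uses the population standard deviation of the group, which is one common choice and may differ from what TRL computes.

```python
from statistics import mean, pstdev


def linear_temperature(step: int, total_steps: int,
                       t_start: float = 1.2, t_end: float = 0.7) -> float:
    """Linearly interpolate the sampling temperature from t_start to t_end."""
    frac = step / max(total_steps - 1, 1)
    frac = min(max(frac, 0.0), 1.0)  # clamp so out-of-range steps stay within bounds
    return t_start + frac * (t_end - t_start)


def combined_reward(correct: bool, used_code: bool,
                    w_accuracy: float, w_code: float) -> float:
    """Weighted sum of the two +1 rewards (the weights are assumptions, not from the post)."""
    return w_accuracy * float(correct) + w_code * float(used_code)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions sampled for the same problem; weights are illustrative only.
    group = [
        combined_reward(True, True, w_accuracy=1.0, w_code=0.1),
        combined_reward(True, False, w_accuracy=1.0, w_code=0.1),
        combined_reward(False, True, w_accuracy=1.0, w_code=0.1),
        combined_reward(False, False, w_accuracy=1.0, w_code=0.1),
    ]
    print(group_relative_advantages(group))
    print([round(linear_temperature(s, 5), 3) for s in range(5)])
```

Because advantages are computed relative to the other completions for the same problem, a completion is reinforced only when it scores above its group's mean.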
92+
93+
## Evaluation
94+
95+
We benchmarked DeepMath against baselines on four datasets. Metrics include:
96+
97+
- **majority@16** (robustness across samples).
98+
99+
- **Mean output length** (brevity).
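
As a reference for the first metric, majority@16 can be computed by sampling 16 answers per problem, taking the most frequent answer, and scoring it against the reference. The sketch below assumes answers have already been extracted and normalized to comparable strings; the function names and the tie-breaking rule (first-seen answer wins) are assumptions, not details taken from the DeepMath evaluation scripts.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def majority_at_k(sampled_answers: Sequence[Sequence[str]],
                  references: Sequence[str], k: int = 16) -> float:
    """Fraction of problems whose majority answer over k samples matches the reference."""
    hits = 0
    for answers, reference in zip(sampled_answers, references):
        if majority_vote(answers[:k]) == reference:
            hits += 1
    return hits / len(references)


if __name__ == "__main__":
    samples = [
        ["4"] * 10 + ["5"] * 6,   # majority "4", matches the reference
        ["7"] * 5 + ["8"] * 11,   # majority "8", does not match the reference
    ]
    print(majority_at_k(samples, ["4", "7"]))  # 0.5
```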
100+
101+
<img src="assets/intel-deepmath/main-results.png" style="width:800" alt="Main results table."/>
102+
103+
**Key Insight:** DeepMath reduces output length by up to **66%** while improving accuracy on challenging datasets.
104+
105+
## Why It Matters
106+
107+
- **Accuracy:** Offloading computation reduces arithmetic errors.
108+
109+
- **Efficiency:** Shorter outputs mean faster inference and easier interpretability.
110+
111+
- **Safety:** Sandbox execution mitigates risks of running arbitrary code.
112+
113+
## Conclusion
114+
115+
DeepMath demonstrates a practical and lightweight way to combine a small executor with an LLM and to train the model to prefer short, computation-driven traces. Offloading deterministic computation reduces arithmetic and numerical errors and shortens traces, and GRPO fine-tuning further encourages concise, correct answers. The result is a more accurate and more interpretable math-solving agent without requiring a massive model or heavyweight external tools.
116+
117+
## Try It Yourself
118+
119+
Check out the GitHub repo and share your feedback! Contributions welcome. 🚀
120+
121+
<https://github.com/IntelLabs/DeepMath>
122+
123+
## Citation
124+
125+
If you use DeepMath in your research, please cite:
126+
127+
```bibtex
128+
@software{deepmath2025,
129+
author = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
130+
title = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
131+
year = {2025},
132+
publisher = {Intel AI Labs},
133+
url = {https://github.com/IntelLabs/DeepMath}
134+
}
135+
```
136+
137+
## Limitations & Future Work
138+
139+
- **Scope**: we focused on a small model and on mathematical reasoning.
140+
141+
- **Generalization**: evaluated on contest-style math; results may not transfer to open-ended mathematical creativity or formal proofs.
142+
143+
- **Safety**: executing generated code is inherently risky. DeepMath uses strict sandboxing and resource limits, but any deployment should carefully manage attack surfaces and enforce rate limits.
