---
authors:
- user: danf
  guest: true
  org: Intel
- user: mber
  guest: true
  org: Intel
- user: moshew
  guest: true
  org: Intel
---

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/deepmath-figure.jpg" width="700" alt="An LLM is using a calculator to answer questions." />

# DeepMath: A Lightweight Math Reasoning Agent for LLMs

*By Intel AI — Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat*

Large language models (LLMs) have made impressive strides in reasoning tasks, yet mathematical problem-solving remains a challenge. Traditional "chain-of-thought" reasoning often produces verbose explanations and error-prone arithmetic. **DeepMath** tackles this by combining a small Python executor with a fine-tuned LLM, enabling concise, computation-driven reasoning.

## The Big Idea

DeepMath is built on **[Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)** and fine-tuned with **GRPO (Group Relative Policy Optimization)**. Instead of verbose text, the model emits **tiny Python snippets** for intermediate steps, runs them in a secure sandbox, and folds the results back into its reasoning, reducing errors and output length.

✅ No file I/O, no network calls, strict timeouts.

✅ Safe, deterministic, and auditable.
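
As a rough illustration of what such a sandbox involves, here is a minimal sketch (hypothetical helper, not Intel's actual implementation): each snippet runs in a subprocess with a stripped-down set of builtins, so there is no `open` and no `import`, and a strict timeout kills runaway computations.

```python
import multiprocessing

# Only a small whitelist of builtins is exposed to snippets: no open(),
# no __import__, so file I/O and imports fail with a NameError.
SAFE_BUILTINS = {"abs": abs, "min": min, "max": max, "sum": sum,
                 "range": range, "len": len, "pow": pow, "print": print}

def _run(snippet, queue):
    import io, contextlib
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(snippet, {"__builtins__": SAFE_BUILTINS})
        queue.put(buf.getvalue())
    except Exception as exc:
        queue.put(f"error: {exc}")

def run_snippet(snippet, timeout=2.0):
    """Execute a snippet in a subprocess and return its stdout (or an error)."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run, args=(snippet, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():  # strict timeout: terminate runaway computations
        proc.terminate()
        proc.join()
        return "error: timeout"
    return queue.get()
```

Running the snippet in a separate process means a hung or hostile computation can be terminated without affecting the host; the restricted builtins dictionary is what blocks file and network access.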

We evaluate DeepMath on four math datasets: **[MATH500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500), [AIME](https://huggingface.co/datasets/opencompass/AIME2025), [HMMT](https://huggingface.co/datasets/MathArena/hmmt_feb_2025), and [HLE](https://huggingface.co/datasets/cais/hle)**, and show that:

- The math agent alone improves accuracy and reduces verbosity.

- GRPO training alone biases outputs toward brevity and correctness.

- Combining the agent with GRPO yields the largest gains.

👉 Code and evaluation scripts: <https://github.com/IntelLabs/DeepMath>

- Inference: a math agent built on [SmolAgents](https://github.com/huggingface/smolagents/), with vLLM as the inference engine.

- Training: building on the GRPO trainer in [TRL](https://github.com/huggingface/trl), we modified TRL's vLLM client and server so that GRPO completions are generated by our DeepMath agent.

<p align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/trl-grpo-vllm-deepmath.png" width="600" alt="Changes to the vLLM client and server in the TRL library." /><br>
<em>Figure 1: The vLLM client and server were modified to use the DeepMath agent for generating candidates while still using the vLLM backend.</em>
</p>

- **Agent Interface:** During inference, the model can output normal tokens or special agent calls containing Python snippets.
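
As a concrete, hypothetical illustration of this interface, suppose agent calls are delimited by `<code>…</code>` markers: the runtime evaluates each snippet and splices the result back into the trace as an `<output>…</output>` block. The marker format and helper names below are assumptions for illustration, not DeepMath's actual special tokens.

```python
import re

# Hypothetical marker format; DeepMath's actual special tokens may differ.
CALL = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def eval_snippet(snippet: str) -> str:
    # Stand-in evaluator; the real system runs snippets in a secure sandbox.
    return str(eval(snippet, {"__builtins__": {}}))

def fold_results(trace: str) -> str:
    """Append an <output>...</output> block after every agent call in the trace."""
    return CALL.sub(
        lambda m: f"{m.group(0)}<output>{eval_snippet(m.group(1))}</output>",
        trace,
    )
```

Folding the computed value back into the trace is what lets the model continue reasoning with an exact intermediate result instead of error-prone mental arithmetic.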

- **Interpretability:** Snippets are readable and auditable.

<p align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/output-example.png" width="800" alt="Output example: a short Python snippet together with its output, which is used in the reasoning process." /><br>
<em>Figure 2: Output example where Python code is generated and evaluated, and the answer is inserted into the trace and used as context.</em>
</p>

## Training with GRPO
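
For context, GRPO (Group Relative Policy Optimization) samples a group of completions per prompt, scores each with a reward, and normalizes every reward against its own group's statistics. A minimal sketch of this group-relative advantage (the function name is illustrative, not TRL's API):

```python
import statistics

def group_relative_advantages(rewards):
    """A_i = (r_i - mean(group)) / std(group): each completion is scored
    relative to its own sampling group rather than an absolute baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # all completions scored the same: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

Because the baseline is the group mean, completions only need to beat their siblings to get a positive advantage, which is what biases training toward shorter, more often correct outputs.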