---
title: "DeepMath: A Lightweight Math Reasoning Agent for LLMs"
thumbnail: /blog/assets/intel-deepmath/banner.png
authors:
- user: danf
  guest: true
  org: Intel
---

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/deepmath-figure.jpg" style="width:600px" alt="An LLM is using a calculator to answer questions." />

# DeepMath: A Lightweight Math Reasoning Agent for LLMs

*By Intel AI — Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat*

Large language models (LLMs) have made impressive strides in reasoning tasks, yet mathematical problem-solving remains a challenge. Traditional "chain-of-thought" reasoning often produces verbose explanations and error-prone arithmetic. **DeepMath** tackles this by combining a small Python executor with a fine-tuned LLM, enabling concise, computation-driven reasoning.

## The Big Idea

DeepMath is built on **Qwen3-4B Thinking** and fine-tuned with **GRPO (Group Relative Policy Optimization)**. Instead of verbose text, the model emits **tiny Python snippets** for intermediate steps, runs them in a secure sandbox, and folds the results back into its reasoning, reducing errors and output length.

✅ No file I/O, no network calls, strict timeouts.

✅ Safe, deterministic, and auditable.
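
For intuition, here is the kind of snippet the model emits in place of several lines of error-prone textual arithmetic (a hypothetical example; Figure 2 below shows a real trace):

```python
# Instead of multiplying and adding step by step in prose, the model emits a
# short snippet and reads the printed value back into its reasoning.
from math import comb

# e.g., "choose 3 of 10 people, then order the chosen 3": C(10, 3) * 3!
print(comb(10, 3) * 6)  # -> 720
```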

We evaluate DeepMath on four math datasets: **MATH500, AIME, HMMT, and HLE**, and show that:

- The math agent alone improves accuracy and reduces verbosity.

- GRPO training alone biases outputs toward brevity and correctness.

- Combining the agent with GRPO yields the largest gains.

👉 Code and evaluation scripts: <https://github.com/IntelLabs/DeepMath>
👉 Model: <https://huggingface.co/Intel/deepmath-v1>

## Why DeepMath?

LLMs often struggle with numeric precision and produce unnecessarily long reasoning chains. Two opportunities stand out:

1. **Offload deterministic computation** to a safe executor.

2. **Train models to prefer concise, computation-oriented traces** over verbose text.

DeepMath implements both. The model learns to generate short Python snippets, which are executed in a sandbox and reintegrated into the context. GRPO fine-tuning encourages this behavior by rewarding correctness and encouraging shorter outputs.
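
Conceptually, the inference loop alternates between generation and execution, as in the following sketch (the tag format and helper names are illustrative, not DeepMath's exact implementation):

```python
import re

# Illustrative tag format for agent calls; the real trace format is shown in Figure 2.
SNIPPET_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def agent_loop(model_generate, run_sandboxed, prompt, max_turns=8):
    """Alternate between model generation and snippet execution.

    model_generate: continues the context, stopping after an agent call
                    or at the final answer (assumed helper).
    run_sandboxed:  executes one snippet and returns its stdout (assumed helper).
    """
    context = prompt
    for _ in range(max_turns):
        chunk = model_generate(context)
        context += chunk
        match = SNIPPET_RE.search(chunk)
        if match is None:  # no agent call: the model has produced its answer
            return context
        # Fold the executor output back into the context so the next
        # generation step can condition on the computed value.
        context += f"\n<output>{run_sandboxed(match.group(1))}</output>\n"
    return context
```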

## How It Works

- Base model: [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507).
- Executor constraints: sandboxed environment, allow-list of imported modules, per-snippet timeout.
- Inference: a math agent built on [SmolAgents](https://github.com/huggingface/smolagents/), with vLLM as the inference engine.
- Training: building on the GRPO trainer in [TRL](https://github.com/huggingface/trl), we modified TRL's vLLM client and server to generate GRPO completions using our DeepMath agent.

<figure>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/trl-grpo-vllm-deepmath.png" style="width:400px" alt="Changes to the vLLM client and server in the TRL library." />
<figcaption><p>Figure 1: The vLLM client and server were modified so that candidates are generated through the DeepMath agent, while keeping the vLLM backend.</p></figcaption>
</figure>
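
As a rough illustration of the inference-side setup, a SmolAgents `CodeAgent` can be wired to the model as follows (a sketch; the actual DeepMath agent and its vLLM integration live in the repository, and the import allow-list below is an assumed configuration):

```python
from smolagents import CodeAgent, TransformersModel

# Load the fine-tuned model (SmolAgents also supports OpenAI-compatible
# endpoints, which is one way to point the agent at a vLLM server instead).
model = TransformersModel(model_id="Intel/deepmath-v1")

# CodeAgent executes the model's Python snippets; additional_authorized_imports
# acts as the module allow-list.
agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=["math", "sympy", "numpy"],
)

print(agent.run("What is the sum of the first 100 positive odd integers?"))
# expected: 10000
```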

- **Agent Interface:** During inference, the model can output normal tokens or special agent calls containing Python snippets.

- **Execution:** Snippets run in a sandboxed environment with strict safety constraints (no file I/O, no network, timeouts).

- **Design Goals:**

  - **Concision:** Replace multi-line textual calculations with short, focused snippets.

  - **Determinism & Safety:** Enforce strict execution limits.

  - **Interpretability:** Snippets are readable and auditable.
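
A stripped-down executor with a hard timeout might look like the sketch below; note that a fresh subprocess alone does not block file or network access, so the real sandbox layers OS-level isolation on top:

```python
import subprocess
import sys

def run_sandboxed(snippet: str, timeout_s: float = 2.0) -> str:
    """Run one snippet in a fresh, isolated interpreter with a hard timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", snippet],  # -I: isolated mode, no site/env
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "ExecutionError: timeout"
    if proc.returncode != 0:
        return "ExecutionError: " + (proc.stderr.strip().splitlines() or ["unknown"])[-1]
    return proc.stdout.strip()

print(run_sandboxed("print(2**10)"))      # -> 1024
print(run_sandboxed("while True: pass"))  # -> ExecutionError: timeout
```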

<figure>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/output-example.png" style="width:700px" alt="Output example: it contains a short Python snippet as well as its output, which is used in the reasoning process."/>
<figcaption><p>Figure 2: An output example where Python code is generated and executed, and the result is inserted back into the trace as context.</p></figcaption>
</figure>

## Training with GRPO

We fine-tune the model using **GRPO**, a reward-based optimization that balances the following (a sketch of the rewards and schedule follows the list):

- **Accuracy Reward:** +1 for a correct answer.

- **Code-Usage Reward:** +1 for generating code snippets, weighted 10:1 relative to the accuracy reward.

- **Length Reduction:** shorter outputs are encouraged by capping GRPO completion candidates at 5k tokens.

- **Temperature Scheduling:** a linear schedule (T=1.2 → T=0.7) balances exploration and stability: the higher temperature encourages exploration in the early phases of training, and the temperature is lowered as the model's use of the agent stabilizes.

- **In-context Learning:** we include 4 solved examples whose traces contain agent calls and executor outputs, so the model learns the syntax and the call/response pattern.

- **Dataset:** we used the tool-usage subset of the [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) dataset. Note that GRPO uses only the <u>problem</u>, not the solution in the data. Choosing this dataset ensures the problems benefit from tool use.
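
In TRL's GRPO trainer, rewards are plain Python callables that score a batch of completions; the sketch below shows the shape of the two rewards and the temperature schedule (the boxed-answer format, the `<code>` tag, and the `answer` dataset column are assumptions for illustration, and the weights mirror the 10:1 ratio described above):

```python
import re

# Per the recipe above: the code-usage reward is weighted 10:1 relative to
# the accuracy reward.
CODE_WEIGHT = 10.0
ACCURACY_WEIGHT = 1.0

def accuracy_reward(completions, answer, **kwargs):
    """+1 (weighted) per completion whose boxed answer matches the reference."""
    rewards = []
    for completion, ref in zip(completions, answer):
        match = re.search(r"\\boxed\{(.+?)\}", completion)  # assumed answer format
        correct = match is not None and match.group(1).strip() == str(ref)
        rewards.append(ACCURACY_WEIGHT if correct else 0.0)
    return rewards

def code_usage_reward(completions, **kwargs):
    """+1 (weighted) per completion containing at least one agent code call."""
    return [CODE_WEIGHT if "<code>" in c else 0.0 for c in completions]  # assumed tag

def temperature(step, total_steps, t_start=1.2, t_end=0.7):
    """Linear temperature schedule: exploratory early, stable late."""
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + frac * (t_end - t_start)
```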

## Evaluation

We benchmarked DeepMath against baselines on four datasets. Metrics include:

- **majority@16** (robustness across samples; see the sketch below).

- **Mean output length** (brevity).
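
For reference, majority@16 takes the most common final answer across 16 sampled traces, roughly as follows (a sketch; the repository's evaluation scripts are authoritative):

```python
from collections import Counter

def majority_at_k(final_answers):
    """Most frequent answer among k sampled traces (ties broken arbitrarily)."""
    return Counter(final_answers).most_common(1)[0][0]

# Usage: final answers extracted from 16 independently sampled traces.
print(majority_at_k(["42", "42", "41", "42"]))  # -> "42"
```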

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/main-results.png" style="width:800px" alt="Main results table."/>

**Key Insight:** DeepMath reduces output length by up to **66%** while improving accuracy on challenging datasets.

## Why It Matters

- **Accuracy:** Offloading computation reduces arithmetic errors.

- **Efficiency:** Shorter outputs mean faster inference and easier interpretability.

- **Safety:** Sandboxed execution mitigates the risks of running arbitrary code.

## Conclusion

DeepMath demonstrates a practical and lightweight way to combine a small executor with an LLM and to train the model to prefer short, computation-driven traces. Offloading deterministic computation reduces arithmetic and numerical errors and shortens traces, and GRPO fine-tuning further encourages concise, correct answers. The result is a more accurate and more interpretable math-solving agent without requiring a massive model or heavyweight external tools.

## Try It Yourself

Check out the GitHub repo and share your feedback! Contributions welcome. 🚀

<https://github.com/IntelLabs/DeepMath>

## Citation

If you use DeepMath in your research, please cite:

```bibtex
@software{deepmath2025,
  author = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
  year = {2025},
  publisher = {Intel AI Labs},
  url = {https://github.com/IntelLabs/DeepMath}
}
```

## Limitations & Future Work

- **Scope:** we focused on a small model and on mathematical reasoning.

- **Generalization:** evaluated on contest-style math; results may not transfer to open-ended mathematical creativity or formal proofs.

- **Safety:** executing generated code is inherently risky. DeepMath uses strict sandboxing and resource limits, but any deployment should carefully manage attack surfaces and enforce rate limits.