Commit 4d2d301

1 parent a98b1bb commit 4d2d301

15 files changed: +439 −9 lines changed

1_58_llm_extreme_quantization.md

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ model = AutoModelForCausalLM.from_pretrained(
 )
 tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

-input_text = "Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:"
+input_text = "Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:"

 input_ids = tokenizer.encode(input_text, return_tensors="pt").cuda()
 output = model.generate(input_ids, max_new_tokens=10)

_blog.yml

Lines changed: 27 additions & 0 deletions
@@ -4905,3 +4905,30 @@
   - mlx
   - community
   - open-source
+
+- local: open-asr-leaderboard
+  date: Nov 21, 2025
+  tags:
+  - audio
+  - speech
+  - leaderboard
+
+- local: rapidfireai
+  date: Nov 21, 2025
+  tags:
+  - llm
+  - experimentation
+  - fine-tuning
+  - post-training
+  - trl
+  - rapidfireai
+
+- local: intel-deepmath
+  date: Nov 25, 2025
+  tags:
+  - llm
+  - reasoning
+  - agents
+  - math
+  - grpo
+

anylanguagemodel.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 title: "Introducing AnyLanguageModel: One API for Local and Remote LLMs on Apple Platforms"
-thumbnail: /assets/anylanguagemodel/banner.png
+thumbnail: /blog/assets/anylanguagemodel/banner.png
 authors:
 - user: mattt
   guest: true

assets/intel-deepmath/banner.png

570 KB
112 KB

assets/rapidfireai/thumbnail.png

450 KB

fine-tune-wav2vec2-english.md

Lines changed: 1 addition & 1 deletion
@@ -627,7 +627,7 @@ class DataCollatorCTCWithPadding:
     Data collator that will dynamically pad the inputs received.
     Args:
         processor (:class:`~transformers.Wav2Vec2Processor`)
-            The processor used for proccessing the data.
+            The processor used for processing the data.
         padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among:

fine-tune-xlsr-wav2vec2.md

Lines changed: 1 addition & 1 deletion
@@ -786,7 +786,7 @@ class DataCollatorCTCWithPadding:
     Data collator that will dynamically pad the inputs received.
     Args:
         processor (:class:`~transformers.Wav2Vec2Processor`)
-            The processor used for proccessing the data.
+            The processor used for processing the data.
         padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
             Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
             among:

intel-deepmath.md

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
---
title: "DeepMath: A Lightweight Math Reasoning Agent for LLMs"
thumbnail: /blog/assets/intel-deepmath/banner.png
authors:
- user: danf
  guest: true
  org: Intel
---

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/deepmath-figure.jpg" style="width:600px" alt="An LLM is using a calculator to answer questions." />

# DeepMath: A Lightweight Math Reasoning Agent for LLMs

*By Intel AI — Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat*

Large language models (LLMs) have made impressive strides in reasoning tasks, yet mathematical problem-solving remains a challenge. Traditional "chain-of-thought" reasoning often produces verbose explanations and error-prone arithmetic. **DeepMath** tackles this by combining a small Python executor with a fine-tuned LLM, enabling concise, computation-driven reasoning.

## The Big Idea

DeepMath is built on **Qwen3-4B Thinking** and fine-tuned with **GRPO (Group Relative Policy Optimization)**. Instead of verbose text, the model emits **tiny Python snippets** for intermediate steps, runs them in a secure sandbox, and folds the results back into its reasoning, reducing errors and output length.

✅ No file I/O, no network calls, strict timeouts.

✅ Safe, deterministic, and auditable.
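
To make these constraints concrete, here is a minimal sketch of what such a sandboxed snippet runner could look like. This is an illustration only, not the DeepMath executor; the allow-list, timeout value, and helper names are assumptions.

```python
# Illustrative sandboxed snippet executor: restricted builtins, an import
# allow-list, and a hard per-snippet timeout. Not the actual DeepMath code.
import multiprocessing

ALLOWED_MODULES = {"math", "fractions", "itertools"}  # hypothetical allow-list


def _safe_import(name, *args, **kwargs):
    if name.split(".")[0] not in ALLOWED_MODULES:
        raise ImportError(f"module '{name}' is not allowed")
    return __import__(name, *args, **kwargs)


def _worker(snippet: str, queue: multiprocessing.Queue) -> None:
    # Expose only a tiny set of builtins: no open(), no network helpers.
    env = {"__builtins__": {"__import__": _safe_import, "abs": abs, "min": min,
                            "max": max, "sum": sum, "range": range, "len": len,
                            "print": print}}
    try:
        exec(snippet, env)
        queue.put(("ok", env.get("result")))
    except Exception as exc:  # surface errors back to the agent loop
        queue.put(("error", repr(exc)))


def run_snippet(snippet: str, timeout: float = 2.0):
    """Run a model-generated snippet in a subprocess with a hard timeout."""
    queue: multiprocessing.Queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(snippet, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        return ("error", "timeout")
    return queue.get() if not queue.empty() else ("error", "no output")


if __name__ == "__main__":
    print(run_snippet("import math\nresult = math.comb(10, 3)"))  # ('ok', 120)
```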

We evaluate DeepMath on four math datasets: **MATH500, AIME, HMMT, and HLE**, and show that:

- The math agent alone improves accuracy and reduces verbosity.

- GRPO training alone biases outputs toward brevity and correctness.

- Combining the agent with GRPO yields the largest gains.

👉 Code and evaluation scripts: <https://github.com/IntelLabs/DeepMath>
👉 Model: <https://huggingface.co/Intel/deepmath-v1>

## Why DeepMath?

LLMs often struggle with numeric precision and produce unnecessarily long reasoning chains. Two opportunities stand out:

1. **Offload deterministic computation** to a safe executor.

2. **Train models to prefer concise, computation-oriented traces** over verbose text.

DeepMath implements both. The model learns to generate short Python snippets, which are executed in a sandbox and reintegrated into the context. GRPO fine-tuning encourages this behavior by rewarding correctness and encouraging shorter outputs.

## How It Works

- Base model: [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507).

- Executor constraints: sandboxed environment, allow-list of imported modules, per-snippet timeout.

- Inference: a math agent built on [SmolAgents](https://github.com/huggingface/smolagents/), with vLLM as the inference engine (a wiring sketch appears after Figure 2).

- Training: based on the GRPO trainer in [TRL](https://github.com/huggingface/trl), we modified TRL's vLLM client and server to generate GRPO completions using our DeepMath agent.

<figure>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/trl-grpo-vllm-deepmath.png" style="width:400px" alt="Changes to vLLM client and server in TRL library." />
<figcaption><p>Figure 1: The vLLM client and server were modified so that candidates are generated by the DeepMath agent while still using the vLLM backend.</p></figcaption>
</figure>

- **Agent Interface:** During inference, the model can output normal tokens or special agent calls containing Python snippets.

- **Execution:** Snippets run in a sandboxed environment with strict safety constraints (no file I/O, no network, timeouts).

- **Design Goals:**

  - **Concision:** Replace multi-line textual calculations with short, focused snippets.

  - **Determinism & Safety:** Enforce strict execution limits.

  - **Interpretability:** Snippets are readable and auditable.

<figure>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/output-example.png" style="width:700px" alt="Output example: it contains a short Python snippet as well as its output, which is used in the reasoning process."/>
<figcaption><p>Figure 2: Output example where Python code is generated and evaluated, and the result is inserted into the trace and used as context.</p></figcaption>
</figure>
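
The post wires this agent into SmolAgents and serves the model with vLLM, but does not show the setup code. Purely as a hedged illustration, a SmolAgents `CodeAgent` pointed at a vLLM OpenAI-compatible server might be assembled roughly like this; the endpoint, served model path, authorized imports, and step cap are assumptions, not the actual DeepMath configuration.

```python
# Hedged sketch: a CodeAgent backed by a vLLM OpenAI-compatible server.
# All concrete values below are illustrative assumptions.
from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(
    model_id="Intel/deepmath-v1",            # e.g. served via `vllm serve Intel/deepmath-v1`
    api_base="http://localhost:8000/v1",     # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",
)

agent = CodeAgent(
    tools=[],                                # the Python executor itself acts as the tool
    model=model,
    additional_authorized_imports=["math", "sympy", "fractions"],
    max_steps=8,                             # cap the number of snippet/execution rounds
)

answer = agent.run(
    "Find the remainder when 7**2024 is divided by 1000. "
    "Use short Python snippets for intermediate computations."
)
print(answer)
```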

## Training with GRPO

We fine-tune the model using **GRPO**, a reward-based optimization that balances the following signals (a sketch of a comparable TRL setup follows this list):

- **Accuracy Reward:** +1 for correct answers.

- **Using code snippets:** +1 for generating code snippets, weighted 10:1 vs. the accuracy reward.

- **Length reduction:** shorter outputs are encouraged by limiting the GRPO completion candidates to 5k tokens.

- **Temperature Scheduling:** We implemented linear temperature scheduling (T=1.2 → T=0.7) to balance exploration and stability during training: a higher temperature encourages exploration in the early phases, and it is gradually lowered as the model becomes proficient at the skill.

- **In-context Learning**: we include 4 solved examples where the trace contains agent calls and executor outputs, so the model learns the syntax and the call/response pattern.

- **Dataset**: we used the tool-usage subset of the [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) dataset. Note that GRPO only uses the <u>problem</u>, not the solution in the data. Choosing this dataset ensures the problems benefit from tool use.
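
As a rough illustration of how these signals could map onto TRL's GRPO trainer, here is a hedged sketch. It deliberately omits the modified vLLM client/server that routes generation through the DeepMath agent, and the split name, column names, reward heuristics, and the reading of the 10:1 weighting are assumptions rather than the actual training configuration.

```python
# Hedged GRPO sketch with TRL: two reward functions mirroring the signals above.
# Dataset split/column names and reward weights are illustrative assumptions.
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("nvidia/OpenMathReasoning", split="tir")  # tool-usage subset (split name assumed)
dataset = dataset.rename_column("problem", "prompt")             # GRPOTrainer expects a "prompt" column

CODE_FENCE = "`" * 3  # built this way to avoid a literal fence inside this example


def accuracy_reward(completions, expected_answer, **kwargs):
    # +1 when the last number in the completion matches the reference (naive check).
    rewards = []
    for completion, ref in zip(completions, expected_answer):
        nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
        rewards.append(1.0 if nums and nums[-1] == str(ref) else 0.0)
    return rewards


def code_use_reward(completions, **kwargs):
    # +1 when the trace contains at least one fenced code snippet.
    return [1.0 if CODE_FENCE in c else 0.0 for c in completions]


args = GRPOConfig(
    output_dir="deepmath-grpo",
    max_completion_length=5000,   # cap candidates at ~5k tokens to encourage brevity
    num_generations=8,            # group size per prompt (assumed)
    temperature=1.2,              # the post lowers this linearly towards 0.7 over training
    reward_weights=[10.0, 1.0],   # one reading of the 10:1 accuracy-vs-code-use weighting
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B-Thinking-2507",
    reward_funcs=[accuracy_reward, code_use_reward],
    args=args,
    train_dataset=dataset,
)
trainer.train()
```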

## Evaluation

We benchmarked DeepMath against baselines on four datasets. Metrics include:

- **majority@16** (robustness across samples; sketched below).

- **Mean output length** (brevity).
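
For concreteness, majority@k scores a problem as solved when the most frequent of k sampled answers matches the reference. A minimal sketch (not the repository's evaluation harness):

```python
# Illustrative majority@k scorer; the real evaluation scripts live in the repo.
from collections import Counter


def majority_at_k(sampled_answers: list[str], reference: str) -> bool:
    """sampled_answers holds k model answers for one problem (k=16 in the post)."""
    most_common_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return most_common_answer == reference


# Example: 16 samples, 9 of which agree on the correct value.
samples = ["120"] * 9 + ["118"] * 4 + ["121"] * 3
print(majority_at_k(samples, "120"))  # True
```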

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/main-results.png" style="width:800px" alt="Main results table."/>

**Key Insight:** DeepMath reduces output length by up to **66%** while improving accuracy on challenging datasets.

## Why It Matters

- **Accuracy:** Offloading computation reduces arithmetic errors.

- **Efficiency:** Shorter outputs mean faster inference and easier interpretability.

- **Safety:** Sandbox execution mitigates the risks of running arbitrary code.

## Conclusion

DeepMath demonstrates a practical and lightweight way to combine a small executor with an LLM and to train the model to prefer short, computation-driven traces. Offloading deterministic computation reduces arithmetic and numerical errors and shortens traces, and GRPO fine-tuning further encourages concise, correct answers. The result is a more accurate and more interpretable math-solving agent without requiring a massive model or heavyweight external tools.

## Try It Yourself

Check out the GitHub repo at <https://github.com/intel/DeepMath> and share your feedback! Contributions welcome. 🚀

## Citation

If you use DeepMath in your research, please cite:

```bibtex
@software{deepmath2025,
  author = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
  year = {2025},
  publisher = {Intel AI Labs},
  url = {https://github.com/IntelLabs/DeepMath}
}
```

## Limitations & Future Work

- **Scope**: we focused on a small model and on mathematical reasoning.

- **Generalization**: evaluated on contest-style math; results may not transfer to open-ended mathematical creativity or formal proofs.

- **Safety**: executing generated code is inherently risky. DeepMath uses strict sandboxing and resource limits, but any deployment should carefully manage attack surfaces and enforce rate limits.

mi300kernels.md

Lines changed: 1 addition & 1 deletion
@@ -311,7 +311,7 @@ Why does this matter to us? Well, we have seen that loading data from VRAM has a

 ![The arithmetic intensity of two GEMMs](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/mi300kernels/gemm_AI.png)

-In a skinny GEMM the number of rows of the output tile is limited and so is the the arithmetic intensity. Already this means that we are going to need to load a lot of data to compute an output tile. Furthermore, since we are using FP8 arithmetic, computation is quite fast, so we cannot rely on computation time to hide the latency of data loading. All in all, it would be ideal to have more threads in charge of loading data than threads in charge of computing the result.
+In a skinny GEMM the number of rows of the output tile is limited and so is the arithmetic intensity. Already this means that we are going to need to load a lot of data to compute an output tile. Furthermore, since we are using FP8 arithmetic, computation is quite fast, so we cannot rely on computation time to hide the latency of data loading. All in all, it would be ideal to have more threads in charge of loading data than threads in charge of computing the result.

 To achieve this, we are going to use a technique called **warp specialization**. Instead of having all warps in the thread block execute the same instructions, we are going to dedicate some warps to loading data only and some to computing the results only. The warps in charge of loading data are called **producers** and the ones that compute the results are named **consumers**. Producers and consumers work asynchronously: producers first load data from the VRAM, which is slow, and make it available to the consumers by storing it in a shared memory buffer. Until data is available in shared memory, the consumer is idle. After the data is made available, the consumer loads it from shared memory, which is fast, and computes the result.
 Coordination of producers and consumers is achieved through a queue stored in shared memory. When a producer finishes storing data in a shared memory buffer \\( i \\), it changes the state of the \\( i \\)th variable of the queue to signal data is available there. The consumer is watching out for this, and begins loading data afterwards. When it is done, it changes the \\( i \\)th variable of the queue to signal that data can be written over in buffer \\( i \\).
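
Not GPU code, but as a CPU-threads analogy of the producer/consumer handshake described in the quoted passage, per-slot flags can play the role of the shared-memory queue variables; the buffer count and workload below are arbitrary illustrations.

```python
# CPU-threads analogy (not GPU code) of the producer/consumer queue described above:
# per-slot flags stand in for the shared-memory queue variables.
import threading

NUM_BUFFERS = 4
buffers = [None] * NUM_BUFFERS                                    # stand-in for shared-memory tiles
slot_ready = [threading.Event() for _ in range(NUM_BUFFERS)]      # "data available in slot i"
slot_free = [threading.Event() for _ in range(NUM_BUFFERS)]       # "slot i may be overwritten"
for event in slot_free:
    event.set()                                                   # all slots start empty


def producer(num_tiles: int) -> None:
    for t in range(num_tiles):
        i = t % NUM_BUFFERS
        slot_free[i].wait()          # wait until the consumer has drained slot i
        slot_free[i].clear()
        buffers[i] = [t] * 8         # "slow" load from VRAM into the shared buffer
        slot_ready[i].set()          # signal: data available in slot i


def consumer(num_tiles: int, results: list) -> None:
    for t in range(num_tiles):
        i = t % NUM_BUFFERS
        slot_ready[i].wait()         # idle until the producer fills slot i
        slot_ready[i].clear()
        results.append(sum(buffers[i]))  # "fast" read from the buffer + compute
        slot_free[i].set()           # signal: slot i can be written over


results: list = []
p = threading.Thread(target=producer, args=(16,))
c = threading.Thread(target=consumer, args=(16, results))
p.start(); c.start(); p.join(); c.join()
print(results)  # partial results computed from each tile, in order
```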
