
Commit 96c296d

Add Intel DeepMath blog (#3180)
* DeepMath blog. Code: https://github.com/intellabs/DeepMath, Model: https://huggingface.co/Intel/deepmath-v1
* Authors, centering, links to models/datasets
* Banner, review fixes
* Reviewing the paper
* Review session
1 parent 908d6bd commit 96c296d

File tree: 3 files changed (+164, -0 lines)


_blog.yml

Lines changed: 10 additions & 0 deletions
```diff
@@ -4964,10 +4964,20 @@
     - Gemini
     - agents
 
+- local: intel-deepmath
+  date: Dec 4, 2025
+  tags:
+    - llm
+    - reasoning
+    - agents
+    - math
+    - grpo
+
 - local: swift-huggingface
   date: Dec 5, 2025
   tags:
     - swift
     - hub
     - open-source
     - community
+
```

assets/intel-deepmath/banner.png

184 KB

intel-deepmath.md

Lines changed: 154 additions & 0 deletions
---
title: "DeepMath: A lightweight math reasoning Agent with SmolAgents"
thumbnail: /blog/assets/intel-deepmath/banner.png
authors:
- user: danf
  guest: true
  org: Intel
- user: mber
  guest: true
  org: Intel
- user: moshew
  guest: true
  org: Intel
---

<p align="center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/deepmath-figure.jpg" width=700 alt="An LLM is using a calculator to answer questions." />
</p>

# DeepMath: A lightweight math reasoning Agent with SmolAgents

*By Intel AI Software Group*

[DeepMath](https://huggingface.co/Intel/deepmath-v1) is an aligned math reasoning agent built on **[Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)** and fine-tuned with **GRPO (Group Relative Policy Optimization)**. Instead of verbose text, the model emits **tiny Python snippets** for intermediate steps, runs them in a secure sandbox, and folds the results back into its reasoning, reducing errors and output length. The agent is implemented using the **[smolagents library](https://github.com/huggingface/smolagents)**.

We evaluate DeepMath on four math datasets, **[MATH500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500), [AIME](https://huggingface.co/datasets/opencompass/AIME2025), [HMMT](https://huggingface.co/datasets/MathArena/hmmt_feb_2025), and [HLE](https://huggingface.co/datasets/cais/hle)**, and show that:

- 🤖 The math agent alone reduces output lengths by up to 66%, while often improving accuracy.

- ⚡ GRPO training improves the agent's performance even further on almost all benchmarks.

👉 Code and evaluation scripts: <https://github.com/IntelLabs/DeepMath> \
👉 Model: <https://huggingface.co/Intel/deepmath-v1>

## Why DeepMath?

Large language models (LLMs) have advanced reasoning capabilities, but mathematical problem-solving remains challenging: chain-of-thought traces can be lengthy and prone to arithmetic mistakes. Recent works[^1][^2] demonstrate that small models can reach strong performance, and other studies[^3] investigate tool use to improve reliability. What those papers generally do not emphasize is reducing trace verbosity or explicitly training models to prefer short, computation-oriented traces executed in a constrained, auditable environment.

We focused on two goals:

1. **Offload deterministic computation** to a safe executor.

2. **Train models to prefer concise, computation-oriented traces** over verbose text.

**DeepMath** tackles this by combining a small Python executor with a fine-tuned LLM, enabling concise, computation-driven reasoning. The model learns to generate short Python snippets, which are executed in a sandbox and reintegrated into the context. GRPO fine-tuning reinforces this behavior by rewarding correctness and favoring shorter outputs.

## How It Works

- Base model: [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507).
- Executor constraints: sandboxed environment, allow-list of imported modules, per-snippet timeout.
- Inference: a math agent built on [smolagents](https://github.com/huggingface/smolagents/), with [vLLM](https://github.com/vllm-project/vLLM) as the inference engine (a minimal sketch of such a setup follows Figure 1 below).
- Training: based on the GRPO trainer in [TRL](https://github.com/huggingface/trl), we modified TRL's vLLM client and server to generate GRPO completions using our DeepMath agent.

<p align="center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/trl-grpo-vllm-deepmath.png" width=600 alt="Changes to vLLM client and server in TRL library." /><br>
  <em>Figure 1: The vLLM client and server were modified so that GRPO candidates are generated by the DeepMath agent, while still using the vLLM backend.</em>
</p>
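
To make the inference setup concrete, here is a minimal sketch of how a smolagents `CodeAgent` could be pointed at the DeepMath model served by vLLM. The endpoint URL, the authorized-import list, and the step limit are illustrative assumptions, not the exact DeepMath configuration.

```python
# Minimal sketch (not the exact DeepMath agent loop): serve the model with vLLM,
# e.g. `vllm serve Intel/deepmath-v1`, then wrap it in a smolagents CodeAgent
# that may only import an allow-listed set of modules.
from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(
    model_id="Intel/deepmath-v1",
    api_base="http://localhost:8000/v1",  # assumed local vLLM OpenAI-compatible endpoint
    api_key="not-needed",
)

agent = CodeAgent(
    tools=[],                                         # no external tools, only Python snippets
    model=model,
    additional_authorized_imports=["math", "sympy"],  # illustrative allow-list
    max_steps=8,                                      # cap on snippet/observation rounds
)

answer = agent.run("What is the sum of all positive divisors of 360?")
print(answer)
```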

- **Agent Interface:** During inference, the model can output normal tokens or special agent calls containing Python snippets.

- **Execution:** Snippets run in a sandboxed environment with strict safety constraints (no file I/O, no network, timeouts); a toy sketch of such an executor appears after Figure 2.

- **Design Goals:**

  - **Concision:** Replace multi-line textual calculations with short, focused snippets.

  - **Determinism & Safety:** Enforce strict execution limits.

  - **Interpretability:** Snippets are readable and auditable.

<p align="center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/output-example.png" width=800 alt="Output example: it contains a short Python snippet as well as its output, which is used in the reasoning process."/><br>
  <em>Figure 2: Output example where Python code is generated, executed, and the result is inserted into the trace and used as context.</em>
</p>
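
The execution constraints described above can be pictured with a small, self-contained sketch. This is not the DeepMath executor itself, just an illustration of the idea: run each snippet in a separate process, expose only an allow-listed set of modules and a safe subset of builtins, and enforce a hard per-snippet timeout.

```python
# Illustrative sketch of a constrained snippet executor (not the DeepMath implementation).
import multiprocessing

ALLOWED_MODULES = {"math", "fractions", "itertools"}  # assumed allow-list for illustration

def _run_snippet(code: str, queue: multiprocessing.Queue) -> None:
    import importlib
    # Only a safe subset of builtins; allow-listed modules are pre-injected, so
    # snippets use e.g. math.comb(...) directly without import statements.
    safe_globals = {"__builtins__": {"range": range, "len": len, "print": print,
                                     "abs": abs, "min": min, "max": max, "sum": sum}}
    for name in ALLOWED_MODULES:
        safe_globals[name] = importlib.import_module(name)
    local_vars = {}
    try:
        exec(code, safe_globals, local_vars)           # run the model-generated snippet
        queue.put(("ok", local_vars.get("result")))    # convention: snippet stores its answer in `result`
    except Exception as exc:
        queue.put(("error", repr(exc)))

def execute(code: str, timeout_s: float = 2.0):
    """Run a snippet in a child process and kill it if it exceeds the timeout."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run_snippet, args=(code, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()
        return ("error", "timeout")
    return queue.get() if not queue.empty() else ("error", "no output")

if __name__ == "__main__":
    print(execute("result = math.comb(10, 3)"))  # -> ('ok', 120)
```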

## Training with GRPO

We fine-tune the model using **GRPO**, a reward-based optimization that balances the following objectives (a sketch of how such a setup maps onto TRL follows this list):

- **Accuracy Reward:** +1 for correct answers.

- **Using code snippets:** +1 for generating code snippets, weighted 10:1 vs. the accuracy reward.

- **Length reduction:** shorter outputs are encouraged by limiting GRPO completion candidates to 5k tokens.

- **Temperature Scheduling:** We use linear temperature scheduling (T=1.2 → T=0.7) to balance exploration and stability during training: the higher temperature encourages exploration in the early phases, and lowering it as training progresses stabilizes the learned behavior.

- **In-context Learning**: we include 4 solved examples where the trace contains agent calls and executor outputs, so the model learns the syntax and the call/response pattern.

- **Dataset**: we used the Tool-Integrated Reasoning (TIR) subset of the [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) dataset. Note that GRPO only uses the <u>problem</u>, not the solution in the data. This dataset was chosen to ensure the problems benefit from the external tool.
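
As an illustration, the sketch below shows how such a reward setup could be expressed with TRL's `GRPOTrainer`. The answer-checking helper, the snippet marker, the reward weighting direction, the dataset split and column names, and the hyperparameters are assumptions for the example; the actual DeepMath training additionally routes generation through the modified vLLM client/server shown in Figure 1, so completions come from the agent rather than plain decoding.

```python
# Hedged sketch of a GRPO setup in the spirit of DeepMath, using TRL.
# Split/column names, the answer checker, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

SNIPPET_MARKER = chr(96) * 3 + "python"   # three backticks + "python"; placeholder for how agent calls are marked

def is_correct_answer(completion: str, reference: str) -> bool:
    # Simplistic check for illustration: the reference appears after the last "Final answer".
    return str(reference).strip() in completion.split("Final answer")[-1]

def accuracy_reward(completions, expected_answer, **kwargs):
    # +1 for correct answers (column name `expected_answer` is assumed).
    return [1.0 if is_correct_answer(c, a) else 0.0 for c, a in zip(completions, expected_answer)]

def code_snippet_reward(completions, **kwargs):
    # +1 when the trace contains a code snippet / agent call.
    return [1.0 if SNIPPET_MARKER in c else 0.0 for c in completions]

dataset = load_dataset("nvidia/OpenMathReasoning", split="tir")   # TIR subset; split name assumed
dataset = dataset.rename_column("problem", "prompt")              # GRPO only needs the problem text

config = GRPOConfig(
    output_dir="deepmath-grpo",
    max_completion_length=5120,   # ~5k-token cap nudges the model toward shorter traces
    num_generations=8,            # group size for the relative advantage estimate
    temperature=1.2,              # starting value; a linear schedule down to 0.7 would be applied over training
    reward_weights=[10.0, 1.0],   # 10:1 weighting between the two rewards (direction assumed)
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B-Thinking-2507",
    reward_funcs=[accuracy_reward, code_snippet_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()
```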

## Evaluation

We benchmarked DeepMath against baselines on four datasets. Metrics include:

- **majority@16**: robustness across samples, as used in previous math reasoning works (see references); a toy illustration of the metric follows this list.

- **Mean output length**: brevity.
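
For readers unfamiliar with the metric, majority@16 samples several completions per problem (16 here) and scores the answer that the samples agree on most often. A toy version, with pre-sampled answer strings standing in for the 16 model samples, might look like this:

```python
from collections import Counter

def majority_at_k(sampled_answers, reference, k=16):
    """Toy majority@k: take k sampled final answers, vote, and compare the winner
    to the reference answer."""
    votes = Counter(sampled_answers[:k])
    majority_answer, _ = votes.most_common(1)[0]
    return majority_answer == reference

# Example: 16 sampled answers for one problem, 10 of which agree on "120".
samples = ["120"] * 10 + ["118"] * 4 + ["121"] * 2
print(majority_at_k(samples, reference="120"))  # True
```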

<p align="center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/intel-deepmath/main-results.png" width=1000 alt="Main results table."/>
</p>

- We compare a baseline configuration ([Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507), no agent) with our DeepMath model. As an ablation, we evaluate the agentic framework we developed running with the untrained Qwen3 model, denoted by **+Agent**. Additionally, we examine whether the GRPO training (for agentic use) improves non-agentic inference, denoted by **+GRPO**. Thus the two ablations are independent, not additive.

- We observe that agentic inference reduces output lengths, with mixed accuracy results. The DeepMath model, which is both GRPO-trained and run in agentic mode, shows the highest accuracy with shortened traces. We conclude that **both GRPO training and agentic inference are needed** for best results.

**Key Insight:** DeepMath reduces output length by up to **66%** while improving accuracy on challenging datasets.

## Why It Matters

- **Accuracy:** Offloading computation reduces arithmetic errors.

- **Efficiency:** Shorter outputs mean faster inference and easier interpretability.

- **Safety:** Sandbox execution mitigates risks of running arbitrary code.

## Conclusion

DeepMath demonstrates a practical and lightweight way to combine a small executor with an LLM and to train the model to prefer short, computation-driven traces. Offloading deterministic computation reduces arithmetic and numerical errors and shortens traces, and GRPO fine-tuning further encourages concise, correct answers. The result is a more accurate and more interpretable math-solving agent without requiring a massive model or heavyweight external tools.

## Try It Yourself

Check out the [GitHub repo](https://github.com/IntelLabs/DeepMath) and share your feedback! Contributions welcome. 🚀

## Citation

If you use DeepMath in your research, please cite:

```bibtex
@software{deepmath2025,
  author = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
  year = {2025},
  publisher = {Intel AI Labs},
  url = {https://github.com/IntelLabs/DeepMath}
}
```

## Limitations & Future Work

- **Scope**: we focused on a small model and on mathematical reasoning.

- **Generalization**: evaluated on contest-style math; results may not transfer to open-ended mathematical creativity or formal proofs.

- Executing generated code is inherently risky. DeepMath uses strict sandboxing and resource limits, but any deployment should carefully manage attack surfaces and enforce rate limits.

## References

[^1]: Luo, Michael, Sijun Tan, Justin Wong, et al. 2025. “DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL.” <https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2>

[^2]: Liu, Mingjie, Shizhe Diao, Ximing Lu, et al. 2025. “ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models.” arXiv:2505.24864. Preprint, arXiv, May 30. <https://doi.org/10.48550/arXiv.2505.24864>

[^3]: Moshkov, Ivan, Darragh Hanley, Ivan Sorokin, et al. 2025. “AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning Dataset.” arXiv:2504.16891. Preprint, arXiv, April 23. <https://doi.org/10.48550/arXiv.2504.16891>
