Overview
The paper supplies reproducible artifacts and multi-benchmark evaluations showing robust improvements on math tasks; results are strong for research and prototypes but additional testing is needed for production safety and scale.
Citations13
Evidence Strength0.70
Confidence0.85
Risk Signals9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/7
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
LLEMMA gives stronger math and formal-proving ability than other open base models while being fully open-source, enabling companies to build reproducible math tools and lower development cost for math-heavy applications.
Who Should Care
Summary TLDR
LLEMMA is an open-source adaptation of Code Llama trained on Proof-Pile-2 (55B tokens of math papers, math web pages, and math code). The authors release 7B and 34B parameter base models, the Proof-Pile-2 corpus, and code. On standard math benchmarks (MATH, GSM8k, OCW) LLEMMA substantially improves few-shot reasoning and tool-augmented problem solving vs. Code Llama and other open models. It also shows few-shot capability for autoformalization and tactic prediction in Isabelle and Lean. The paper includes data-overlap checks and analysis of memorization.
Problem Statement
General-purpose LLMs are not optimized for mathematical reasoning. The paper asks whether continued pretraining on a math-focused corpus improves a base model's ability to solve math problems, use computational tools, and assist formal theorem provers, while remaining open and reproducible.
Main Contribution
Train and release LLEMMA base models specialized for math: 7B and 34B parameters.
Release Proof-Pile-2: a 55B-token open corpus of math papers, filtered web math (OpenWebMath), and AlgebraicStack code (11B tokens).
Key Findings
Continued pretraining on Proof-Pile-2 improves few-shot math reasoning.
Tool use further improves solved-rate on math problems.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | LLEMMA-34B 51.5% | Code Llama-34B 29.6% | +21.9 pp | GSM8k few-shot | Table 1 GSM8k few-shot | Table 1 |
| Accuracy | LLEMMA-34B 25.0% | Code Llama-34B 12.2% | +12.8 pp | MATH few-shot (greedy) | Table 1 MATH few-shot | Table 1 |
What To Try In 7 Days
Run the LLEMMA-7B model on a sample of your math-related tasks to compare accuracy and latency.
Prototype a Python-execute loop: prompt -> generate code -> run -> verify numeric answers.
Download Proof-Pile-2 or AlgebraicStack and test a 1-2 epoch continued pretraining on your own model for domain fit checks.
Agent Features
Tool Use
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Code URLs
Data URLs
Risks & Boundaries
Limitations
Knowledge cutoff April 2023; no post-cutoff math content (except Lean proofsteps).
Training instability: LLEMMA-7B stopped early due to NaN losses, indicating fragility in large runs.
When Not To Use
As a general-purpose chat assistant without further instruction-tuning or safety finetuning.
For tasks needing up-to-date knowledge after April 2023.
Failure Modes
Plausible but incorrect math steps or final answers (hallucinated reasoning).
Occasional memorized sequences that are not correct answers despite overlap.

