An open math-specialized LLM (7B & 34B) that improves math problem solving and formal proving

Overview

Decision SnapshotNeeds Validation

The paper supplies reproducible artifacts and multi-benchmark evaluations showing robust improvements on math tasks; results are strong for research and prototypes but additional testing is needed for production safety and scale.

Citations13

Evidence Strength0.70

Confidence0.85

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/7

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLEMMA gives stronger math and formal-proving ability than other open base models while being fully open-source, enabling companies to build reproducible math tools and lower development cost for math-heavy applications.

Who Should Care

ML Engineer Data Scientist Product Manager CTO Founder

Summary TLDR

LLEMMA is an open-source adaptation of Code Llama trained on Proof-Pile-2 (55B tokens of math papers, math web pages, and math code). The authors release 7B and 34B parameter base models, the Proof-Pile-2 corpus, and code. On standard math benchmarks (MATH, GSM8k, OCW) LLEMMA substantially improves few-shot reasoning and tool-augmented problem solving vs. Code Llama and other open models. It also shows few-shot capability for autoformalization and tactic prediction in Isabelle and Lean. The paper includes data-overlap checks and analysis of memorization.

Problem Statement

General-purpose LLMs are not optimized for mathematical reasoning. The paper asks whether continued pretraining on a math-focused corpus improves a base model's ability to solve math problems, use computational tools, and assist formal theorem provers, while remaining open and reproducible.

Main Contribution

Train and release LLEMMA base models specialized for math: 7B and 34B parameters.

Release Proof-Pile-2: a 55B-token open corpus of math papers, filtered web math (OpenWebMath), and AlgebraicStack code (11B tokens).

Key Findings

Continued pretraining on Proof-Pile-2 improves few-shot math reasoning.

NumbersGSM8k few-shot: LLEMMA-34B 51.5% vs Code Llama-34B 29.6% (+21.9 pp)

Practical UseIf you need better out-of-the-box math reasoning, continue pretraining a code-capable base model on targeted math text and code.

Evidence RefTable 1

Tool use further improves solved-rate on math problems.

NumbersGSM8k+Python pass@1: LLEMMA-34B 62.6% vs Code Llama-34B 52.7% (+9.9 pp)

Practical UseAdd a Python/SymPy execution loop to boost numeric-answer accuracy; implement in a few-shot prompt-and-execute pipeline.

Evidence RefTable 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	LLEMMA-34B 51.5%	Code Llama-34B 29.6%	+21.9 pp	GSM8k few-shot	Table 1 GSM8k few-shot	Table 1
Accuracy	LLEMMA-34B 25.0%	Code Llama-34B 12.2%	+12.8 pp	MATH few-shot (greedy)	Table 1 MATH few-shot	Table 1

What To Try In 7 Days

Run the LLEMMA-7B model on a sample of your math-related tasks to compare accuracy and latency.

Prototype a Python-execute loop: prompt -> generate code -> run -> verify numeric answers.

Download Proof-Pile-2 or AlgebraicStack and test a 1-2 epoch continued pretraining on your own model for domain fit checks.

Agent Features

Tool Use

Python executionSymPyIsabelle callsLean tactic generation

Optimization Features

Token Efficiency

context length 4096 tokens

Infra Optimization

trained on 256 A100 40GB GPUsworld size tuning (TP world sizes 2 and 8)

Model Optimization

initialized from Code Llama (Llama 2 weights)RoPE period contraction applied for 7B

System Optimization

FlashAttention 2 used to speed attentiontensor parallelism and ZeRO Stage 1 for memory scaling

Training Optimization

continued pretraining on Proof-Pile-2 mixturemixture weights chosen by minimizing perplexity on MATH training set (2:4:1 arXiv:Web:Code)trained in bfloat16 with cosine LR decay

Inference Optimization

no special inference-time distillation reported

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusYes

LicenseUnknown

Code URLs

https://github.com/EleutherAI/math-lm https://github.com/EleutherAI/LLEMMA (paper repo referenced)

Data URLs

https://huggingface.co/datasets/EleutherAI/proof-pile-2 https://doi.org/10.48550/arXiv.2310.06786 (OpenWebMath paper)AlgebraicStack (included in Proof-Pile-2 release)

Risks & Boundaries

Limitations

Knowledge cutoff April 2023; no post-cutoff math content (except Lean proofsteps).

Training instability: LLEMMA-7B stopped early due to NaN losses, indicating fragility in large runs.

When Not To Use

As a general-purpose chat assistant without further instruction-tuning or safety finetuning.

For tasks needing up-to-date knowledge after April 2023.

Failure Modes

Plausible but incorrect math steps or final answers (hallucinated reasoning).

Occasional memorized sequences that are not correct answers despite overlap.

Core Entities

Models

LLEMMA-7BLLEMMA-34BCode Llama-7BCode Llama-34BMinerva-8BMinerva-62BMinerva-540B

Metrics

Accuracyperplexityformal-proof pass rate

Datasets

Proof-Pile-2AlgebraicStackOpenWebMathArXiv (RedPajama subset)Pile (surrogate)

Benchmarks

MATHGSM8kOCWCoursesMMLU-STEMminiF2F

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Continued pretraining on Proof-Pile-2 improves few-shot math reasoning.

Tool use further improves solved-rate on math problems.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

LEXAM — 340 real law exams, 4.9k questions, and an expert-validated LLM judge for legal reasoning

Key finding

MULTICOM: a multilingual commonsense generation benchmark showing LLMs are better in English

Key finding

ID-MoCQA: 15,590 bilingual Indonesian multi-hop cultural QA items show models can identify regions but fail at situational cultural answers

Key finding

ERI: 57,750 engineering instruction-response items across 9 fields to test LLM reasoning and agent tool-use

Key finding

ElecBench — a domain benchmark that tests LLMs on power-dispatch scenarios across six practical metrics.

Key finding