An open math-specialized LLM (7B & 34B) that improves math problem solving and formal proving

October 16, 20237 min

Overview

Decision SnapshotNeeds Validation

The paper supplies reproducible artifacts and multi-benchmark evaluations showing robust improvements on math tasks; results are strong for research and prototypes but additional testing is needed for production safety and scale.

Citations13

Evidence Strength0.70

Confidence0.85

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/7

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLEMMA gives stronger math and formal-proving ability than other open base models while being fully open-source, enabling companies to build reproducible math tools and lower development cost for math-heavy applications.

Who Should Care

Summary TLDR

LLEMMA is an open-source adaptation of Code Llama trained on Proof-Pile-2 (55B tokens of math papers, math web pages, and math code). The authors release 7B and 34B parameter base models, the Proof-Pile-2 corpus, and code. On standard math benchmarks (MATH, GSM8k, OCW) LLEMMA substantially improves few-shot reasoning and tool-augmented problem solving vs. Code Llama and other open models. It also shows few-shot capability for autoformalization and tactic prediction in Isabelle and Lean. The paper includes data-overlap checks and analysis of memorization.

Problem Statement

General-purpose LLMs are not optimized for mathematical reasoning. The paper asks whether continued pretraining on a math-focused corpus improves a base model's ability to solve math problems, use computational tools, and assist formal theorem provers, while remaining open and reproducible.

Main Contribution

Train and release LLEMMA base models specialized for math: 7B and 34B parameters.

Release Proof-Pile-2: a 55B-token open corpus of math papers, filtered web math (OpenWebMath), and AlgebraicStack code (11B tokens).

Key Findings

Continued pretraining on Proof-Pile-2 improves few-shot math reasoning.

NumbersGSM8k few-shot: LLEMMA-34B 51.5% vs Code Llama-34B 29.6% (+21.9 pp)

Practical UseIf you need better out-of-the-box math reasoning, continue pretraining a code-capable base model on targeted math text and code.

Evidence RefTable 1

Tool use further improves solved-rate on math problems.

NumbersGSM8k+Python pass@1: LLEMMA-34B 62.6% vs Code Llama-34B 52.7% (+9.9 pp)

Practical UseAdd a Python/SymPy execution loop to boost numeric-answer accuracy; implement in a few-shot prompt-and-execute pipeline.

Evidence RefTable 3

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
AccuracyLLEMMA-34B 51.5%Code Llama-34B 29.6%+21.9 ppGSM8k few-shotTable 1 GSM8k few-shotTable 1
AccuracyLLEMMA-34B 25.0%Code Llama-34B 12.2%+12.8 ppMATH few-shot (greedy)Table 1 MATH few-shotTable 1

What To Try In 7 Days

Run the LLEMMA-7B model on a sample of your math-related tasks to compare accuracy and latency.

Prototype a Python-execute loop: prompt -> generate code -> run -> verify numeric answers.

Download Proof-Pile-2 or AlgebraicStack and test a 1-2 epoch continued pretraining on your own model for domain fit checks.

Agent Features

Tool Use
Python executionSymPyIsabelle callsLean tactic generation

Optimization Features

Token Efficiency
context length 4096 tokens
Infra Optimization
trained on 256 A100 40GB GPUsworld size tuning (TP world sizes 2 and 8)
Model Optimization
initialized from Code Llama (Llama 2 weights)RoPE period contraction applied for 7B
System Optimization
FlashAttention 2 used to speed attentiontensor parallelism and ZeRO Stage 1 for memory scaling
Training Optimization
continued pretraining on Proof-Pile-2 mixturemixture weights chosen by minimizing perplexity on MATH training set (2:4:1 arXiv:Web:Code)trained in bfloat16 with cosine LR decay
Inference Optimization
no special inference-time distillation reported

Reproducibility

Risks & Boundaries

Limitations

Knowledge cutoff April 2023; no post-cutoff math content (except Lean proofsteps).

Training instability: LLEMMA-7B stopped early due to NaN losses, indicating fragility in large runs.

When Not To Use

As a general-purpose chat assistant without further instruction-tuning or safety finetuning.

For tasks needing up-to-date knowledge after April 2023.

Failure Modes

Plausible but incorrect math steps or final answers (hallucinated reasoning).

Occasional memorized sequences that are not correct answers despite overlap.

Core Entities

Models

LLEMMA-7BLLEMMA-34BCode Llama-7BCode Llama-34BMinerva-8BMinerva-62BMinerva-540B

Metrics

Accuracyperplexityformal-proof pass rate

Datasets

Proof-Pile-2AlgebraicStackOpenWebMathArXiv (RedPajama subset)Pile (surrogate)

Benchmarks

MATHGSM8kOCWCoursesMMLU-STEMminiF2F