An open math-specialized LLM (7B & 34B) that improves math problem solving and formal proving

October 16, 20237 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

13

Authors

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck

Links

Abstract / PDF

Why It Matters For Business

LLEMMA gives stronger math and formal-proving ability than other open base models while being fully open-source, enabling companies to build reproducible math tools and lower development cost for math-heavy applications.

Summary TLDR

LLEMMA is an open-source adaptation of Code Llama trained on Proof-Pile-2 (55B tokens of math papers, math web pages, and math code). The authors release 7B and 34B parameter base models, the Proof-Pile-2 corpus, and code. On standard math benchmarks (MATH, GSM8k, OCW) LLEMMA substantially improves few-shot reasoning and tool-augmented problem solving vs. Code Llama and other open models. It also shows few-shot capability for autoformalization and tactic prediction in Isabelle and Lean. The paper includes data-overlap checks and analysis of memorization.

Problem Statement

General-purpose LLMs are not optimized for mathematical reasoning. The paper asks whether continued pretraining on a math-focused corpus improves a base model's ability to solve math problems, use computational tools, and assist formal theorem provers, while remaining open and reproducible.

Main Contribution

Train and release LLEMMA base models specialized for math: 7B and 34B parameters.

Release Proof-Pile-2: a 55B-token open corpus of math papers, filtered web math (OpenWebMath), and AlgebraicStack code (11B tokens).

Show improved few-shot math reasoning and higher performance when using tools (Python, SymPy).

Demonstrate few-shot formal theorem proving: informal-to-formal (Isabelle) and formal-to-formal tactic prediction (Lean) without finetuning.

Publish code, datasets, and evaluation harness to enable reproduction and follow-up work.

Key Findings

Continued pretraining on Proof-Pile-2 improves few-shot math reasoning.

NumbersGSM8k few-shot: LLEMMA-34B 51.5% vs Code Llama-34B 29.6% (+21.9 pp)

Tool use further improves solved-rate on math problems.

NumbersGSM8k+Python pass@1: LLEMMA-34B 62.6% vs Code Llama-34B 52.7% (+9.9 pp)

LLEMMA beats other open base models on the MATH benchmark.

NumbersMATH few-shot: LLEMMA-34B 25.0% vs Code Llama-34B 12.2% (+12.8 pp)

Few-shot formal theorem proving improved via math-focused pretraining.

NumbersIsabelle informal-to-formal (miniF2F-test): LLEMMA-7B 22.13% vs Code Llama-7B 17.62% (+4.51 pp)

Training-data overlap (30-gram matches) with MATH problems is limited and not clearly tied to correct answers.

Numbers≈7% of MATH problems and 0.6% of MATH solutions have 30-gram hits in Proof-Pile-2

Results

Accuracy

ValueLLEMMA-34B 51.5%

BaselineCode Llama-34B 29.6%

Accuracy

ValueLLEMMA-34B 25.0%

BaselineCode Llama-34B 12.2%

GSM8k+Python pass@1

ValueLLEMMA-34B 62.6%

BaselineCode Llama-34B 52.7%

MATH+Python pass@1

ValueLLEMMA-34B 27.1%

BaselineCode Llama-34B 23.5%

Isabelle informal-to-formal pass@1

ValueLLEMMA-7B 22.13%

BaselineCode Llama-7B 17.62%

Lean formal-to-formal pass@search

ValueLLEMMA-34B 25.82%

BaselineCode Llama-34B 22.13%

Training-data overlap (30-gram hits)

Value≈7% problems, 0.6% solutions

Who Should Care

What To Try In 7 Days

Run the LLEMMA-7B model on a sample of your math-related tasks to compare accuracy and latency.

Prototype a Python-execute loop: prompt -> generate code -> run -> verify numeric answers.

Download Proof-Pile-2 or AlgebraicStack and test a 1-2 epoch continued pretraining on your own model for domain fit checks.

Agent Features

Tool Use

  • Python execution
  • SymPy
  • Isabelle calls
  • Lean tactic generation

Optimization Features

Token Efficiency

  • context length 4096 tokens

Infra Optimization

  • trained on 256 A100 40GB GPUs
  • world size tuning (TP world sizes 2 and 8)

Model Optimization

  • initialized from Code Llama (Llama 2 weights)
  • RoPE period contraction applied for 7B

System Optimization

  • FlashAttention 2 used to speed attention
  • tensor parallelism and ZeRO Stage 1 for memory scaling

Training Optimization

  • continued pretraining on Proof-Pile-2 mixture
  • mixture weights chosen by minimizing perplexity on MATH training set (2:4:1 arXiv:Web:Code)
  • trained in bfloat16 with cosine LR decay

Inference Optimization

  • no special inference-time distillation reported

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Knowledge cutoff April 2023; no post-cutoff math content (except Lean proofsteps).
  • Training instability: LLEMMA-7B stopped early due to NaN losses, indicating fragility in large runs.
  • Benchmarks remain imperfect proxies; some web-derived alternative solutions cause overlap that complicates evaluation.

When Not To Use

  • As a general-purpose chat assistant without further instruction-tuning or safety finetuning.
  • For tasks needing up-to-date knowledge after April 2023.
  • Where strict formal guarantees are required without human verification of generated proofs.

Failure Modes

  • Plausible but incorrect math steps or final answers (hallucinated reasoning).
  • Occasional memorized sequences that are not correct answers despite overlap.
  • Timeouts or search-budget limits when applying best-first search in formal proving.

Core Entities

Models

  • LLEMMA-7B
  • LLEMMA-34B
  • Code Llama-7B
  • Code Llama-34B
  • Minerva-8B
  • Minerva-62B
  • Minerva-540B

Metrics

  • Accuracy
  • perplexity
  • formal-proof pass rate

Datasets

  • Proof-Pile-2
  • AlgebraicStack
  • OpenWebMath
  • ArXiv (RedPajama subset)
  • Pile (surrogate)

Benchmarks

  • MATH
  • GSM8k
  • OCWCourses
  • MMLU-STEM
  • miniF2F