Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
13
Why It Matters For Business
LLEMMA gives stronger math and formal-proving ability than other open base models while being fully open-source, enabling companies to build reproducible math tools and lower development cost for math-heavy applications.
Summary TLDR
LLEMMA is an open-source adaptation of Code Llama trained on Proof-Pile-2 (55B tokens of math papers, math web pages, and math code). The authors release 7B and 34B parameter base models, the Proof-Pile-2 corpus, and code. On standard math benchmarks (MATH, GSM8k, OCW) LLEMMA substantially improves few-shot reasoning and tool-augmented problem solving vs. Code Llama and other open models. It also shows few-shot capability for autoformalization and tactic prediction in Isabelle and Lean. The paper includes data-overlap checks and analysis of memorization.
Problem Statement
General-purpose LLMs are not optimized for mathematical reasoning. The paper asks whether continued pretraining on a math-focused corpus improves a base model's ability to solve math problems, use computational tools, and assist formal theorem provers, while remaining open and reproducible.
Main Contribution
Train and release LLEMMA base models specialized for math: 7B and 34B parameters.
Release Proof-Pile-2: a 55B-token open corpus of math papers, filtered web math (OpenWebMath), and AlgebraicStack code (11B tokens).
Show improved few-shot math reasoning and higher performance when using tools (Python, SymPy).
Demonstrate few-shot formal theorem proving: informal-to-formal (Isabelle) and formal-to-formal tactic prediction (Lean) without finetuning.
Publish code, datasets, and evaluation harness to enable reproduction and follow-up work.
Key Findings
Continued pretraining on Proof-Pile-2 improves few-shot math reasoning.
Tool use further improves solved-rate on math problems.
LLEMMA beats other open base models on the MATH benchmark.
Few-shot formal theorem proving improved via math-focused pretraining.
Training-data overlap (30-gram matches) with MATH problems is limited and not clearly tied to correct answers.
Results
Accuracy
Accuracy
GSM8k+Python pass@1
MATH+Python pass@1
Isabelle informal-to-formal pass@1
Lean formal-to-formal pass@search
Training-data overlap (30-gram hits)
Who Should Care
What To Try In 7 Days
Run the LLEMMA-7B model on a sample of your math-related tasks to compare accuracy and latency.
Prototype a Python-execute loop: prompt -> generate code -> run -> verify numeric answers.
Download Proof-Pile-2 or AlgebraicStack and test a 1-2 epoch continued pretraining on your own model for domain fit checks.
Agent Features
Tool Use
- Python execution
- SymPy
- Isabelle calls
- Lean tactic generation
Optimization Features
Token Efficiency
- context length 4096 tokens
Infra Optimization
- trained on 256 A100 40GB GPUs
- world size tuning (TP world sizes 2 and 8)
Model Optimization
- initialized from Code Llama (Llama 2 weights)
- RoPE period contraction applied for 7B
System Optimization
- FlashAttention 2 used to speed attention
- tensor parallelism and ZeRO Stage 1 for memory scaling
Training Optimization
- continued pretraining on Proof-Pile-2 mixture
- mixture weights chosen by minimizing perplexity on MATH training set (2:4:1 arXiv:Web:Code)
- trained in bfloat16 with cosine LR decay
Inference Optimization
- no special inference-time distillation reported
Reproducibility
Code Urls
Data Urls
- https://huggingface.co/datasets/EleutherAI/proof-pile-2
- https://doi.org/10.48550/arXiv.2310.06786 (OpenWebMath paper)
- AlgebraicStack (included in Proof-Pile-2 release)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Knowledge cutoff April 2023; no post-cutoff math content (except Lean proofsteps).
- Training instability: LLEMMA-7B stopped early due to NaN losses, indicating fragility in large runs.
- Benchmarks remain imperfect proxies; some web-derived alternative solutions cause overlap that complicates evaluation.
When Not To Use
- As a general-purpose chat assistant without further instruction-tuning or safety finetuning.
- For tasks needing up-to-date knowledge after April 2023.
- Where strict formal guarantees are required without human verification of generated proofs.
Failure Modes
- Plausible but incorrect math steps or final answers (hallucinated reasoning).
- Occasional memorized sequences that are not correct answers despite overlap.
- Timeouts or search-budget limits when applying best-first search in formal proving.
Core Entities
Models
- LLEMMA-7B
- LLEMMA-34B
- Code Llama-7B
- Code Llama-34B
- Minerva-8B
- Minerva-62B
- Minerva-540B
Metrics
- Accuracy
- perplexity
- formal-proof pass rate
Datasets
- Proof-Pile-2
- AlgebraicStack
- OpenWebMath
- ArXiv (RedPajama subset)
- Pile (surrogate)
Benchmarks
- MATH
- GSM8k
- OCWCourses
- MMLU-STEM
- miniF2F

