Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.7
Citation Count
38
Why It Matters For Business
LeanDojo lowers the entry cost for ML research on formal proofs: open data and code let teams reproduce and iterate on provers with a single GPU-week instead of thousands of GPU-days.
Summary TLDR
LeanDojo is an open-source playground for formal theorem proving in Lean: tools to extract fine-grained proof data (states, tactics, premises), a 98.7K-theorem benchmark with a challenging split, and ReProver — a retrieval-augmented tactic generator. ReProver uses a DPR-style retriever restricted to accessible premises plus hard negatives and a ByT5 generator. It trains in ~1 GPU-week and achieves Pass@1 51.2% on the random split and 26.3% on the novel-premises split, outperforming non-retrieval baselines and a GPT-4 zero-shot tactic baseline. All code, data, and models are released.
Problem Statement
LLM-based provers for proof assistants are powerful but hard to reproduce: private code/data, huge compute, and poor premise selection. Proving in Lean needs picking relevant premises from a huge library; LLMs with limited context windows cannot reliably generalize to theorems that require lemmas unseen in training.
Main Contribution
LeanDojo: an open-source toolkit that extracts Lean proof trees, states, and explicit premise definitions and enables reliable programmatic interaction with Lean.
LeanDojo Benchmark: 98,734 theorems (with premise definitions) plus a 'novel_premises' split to stress generalization.
ReProver: a retrieval-augmented prover combining DPR-style retrieval (accessible-premise filtering + in-file hard negatives) with a ByT5 tactic generator.
Open release: code, datasets, and model checkpoints under permissive licenses to allow low-cost reproduction.
Key Findings
Retrieval improves end-to-end proving rates.
Retriever (DPR + program analysis) recovers relevant premises much better than BM25.
Program-analysis and hard negatives materially help retrieval.
LeanDojo interaction is far more reliable than prior tools.
Benchmark scale and split expose memorization shortcuts.
Results
Pass@1 (theorem proving)
Pass@1 (theorem proving)
Premise retrieval R@1 / R@10 / MRR
Interaction reliability (proof checking error rate)
Dataset size
Who Should Care
What To Try In 7 Days
Install LeanDojo and run the provided example to extract proof states from a small mathlib subset.
Run ReProver inference on a few target theorems to measure Pass@1 and inspect retrieved premises.
Compare BM25 vs DPR retrieval on your Lean codebase to see retrieval gains for premise selection quickly.
Agent Features
Memory
- retrieval memory (external premise corpus)
Tool Use
- programmatic interaction with Lean (gym-like interface)
- ChatGPT plugin for interactive use
Frameworks
- ByT5
- Dense Passage Retriever (DPR)
Architectures
- encoder-decoder Transformer (ByT5)
- Transformer encoder for DPR
Optimization Features
Token Efficiency
- input truncated to 2,300 tokens; fits ~10–15 premises
- ByT5 byte-level model avoids tokenization but increases sequence length
Model Optimization
- use of ByT5 small checkpoint (299M) to reduce compute
- dropout on retrieved premises at training (rate 0.5)
System Optimization
- mixed precision (bfloat16) and DeepSpeed ZeRO Stage 2 used during finetuning
Training Optimization
- contrastive retriever training with in-batch negatives and in-file hard negatives
- two-stage train: retriever then generator
Inference Optimization
- precompute premise embeddings; single forward pass to embed state
- best-first search with beam-generated tactic candidates
Reproducibility
License
- Code: MIT; Data: CC BY 2.0; mathlib/Lean dependencies: Apache 2.0
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Input length limits mean only ~10–15 retrieved premises can be fed to the generator.
- Base model is ByT5-small (299M); stronger LLMs might improve performance but cost more.
- No reinforcement learning or auxiliary synthetic data used; those could raise results but increase complexity.
- Benchmark is Lean-centric (mathlib); cross-assistant generalization is limited.
When Not To Use
- When you need proofs that require fusing hundreds of premises in one step.
- If your target proof assistant is not Lean (tools tailored to Lean).
- For high-assurance production verification without further human review.
Failure Modes
- Retriever misses the ground-truth premise, so generator cannot produce the needed tactic.
- Model hallucinates tactics that look plausible but fail when run in Lean.
- Proof environment mismatches can still cause rare misjudgments (~1.4%).
Core Entities
Models
- google/byt5-small (ByT5)
- ReProver (retriever + generator)
- Dense Passage Retriever (DPR)
- GPT-4 (zero-shot baseline)
Metrics
- Pass@1
- R@k (R@1, R@10)
- MRR
Datasets
- LeanDojo Benchmark (Lean 3) - 98,734 theorems
- LeanDojo Benchmark 4 - 102,514 theorems
- MiniF2F
- ProofNet
Benchmarks
- LeanDojo Benchmark
- LeanDojo Benchmark 4
- MiniF2F
- ProofNet

