Overview
Retrieval plus program analysis yields clear gains for premise selection and proof success, but context-length limits and smaller model size cap top performance; reproducible and low-cost to run.
Citations38
Evidence Strength0.80
Confidence0.90
Risk Signals10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/5
Reproducibility
Status: Code + data available
Open source: Yes
License: Code: MIT; Data: CC BY 2.0; mathlib/Lean dependencies: Apache 2.0
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
LeanDojo lowers the entry cost for ML research on formal proofs: open data and code let teams reproduce and iterate on provers with a single GPU-week instead of thousands of GPU-days.
Who Should Care
Summary TLDR
LeanDojo is an open-source playground for formal theorem proving in Lean: tools to extract fine-grained proof data (states, tactics, premises), a 98.7K-theorem benchmark with a challenging split, and ReProver — a retrieval-augmented tactic generator. ReProver uses a DPR-style retriever restricted to accessible premises plus hard negatives and a ByT5 generator. It trains in ~1 GPU-week and achieves Pass@1 51.2% on the random split and 26.3% on the novel-premises split, outperforming non-retrieval baselines and a GPT-4 zero-shot tactic baseline. All code, data, and models are released.
Problem Statement
LLM-based provers for proof assistants are powerful but hard to reproduce: private code/data, huge compute, and poor premise selection. Proving in Lean needs picking relevant premises from a huge library; LLMs with limited context windows cannot reliably generalize to theorems that require lemmas unseen in training.
Main Contribution
LeanDojo: an open-source toolkit that extracts Lean proof trees, states, and explicit premise definitions and enables reliable programmatic interaction with Lean.
LeanDojo Benchmark: 98,734 theorems (with premise definitions) plus a 'novel_premises' split to stress generalization.
Key Findings
Retrieval improves end-to-end proving rates.
Retriever (DPR + program analysis) recovers relevant premises much better than BM25.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Pass@1 (theorem proving) | 51.2% | non-retrieval generator | +3.6pp | LeanDojo Benchmark (random test) | ReProver w/ retrieval 51.2% vs generator-only 47.6% (Sec.6 Table 2) | Table 2 |
| Pass@1 (theorem proving) | 26.3% | non-retrieval generator 23.2% | +3.1pp | LeanDojo Benchmark (novel_premises) | ReProver (novel_premises split) improves over non-retrieval baseline (Sec.6 Table 2) | Table 2 |
What To Try In 7 Days
Install LeanDojo and run the provided example to extract proof states from a small mathlib subset.
Run ReProver inference on a few target theorems to measure Pass@1 and inspect retrieved premises.
Compare BM25 vs DPR retrieval on your Lean codebase to see retrieval gains for premise selection quickly.
Agent Features
Memory
Tool Use
Frameworks
Architectures
Optimization Features
Token Efficiency
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Input length limits mean only ~10–15 retrieved premises can be fed to the generator.
Base model is ByT5-small (299M); stronger LLMs might improve performance but cost more.
When Not To Use
When you need proofs that require fusing hundreds of premises in one step.
If your target proof assistant is not Lean (tools tailored to Lean).
Failure Modes
Retriever misses the ground-truth premise, so generator cannot produce the needed tactic.
Model hallucinates tactics that look plausible but fail when run in Lean.

