Open-source toolkit, benchmark, and a retrieval-augmented LLM that proves Lean theorems after about one GPU-week of training

Overview

Decision Snapshot: Ready For Pilot

Retrieval plus program analysis yields clear gains for premise selection and proof success, but context-length limits and smaller model size cap top performance; reproducible and low-cost to run.

Citations: 38

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

License: Code: MIT; Data: CC BY 2.0; mathlib/Lean dependencies: Apache 2.0

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 65%

Authors

Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, Anima Anandkumar

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LeanDojo lowers the entry cost for ML research on formal proofs: open data and code let teams reproduce and iterate on provers with a single GPU-week instead of thousands of GPU-days.

Who Should Care

CTO, ML Engineer, Engineering Lead, Data Scientist, Product Manager

Summary TLDR

LeanDojo is an open-source playground for formal theorem proving in Lean. It provides tools to extract fine-grained proof data (states, tactics, premises), a 98.7K-theorem benchmark with a challenging split, and ReProver, a retrieval-augmented tactic generator. ReProver pairs a DPR-style retriever, restricted to accessible premises and trained with hard negatives, with a ByT5 generator. It trains in ~1 GPU-week and reaches Pass@1 of 51.2% on the random split and 26.3% on the novel-premises split, outperforming non-retrieval baselines and a GPT-4 zero-shot tactic baseline. All code, data, and models are released.
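The retrieval step described above (embed the proof state, then rank only the premises accessible at that point) can be sketched as follows. This is a minimal illustration, not ReProver's implementation: the two-dimensional embeddings, the premise names, and the use of cosine similarity are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(state_vec, premise_vecs, accessible, k=2):
    """Rank only the accessible premises by similarity to the proof state."""
    scored = [
        (cosine(state_vec, premise_vecs[name]), name)
        for name in accessible
        if name in premise_vecs
    ]
    # Highest similarity first; ties broken by name for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [name for _, name in scored[:k]]

# Hypothetical toy embeddings; a real system would use a trained encoder.
premise_vecs = {
    "add_comm": [1.0, 0.0],
    "mul_comm": [0.0, 1.0],
    "add_assoc": [0.9, 0.1],
    "future_lemma": [1.0, 0.05],  # defined after the theorem, so not accessible
}
accessible = {"add_comm", "mul_comm", "add_assoc"}
state_vec = [1.0, 0.1]

top = retrieve(state_vec, premise_vecs, accessible, k=2)
print(top)
```

Filtering before ranking means an inaccessible premise can never take up one of the few slots in the generator's input, however similar its embedding is.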

Problem Statement

LLM-based provers for proof assistants are powerful but hard to reproduce: they rely on private code and data, need huge compute, and select premises poorly. Proving in Lean requires picking relevant premises from a huge library; LLMs with limited context windows cannot see the whole library and do not reliably generalize to theorems that require lemmas unseen in training.

Main Contribution

LeanDojo: an open-source toolkit that extracts Lean proof trees, states, and explicit premise definitions and enables reliable programmatic interaction with Lean.

LeanDojo Benchmark: 98,734 theorems (with premise definitions) plus a 'novel_premises' split to stress generalization.

Key Findings

Retrieval improves end-to-end proving rates.

Numbers: ReProver Pass@1 51.2% vs non-retrieval baseline 47.6% (random split)

Practical Use: Add premise retrieval before tactic generation to get measurable proof success gains with modest compute.

Evidence Ref: Sec. 6, Abstract, Table 2

Retriever (DPR + program analysis) recovers relevant premises much better than BM25.

Numbers: R@1 13.5% vs BM25 R@1 6.7% (random)

Practical Use: Use learned dense retrievers and restrict search to accessible premises to raise hit rates for premise selection.

Evidence Ref: Table 1 (Premise selection)
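For readers who want to reproduce retrieval numbers of this kind, the sketch below computes R@k and MRR on toy data. The definitions used here (recall as the fraction of ground-truth premises found in the top k, MRR as the reciprocal rank of the first ground-truth premise, both averaged over queries) are common conventions and an assumption of this example; the rankings and premise names are made up.

```python
def recall_at_k(ranked, gold, k):
    """Fraction of ground-truth premises that appear in the top-k results."""
    if not gold:
        return 0.0
    hits = sum(1 for p in ranked[:k] if p in gold)
    return hits / len(gold)

def reciprocal_rank(ranked, gold):
    """1 / rank of the first ground-truth premise, or 0 if none is retrieved."""
    for rank, premise in enumerate(ranked, start=1):
        if premise in gold:
            return 1.0 / rank
    return 0.0

# Hypothetical (ranking, ground-truth set) pairs, one per proof state.
queries = [
    (["add_comm", "mul_comm", "add_assoc"], {"add_comm"}),
    (["mul_one", "one_mul", "mul_comm"], {"one_mul", "mul_assoc"}),
]

r_at_1 = sum(recall_at_k(r, g, 1) for r, g in queries) / len(queries)
r_at_2 = sum(recall_at_k(r, g, 2) for r, g in queries) / len(queries)
mrr = sum(reciprocal_rank(r, g) for r, g in queries) / len(queries)
print(r_at_1, r_at_2, mrr)
```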

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pass@1 (theorem proving)	51.2%	non-retrieval generator 47.6%	+3.6pp	LeanDojo Benchmark (random test)	ReProver w/ retrieval 51.2% vs generator-only 47.6% (Sec. 6, Table 2)	Table 2
Pass@1 (theorem proving)	26.3%	non-retrieval generator 23.2%	+3.1pp	LeanDojo Benchmark (novel_premises)	ReProver (novel_premises split) improves over non-retrieval baseline (Sec.6 Table 2)	Table 2

What To Try In 7 Days

Install LeanDojo and run the provided example to extract proof states from a small mathlib subset.

Run ReProver inference on a few target theorems to measure Pass@1 and inspect retrieved premises.

Compare BM25 vs DPR retrieval on your Lean codebase to see retrieval gains for premise selection quickly.
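As a starting point for the BM25 side of that comparison, here is a small self-contained BM25 ranker over premise statements. It is a sketch under stated assumptions: the whitespace tokenizer, the parameter values k1 = 1.5 and b = 0.75, and the three toy premises are illustrative choices, not settings taken from the paper.

```python
import math
from collections import Counter

def tokenize(text):
    """Crude tokenizer: lowercase, strip parentheses, split on whitespace."""
    return text.lower().replace("(", " ").replace(")", " ").split()

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: number of documents containing each term.
        df = Counter(t for d in self.docs for t in set(d))
        n = len(self.docs)
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for t, c in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        total = 0.0
        length_norm = 1 - self.b + self.b * len(self.docs[i]) / self.avgdl
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def rank(self, query):
        """Document indices sorted from best to worst match."""
        return sorted(range(len(self.docs)), key=lambda i: -self.score(query, i))

# Hypothetical premise statements standing in for a Lean library.
premises = [
    "theorem add_comm (a b : nat) : a + b = b + a",
    "theorem mul_comm (a b : nat) : a * b = b * a",
    "theorem add_zero (a : nat) : a + 0 = a",
]
bm25 = BM25(premises)
ranking = bm25.rank("x + 0 = x")
print([premises[i] for i in ranking])
```

Feeding the rankings from this baseline and from a dense retriever into the same R@k and MRR computation gives a like-for-like comparison on your own codebase.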

Agent Features

Memory

retrieval memory (external premise corpus)

Tool Use

programmatic interaction with Lean (gym-like interface)

ChatGPT plugin for interactive use

Frameworks

ByT5

Dense Passage Retriever (DPR)

Architectures

encoder-decoder Transformer (ByT5)

Transformer encoder for DPR

Optimization Features

Token Efficiency

input truncated to 2,300 tokens; fits ~10–15 premises

ByT5 byte-level model avoids tokenization but increases sequence length
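The budget arithmetic behind the "~10–15 premises" figure can be illustrated with a short sketch. Because ByT5 works on bytes, the example counts UTF-8 bytes as tokens. The input layout (premises first, newline separator, state last), the function name build_input, and the 150-byte premise and 200-byte state sizes are assumptions chosen for illustration.

```python
MAX_INPUT_BYTES = 2300  # input limit quoted in the summary above

def build_input(state, premises, budget=MAX_INPUT_BYTES, sep="\n"):
    """Prepend as many premises as fit in the byte budget, keeping the state intact.

    `premises` is assumed to be sorted best-first by the retriever.
    """
    # If the state alone exceeds the budget, cut it at a valid UTF-8 boundary.
    state = state.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    remaining = budget - len(state.encode("utf-8"))
    kept = []
    for premise in premises:
        cost = len((premise + sep).encode("utf-8"))
        if cost > remaining:
            break  # stop at the first premise that does not fit
        kept.append(premise)
        remaining -= cost
    text = "".join(p + sep for p in kept) + state
    return text, len(kept)

# Hypothetical sizes: a 200-byte state and twenty 150-byte premises.
text, n_kept = build_input("s" * 200, ["p" * 150] * 20)
print(n_kept, len(text.encode("utf-8")))
```

With these made-up sizes, 13 of the 20 retrieved premises fit, which is in line with the range quoted above; longer premise statements would lower that count.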

Model Optimization

use of ByT5 small checkpoint (299M) to reduce compute

dropout on retrieved premises at training (rate 0.5)

System Optimization

mixed precision (bfloat16) and DeepSpeed ZeRO Stage 2 during fine-tuning

Training Optimization

contrastive retriever training with in-batch negatives and in-file hard negatives

two-stage training: retriever, then generator
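One way to picture the hard-negative part of this setup is a sampler that draws some negatives from the same file and fills the rest at random from the corpus. The sketch below is an assumption-laden illustration: the function name sample_negatives, the counts n and k, the premise names, and the reading of "in-file" as a caller-supplied list of same-file premises are all hypothetical. In-batch negatives need no sampling, since the positives of the other examples in a batch play that role.

```python
import random

def sample_negatives(positive, in_file, corpus, n=3, k=1, seed=0):
    """Pick up to k in-file hard negatives, then fill up to n with random ones."""
    rng = random.Random(seed)  # seeded for reproducible examples
    hard_pool = [p for p in in_file if p != positive]
    hard = rng.sample(hard_pool, min(k, len(hard_pool)))
    random_pool = [p for p in corpus if p != positive and p not in hard]
    rand = rng.sample(random_pool, min(n - len(hard), len(random_pool)))
    return hard + rand

# Hypothetical premise names.
in_file = ["add_comm", "add_assoc", "add_zero"]
corpus = in_file + ["mul_comm", "mul_one", "one_mul"]

negatives = sample_negatives("add_comm", in_file, corpus, n=3, k=1)
print(negatives)
```

Same-file premises tend to look similar to the ground truth, so including them forces the retriever to learn finer distinctions than random negatives alone would.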

Inference Optimization

precompute premise embeddings; single forward pass to embed state

best-first search with beam-generated tactic candidates
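The best-first search mentioned above can be sketched with a priority queue keyed on cumulative log-probability. This is a toy illustration rather than ReProver's search code: integers stand in for Lean proof states, the two "tactics" and their probabilities are invented, and the callbacks propose, step, and is_solved are hypothetical stand-ins for the tactic generator and the Lean environment.

```python
import heapq
import math

def best_first_search(init_state, propose, step, is_solved, max_expansions=100):
    """Expand the most probable open state first; return a tactic list or None."""
    counter = 0  # tie-breaker so the heap never compares states or paths
    frontier = [(0.0, counter, init_state, [])]  # (-cumulative logprob, tie, state, path)
    visited = {init_state}
    for _ in range(max_expansions):
        if not frontier:
            return None
        neg_logprob, _tie, state, path = heapq.heappop(frontier)
        for tactic, logprob in propose(state):
            nxt = step(state, tactic)
            if nxt is None:  # tactic failed, as a hallucinated tactic would in Lean
                continue
            new_path = path + [tactic]
            if is_solved(nxt):
                return new_path
            if nxt in visited:
                continue
            visited.add(nxt)
            counter += 1
            heapq.heappush(frontier, (neg_logprob - logprob, counter, nxt, new_path))
    return None

# Toy environment: reduce an integer to 0 using "halve" (even numbers only) or "dec".
def propose(state):
    return [("halve", math.log(0.6)), ("dec", math.log(0.4))]

def step(state, tactic):
    if tactic == "halve":
        return state // 2 if state % 2 == 0 else None
    return state - 1

proof = best_first_search(6, propose, step, lambda s: s == 0)
print(proof)
```

Failed tactics are simply dropped, which is how invalid generator outputs get filtered when each candidate is checked against the proof assistant.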

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Code: MIT; Data: CC BY 2.0; mathlib/Lean dependencies: Apache 2.0

Code URLs

https://github.com/lean-dojo/LeanDojo

https://github.com/lean-dojo/ReProver

Data URLs

https://doi.org/10.5281/zenodo.8016385

https://doi.org/10.5281/zenodo.8040109

https://leandojo.org

Risks & Boundaries

Limitations

Input length limits mean only ~10–15 retrieved premises can be fed to the generator.

Base model is ByT5-small (299M); stronger LLMs might improve performance but cost more.

When Not To Use

When you need proofs that require fusing hundreds of premises in one step.

If your target proof assistant is not Lean (tools tailored to Lean).

Failure Modes

The retriever misses the ground-truth premise, so the generator cannot produce the needed tactic.

The model hallucinates tactics that look plausible but fail when run in Lean.

Core Entities

Models

google/byt5-small (ByT5)

ReProver (retriever + generator)

Dense Passage Retriever (DPR)

GPT-4 (zero-shot baseline)

Metrics

Pass@1

R@k (R@1, R@10)

MRR

Datasets

LeanDojo Benchmark (Lean 3) - 98,734 theorems

LeanDojo Benchmark 4 - 102,514 theorems

MiniF2F

ProofNet

Benchmarks

LeanDojo Benchmark

LeanDojo Benchmark 4

MiniF2F

ProofNet

