Open-source toolkit, benchmark, and a retrieval-augmented LLM that proves Lean theorems on one GPU-week

June 27, 20238 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

38

Authors

Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, Anima Anandkumar

Links

Abstract / PDF

Why It Matters For Business

LeanDojo lowers the entry cost for ML research on formal proofs: open data and code let teams reproduce and iterate on provers with a single GPU-week instead of thousands of GPU-days.

Summary TLDR

LeanDojo is an open-source playground for formal theorem proving in Lean: tools to extract fine-grained proof data (states, tactics, premises), a 98.7K-theorem benchmark with a challenging split, and ReProver — a retrieval-augmented tactic generator. ReProver uses a DPR-style retriever restricted to accessible premises plus hard negatives and a ByT5 generator. It trains in ~1 GPU-week and achieves Pass@1 51.2% on the random split and 26.3% on the novel-premises split, outperforming non-retrieval baselines and a GPT-4 zero-shot tactic baseline. All code, data, and models are released.

Problem Statement

LLM-based provers for proof assistants are powerful but hard to reproduce: private code/data, huge compute, and poor premise selection. Proving in Lean needs picking relevant premises from a huge library; LLMs with limited context windows cannot reliably generalize to theorems that require lemmas unseen in training.

Main Contribution

LeanDojo: an open-source toolkit that extracts Lean proof trees, states, and explicit premise definitions and enables reliable programmatic interaction with Lean.

LeanDojo Benchmark: 98,734 theorems (with premise definitions) plus a 'novel_premises' split to stress generalization.

ReProver: a retrieval-augmented prover combining DPR-style retrieval (accessible-premise filtering + in-file hard negatives) with a ByT5 tactic generator.

Open release: code, datasets, and model checkpoints under permissive licenses to allow low-cost reproduction.

Key Findings

Retrieval improves end-to-end proving rates.

NumbersReProver Pass@1 51.2% vs non-retrieval baseline 47.6% (random split)

Retriever (DPR + program analysis) recovers relevant premises much better than BM25.

NumbersR@1 13.5% vs BM25 R@1 6.7% (random)

Program-analysis and hard negatives materially help retrieval.

NumbersAccessible premises average 33.1K vs library 130K; in-file negatives raise MRR to 0.31

LeanDojo interaction is far more reliable than prior tools.

NumbersCorrect-proof misjudgment reduced from 21.1% to 1.4%

Benchmark scale and split expose memorization shortcuts.

Numbers98,734 theorems; novel_premises split forces use of unseen premises (test split) and drops performance

Results

Pass@1 (theorem proving)

Value51.2%

Baselinenon-retrieval generator

Pass@1 (theorem proving)

Value26.3%

Baselinenon-retrieval generator 23.2%

Premise retrieval R@1 / R@10 / MRR

Value13.5% / 38.4% / 0.31

BaselineBM25 6.7% / 17.2% / 0.15

Interaction reliability (proof checking error rate)

Value1.4% errors

Baselinelean-gym 21.1% errors

Dataset size

Value98,734 theorems; 130,262 premises

Who Should Care

What To Try In 7 Days

Install LeanDojo and run the provided example to extract proof states from a small mathlib subset.

Run ReProver inference on a few target theorems to measure Pass@1 and inspect retrieved premises.

Compare BM25 vs DPR retrieval on your Lean codebase to see retrieval gains for premise selection quickly.

Agent Features

Memory

  • retrieval memory (external premise corpus)

Tool Use

  • programmatic interaction with Lean (gym-like interface)
  • ChatGPT plugin for interactive use

Frameworks

  • ByT5
  • Dense Passage Retriever (DPR)

Architectures

  • encoder-decoder Transformer (ByT5)
  • Transformer encoder for DPR

Optimization Features

Token Efficiency

  • input truncated to 2,300 tokens; fits ~10–15 premises
  • ByT5 byte-level model avoids tokenization but increases sequence length

Model Optimization

  • use of ByT5 small checkpoint (299M) to reduce compute
  • dropout on retrieved premises at training (rate 0.5)

System Optimization

  • mixed precision (bfloat16) and DeepSpeed ZeRO Stage 2 used during finetuning

Training Optimization

  • contrastive retriever training with in-batch negatives and in-file hard negatives
  • two-stage train: retriever then generator

Inference Optimization

  • precompute premise embeddings; single forward pass to embed state
  • best-first search with beam-generated tactic candidates

Reproducibility

License

  • Code: MIT; Data: CC BY 2.0; mathlib/Lean dependencies: Apache 2.0

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Input length limits mean only ~10–15 retrieved premises can be fed to the generator.
  • Base model is ByT5-small (299M); stronger LLMs might improve performance but cost more.
  • No reinforcement learning or auxiliary synthetic data used; those could raise results but increase complexity.
  • Benchmark is Lean-centric (mathlib); cross-assistant generalization is limited.

When Not To Use

  • When you need proofs that require fusing hundreds of premises in one step.
  • If your target proof assistant is not Lean (tools tailored to Lean).
  • For high-assurance production verification without further human review.

Failure Modes

  • Retriever misses the ground-truth premise, so generator cannot produce the needed tactic.
  • Model hallucinates tactics that look plausible but fail when run in Lean.
  • Proof environment mismatches can still cause rare misjudgments (~1.4%).

Core Entities

Models

  • google/byt5-small (ByT5)
  • ReProver (retriever + generator)
  • Dense Passage Retriever (DPR)
  • GPT-4 (zero-shot baseline)

Metrics

  • Pass@1
  • R@k (R@1, R@10)
  • MRR

Datasets

  • LeanDojo Benchmark (Lean 3) - 98,734 theorems
  • LeanDojo Benchmark 4 - 102,514 theorems
  • MiniF2F
  • ProofNet

Benchmarks

  • LeanDojo Benchmark
  • LeanDojo Benchmark 4
  • MiniF2F
  • ProofNet