Retrieve similar in-project proofs at each step to boost automated Coq proof synthesis

December 18, 20248 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

0

Authors

Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F. Ferreira, Sorin Lerner, Emily First

Links

Abstract / PDF

Why It Matters For Business

Rango automates more Coq proofs than prior tools, lowering manual proof effort and increasing coverage for projects that use Coq, which can reduce verification costs and time-to-audit.

Summary TLDR

Rango is a tool that speeds up automated Coq proof synthesis by retrieving both lemmas and entire in-project proofs at every proof step and feeding them to a fine-tuned LLM. The authors release CoqStoq (196,929 theorems, 2.2M proof steps). On a curated benchmark, Rango proves 32.0% of theorems—29% more than Tactician—and ablations show proof retrieval alone gives a 47% increase in proven theorems. Rango uses sparse retrieval (BM25), LoRA fine-tuning of a 1.3B LLM, and rollout search; code and data are available on GitHub.

Problem Statement

Formal verification gives strong software guarantees but writing proofs is expensive. Prior LLM approaches help but typically only add premises (lemmas) once. They miss in-project proof patterns and the evolving proof state. Rango adds similar, previously completed proofs step-by-step to the model context so the LLM learns and adapts to local proof strategy.

Main Contribution

Rango: an automated Coq proof synthesizer that retrieves both proofs and lemmas at each proof step.

CoqStoq: a new dataset of 196,929 theorems and 2,225,515 proof steps mined from 2,226 GitHub repos, split into training, validation, and a curated benchmark.

Empirical evaluation and ablations showing Rango proves 32.0% of benchmark theorems and that proof retrieval yields a 47% relative gain.

Public release of Rango, CoqStoq, trained models, and code for reproduction.

Key Findings

Rango proves a larger share of benchmark theorems than prior tools.

Numbers3,325 / 10,396 = 32.0% theorems proven on CoqStoq benchmark

Rango improves over the prior best tool Tactician by a relative margin.

Numbers29% more theorems than Tactician on the benchmark

Adding in-project proofs to context is a major driver of success.

NumbersProof retriever gives a 47% relative increase in proven theorems (ablation on 500 theorems)

Sparse retrieval (BM25) outperformed dense CodeBERT embeddings for proof retrieval.

NumbersBM25: 32.0% vs CodeBERT: 22.0% on CoqStoq benchmark

Results

Theorems proven (CoqStoq benchmark)

Value3,325 / 10,396 = 32.0%

BaselineTactician 2,575 / 10,396 = 24.8%

Ablation: effect of proof retriever

ValueRango 150/500 = 30.0% vs Rango w/o proofs 102/500 = 20.4%

BaselineRango w/o proofs

Retrieval algorithm comparison

ValueBM25 3,325/10,396 = 32.0% vs CodeBERT 2,283/10,396 = 22.0%

BaselineCodeBERT embeddings

Searcher comparison (ablation)

ValueRollout 150/500 = 30.0% vs Best-first (beam) 142/500 = 28.4%

BaselineBest-first (beam)

Who Should Care

What To Try In 7 Days

Run Rango on a small Coq module to measure baseline theorem coverage and time-to-proof.

Compare BM25 vs TF-IDF retrieval on your project; start with BM25.

If you already fine-tune models, apply LoRA to a 1B–1.5B code model and reuse Rango's prompt structure (proofs+lemmas+state).

Agent Features

Memory

  • proof bank (in-project completed proofs)
  • lemma bank (in-project lemma statements)

Planning

  • rollout search (sampled tactic rollouts)

Tool Use

  • Coq proof assistant
  • CoqPyt proof extractor

Frameworks

  • LoRA
  • FSDP (distributed fine-tuning)

Architectures

  • decoder-only LLM (fine-tuned DeepSeek-Coder 1.3B)
  • sparse retriever (BM25/TF-IDF)
  • proof bank + lemma bank storage

Collaboration

  • combines in-project retrieval with LLM generation

Optimization Features

Token Efficiency

  • allocates fixed token budgets (e.g., 1024 proofs, 512 lemmas, 512 theorem+script, 1024 proof state,

Infra Optimization

  • training on 4x A100s; inference on a single RTX 2080

Model Optimization

  • LoRA

System Optimization

  • rollout search avoids heavy bookkeeping of candidates

Training Optimization

  • FSDP for memory-efficient multi-GPU training
  • Adam optimizer with lr 1e-3

Inference Optimization

  • token budget allocations per input to manage context length

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Possible overlap between LLM pretraining data and GitHub benchmark (test contamination risk).
  • Evaluations limited to Coq 8.18; behavior on other proof assistants or Coq versions unknown.
  • Performance drops on files with many external dependencies and on long human-written proofs.
  • Inference required GPU for reasonable throughput; other tools were CPU-bound in comparisons.

When Not To Use

  • Projects incompatible with Coq 8.18 or not compiling with CoqPyt extraction.
  • Tiny projects with no helpful in-project proofs—Rango-PRE (local prefix) may be simpler.
  • When strict reproducibility is required and LLM pretraining contamination is unacceptable.

Failure Modes

  • Retrieval misses: needed lemmas/proofs not retrieved and model stalls or times out.
  • Model hallucination when context lacks relevant facts, producing invalid tactics.
  • Timeouts on hard proofs due to rollout search sampling limits.
  • Degraded performance on files with many dependencies or very long proofs.

Core Entities

Models

  • DeepSeek-Coder 1.3B
  • CodeBERT 125M

Metrics

  • Theorems proven (%)
  • Proof length (# tactics)
  • Edit distance to human proof

Datasets

  • CoqStoq

Benchmarks

  • CoqStoq benchmark
  • CoqGym (prior benchmark)

Context Entities

Models

  • Graph2Tac
  • Tactician
  • Proverbot9001

Metrics

  • Timeout-limited proof success
  • Ablation comparisons

Datasets

  • CompCert (included in benchmark)

Benchmarks

  • Projects compiled for Coq 8.18