Retrieve similar in-project proofs at each step to boost automated Coq proof synthesis

December 18, 20248 min

Overview

Decision SnapshotNeeds Validation

Evaluation uses a large curated benchmark and multiple ablations; main caveats are possible pretraining overlap and Coq-version differences that limit generality.

Citations0

Evidence Strength0.85

Confidence0.85

Risk Signals11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F. Ferreira, Sorin Lerner, Emily First

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Rango automates more Coq proofs than prior tools, lowering manual proof effort and increasing coverage for projects that use Coq, which can reduce verification costs and time-to-audit.

Who Should Care

Summary TLDR

Rango is a tool that speeds up automated Coq proof synthesis by retrieving both lemmas and entire in-project proofs at every proof step and feeding them to a fine-tuned LLM. The authors release CoqStoq (196,929 theorems, 2.2M proof steps). On a curated benchmark, Rango proves 32.0% of theorems—29% more than Tactician—and ablations show proof retrieval alone gives a 47% increase in proven theorems. Rango uses sparse retrieval (BM25), LoRA fine-tuning of a 1.3B LLM, and rollout search; code and data are available on GitHub.

Problem Statement

Formal verification gives strong software guarantees but writing proofs is expensive. Prior LLM approaches help but typically only add premises (lemmas) once. They miss in-project proof patterns and the evolving proof state. Rango adds similar, previously completed proofs step-by-step to the model context so the LLM learns and adapts to local proof strategy.

Main Contribution

Rango: an automated Coq proof synthesizer that retrieves both proofs and lemmas at each proof step.

CoqStoq: a new dataset of 196,929 theorems and 2,225,515 proof steps mined from 2,226 GitHub repos, split into training, validation, and a curated benchmark.

Key Findings

Rango proves a larger share of benchmark theorems than prior tools.

Numbers3,325 / 10,396 = 32.0% theorems proven on CoqStoq benchmark

Practical UseIf you want an off-the-shelf automated prover for Coq projects, Rango finds substantially more correct proofs on this benchmark than earlier tools.

Evidence RefTable II, Summary

Rango improves over the prior best tool Tactician by a relative margin.

Numbers29% more theorems than Tactician on the benchmark

Practical UseReplacing Tactician with Rango in a verification pipeline can meaningfully increase automated coverage of theorems.

Evidence RefTable II, Takeaway 1

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Theorems proven (CoqStoq benchmark)3,325 / 10,396 = 32.0%Tactician 2,575 / 10,396 = 24.8%≈ +29% relative vs TacticianCoqStoq benchmarkTable IITable II
Ablation: effect of proof retrieverRango 150/500 = 30.0% vs Rango w/o proofs 102/500 = 20.4%Rango w/o proofs≈ +47% relative when using proof retrieverablation set (500 theorems)Table IVTable IV

What To Try In 7 Days

Run Rango on a small Coq module to measure baseline theorem coverage and time-to-proof.

Compare BM25 vs TF-IDF retrieval on your project; start with BM25.

If you already fine-tune models, apply LoRA to a 1B–1.5B code model and reuse Rango's prompt structure (proofs+lemmas+state).

Agent Features

Memory
proof bank (in-project completed proofs)lemma bank (in-project lemma statements)
Planning
rollout search (sampled tactic rollouts)
Tool Use
Coq proof assistantCoqPyt proof extractor
Frameworks
LoRAFSDP (distributed fine-tuning)
Architectures
decoder-only LLM (fine-tuned DeepSeek-Coder 1.3B)sparse retriever (BM25/TF-IDF)proof bank + lemma bank storage
Collaboration
combines in-project retrieval with LLM generation

Optimization Features

Token Efficiency

allocates fixed token budgets (e.g., 1024 proofs, 512 lemmas, 512 theorem+script, 1024 proof state,

Infra Optimization
training on 4x A100s; inference on a single RTX 2080
Model Optimization
LoRA
System Optimization
rollout search avoids heavy bookkeeping of candidates
Training Optimization
FSDP for memory-efficient multi-GPU trainingAdam optimizer with lr 1e-3
Inference Optimization
token budget allocations per input to manage context length

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusYes
LicenseUnknown

Risks & Boundaries

Limitations

Possible overlap between LLM pretraining data and GitHub benchmark (test contamination risk).

Evaluations limited to Coq 8.18; behavior on other proof assistants or Coq versions unknown.

When Not To Use

Projects incompatible with Coq 8.18 or not compiling with CoqPyt extraction.

Tiny projects with no helpful in-project proofs—Rango-PRE (local prefix) may be simpler.

Failure Modes

Retrieval misses: needed lemmas/proofs not retrieved and model stalls or times out.

Model hallucination when context lacks relevant facts, producing invalid tactics.

Core Entities

Models

DeepSeek-Coder 1.3BCodeBERT 125M

Metrics

Theorems proven (%)Proof length (# tactics)Edit distance to human proof

Datasets

CoqStoq

Benchmarks

CoqStoq benchmarkCoqGym (prior benchmark)

Context Entities

Models

Graph2TacTacticianProverbot9001

Metrics

Timeout-limited proof successAblation comparisons

Datasets

CompCert (included in benchmark)

Benchmarks

Projects compiled for Coq 8.18