Retrieve similar in-project proofs at each step to boost automated Coq proof synthesis

Overview

Decision SnapshotNeeds Validation

Evaluation uses a large curated benchmark and multiple ablations; main caveats are possible pretraining overlap and Coq-version differences that limit generality.

Citations0

Evidence Strength0.85

Confidence0.85

Risk Signals11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Kyle Thompson, Nuno Saavedra, Pedro Carrott, Kevin Fisher, Alex Sanchez-Stern, Yuriy Brun, João F. Ferreira, Sorin Lerner, Emily First

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Rango automates more Coq proofs than prior tools, lowering manual proof effort and increasing coverage for projects that use Coq, which can reduce verification costs and time-to-audit.

Who Should Care

ML Engineer Engineering Lead Product Manager Data Scientist

Summary TLDR

Rango is a tool that speeds up automated Coq proof synthesis by retrieving both lemmas and entire in-project proofs at every proof step and feeding them to a fine-tuned LLM. The authors release CoqStoq (196,929 theorems, 2.2M proof steps). On a curated benchmark, Rango proves 32.0% of theorems—29% more than Tactician—and ablations show proof retrieval alone gives a 47% increase in proven theorems. Rango uses sparse retrieval (BM25), LoRA fine-tuning of a 1.3B LLM, and rollout search; code and data are available on GitHub.

Problem Statement

Formal verification gives strong software guarantees but writing proofs is expensive. Prior LLM approaches help but typically only add premises (lemmas) once. They miss in-project proof patterns and the evolving proof state. Rango adds similar, previously completed proofs step-by-step to the model context so the LLM learns and adapts to local proof strategy.

Main Contribution

Rango: an automated Coq proof synthesizer that retrieves both proofs and lemmas at each proof step.

CoqStoq: a new dataset of 196,929 theorems and 2,225,515 proof steps mined from 2,226 GitHub repos, split into training, validation, and a curated benchmark.

Key Findings

Rango proves a larger share of benchmark theorems than prior tools.

Numbers3,325 / 10,396 = 32.0% theorems proven on CoqStoq benchmark

Practical UseIf you want an off-the-shelf automated prover for Coq projects, Rango finds substantially more correct proofs on this benchmark than earlier tools.

Evidence RefTable II, Summary

Rango improves over the prior best tool Tactician by a relative margin.

Numbers29% more theorems than Tactician on the benchmark

Practical UseReplacing Tactician with Rango in a verification pipeline can meaningfully increase automated coverage of theorems.

Evidence RefTable II, Takeaway 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Theorems proven (CoqStoq benchmark)	3,325 / 10,396 = 32.0%	Tactician 2,575 / 10,396 = 24.8%	≈ +29% relative vs Tactician	CoqStoq benchmark	Table II	Table II
Ablation: effect of proof retriever	Rango 150/500 = 30.0% vs Rango w/o proofs 102/500 = 20.4%	Rango w/o proofs	≈ +47% relative when using proof retriever	ablation set (500 theorems)	Table IV	Table IV

What To Try In 7 Days

Run Rango on a small Coq module to measure baseline theorem coverage and time-to-proof.

Compare BM25 vs TF-IDF retrieval on your project; start with BM25.

If you already fine-tune models, apply LoRA to a 1B–1.5B code model and reuse Rango's prompt structure (proofs+lemmas+state).

Agent Features

Memory

proof bank (in-project completed proofs)lemma bank (in-project lemma statements)

Planning

rollout search (sampled tactic rollouts)

Tool Use

Coq proof assistantCoqPyt proof extractor

Frameworks

LoRAFSDP (distributed fine-tuning)

Architectures

decoder-only LLM (fine-tuned DeepSeek-Coder 1.3B)sparse retriever (BM25/TF-IDF)proof bank + lemma bank storage

Collaboration

combines in-project retrieval with LLM generation

Optimization Features

Token Efficiency

allocates fixed token budgets (e.g., 1024 proofs, 512 lemmas, 512 theorem+script, 1024 proof state,

Infra Optimization

training on 4x A100s; inference on a single RTX 2080

Model Optimization

LoRA

System Optimization

rollout search avoids heavy bookkeeping of candidates

Training Optimization

FSDP for memory-efficient multi-GPU trainingAdam optimizer with lr 1e-3

Inference Optimization

token budget allocations per input to manage context length

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusYes

LicenseUnknown

Code URLs

https://github.com/rkthomps/coq-modeling

Data URLs

https://github.com/rkthomps/coq-modeling (CoqStoq release referenced)

Risks & Boundaries

Limitations

Possible overlap between LLM pretraining data and GitHub benchmark (test contamination risk).

Evaluations limited to Coq 8.18; behavior on other proof assistants or Coq versions unknown.

When Not To Use

Projects incompatible with Coq 8.18 or not compiling with CoqPyt extraction.

Tiny projects with no helpful in-project proofs—Rango-PRE (local prefix) may be simpler.

Failure Modes

Retrieval misses: needed lemmas/proofs not retrieved and model stalls or times out.

Model hallucination when context lacks relevant facts, producing invalid tactics.

Core Entities

Models

DeepSeek-Coder 1.3BCodeBERT 125M

Metrics

Theorems proven (%)Proof length (# tactics)Edit distance to human proof

Datasets

CoqStoq

Benchmarks

CoqStoq benchmarkCoqGym (prior benchmark)

Context Entities

Models

Graph2TacTacticianProverbot9001

Metrics

Timeout-limited proof successAblation comparisons

Datasets

CompCert (included in benchmark)

Benchmarks

Projects compiled for Coq 8.18

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Rango proves a larger share of benchmark theorems than prior tools.

Rango improves over the prior best tool Tactician by a relative margin.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

Context Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

MTRAG: a human-made benchmark of multi-turn RAG conversations that stresses retrieval, unanswerables, and later-turn context.

Key finding

Atomic fact-checking for medical RAG LLMs boosts factuality and traceability

Key finding

Build query-specific evidence graphs on the fly to fix missing links and filter distractor facts

Key finding

RAGLAB — an open, modular toolkit to reproduce, compare and develop RAG algorithms fairly

Key finding

InsQABench: a Chinese insurance QA benchmark plus SQL-ReAct and RAG-ReAct methods

Key finding