Overview
Evaluation uses a large curated benchmark and multiple ablations; main caveats are possible pretraining overlap and Coq-version differences that limit generality.
Citations0
Evidence Strength0.85
Confidence0.85
Risk Signals11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Rango automates more Coq proofs than prior tools, lowering manual proof effort and increasing coverage for projects that use Coq, which can reduce verification costs and time-to-audit.
Who Should Care
Summary TLDR
Rango is a tool that speeds up automated Coq proof synthesis by retrieving both lemmas and entire in-project proofs at every proof step and feeding them to a fine-tuned LLM. The authors release CoqStoq (196,929 theorems, 2.2M proof steps). On a curated benchmark, Rango proves 32.0% of theorems—29% more than Tactician—and ablations show proof retrieval alone gives a 47% increase in proven theorems. Rango uses sparse retrieval (BM25), LoRA fine-tuning of a 1.3B LLM, and rollout search; code and data are available on GitHub.
Problem Statement
Formal verification gives strong software guarantees but writing proofs is expensive. Prior LLM approaches help but typically only add premises (lemmas) once. They miss in-project proof patterns and the evolving proof state. Rango adds similar, previously completed proofs step-by-step to the model context so the LLM learns and adapts to local proof strategy.
Main Contribution
Rango: an automated Coq proof synthesizer that retrieves both proofs and lemmas at each proof step.
CoqStoq: a new dataset of 196,929 theorems and 2,225,515 proof steps mined from 2,226 GitHub repos, split into training, validation, and a curated benchmark.
Key Findings
Rango proves a larger share of benchmark theorems than prior tools.
Rango improves over the prior best tool Tactician by a relative margin.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Theorems proven (CoqStoq benchmark) | 3,325 / 10,396 = 32.0% | Tactician 2,575 / 10,396 = 24.8% | ≈ +29% relative vs Tactician | CoqStoq benchmark | Table II | Table II |
| Ablation: effect of proof retriever | Rango 150/500 = 30.0% vs Rango w/o proofs 102/500 = 20.4% | Rango w/o proofs | ≈ +47% relative when using proof retriever | ablation set (500 theorems) | Table IV | Table IV |
What To Try In 7 Days
Run Rango on a small Coq module to measure baseline theorem coverage and time-to-proof.
Compare BM25 vs TF-IDF retrieval on your project; start with BM25.
If you already fine-tune models, apply LoRA to a 1B–1.5B code model and reuse Rango's prompt structure (proofs+lemmas+state).
Agent Features
Memory
Planning
Tool Use
Frameworks
Architectures
Collaboration
Optimization Features
Token Efficiency
allocates fixed token budgets (e.g., 1024 proofs, 512 lemmas, 512 theorem+script, 1024 proof state,
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Possible overlap between LLM pretraining data and GitHub benchmark (test contamination risk).
Evaluations limited to Coq 8.18; behavior on other proof assistants or Coq versions unknown.
When Not To Use
Projects incompatible with Coq 8.18 or not compiling with CoqPyt extraction.
Tiny projects with no helpful in-project proofs—Rango-PRE (local prefix) may be simpler.
Failure Modes
Retrieval misses: needed lemmas/proofs not retrieved and model stalls or times out.
Model hallucination when context lacks relevant facts, producing invalid tactics.

