Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Rango automates more Coq proofs than prior tools, lowering manual proof effort and increasing coverage for projects that use Coq, which can reduce verification costs and time-to-audit.
Summary TLDR
Rango is a tool that speeds up automated Coq proof synthesis by retrieving both lemmas and entire in-project proofs at every proof step and feeding them to a fine-tuned LLM. The authors release CoqStoq (196,929 theorems, 2.2M proof steps). On a curated benchmark, Rango proves 32.0% of theorems—29% more than Tactician—and ablations show proof retrieval alone gives a 47% increase in proven theorems. Rango uses sparse retrieval (BM25), LoRA fine-tuning of a 1.3B LLM, and rollout search; code and data are available on GitHub.
Problem Statement
Formal verification gives strong software guarantees but writing proofs is expensive. Prior LLM approaches help but typically only add premises (lemmas) once. They miss in-project proof patterns and the evolving proof state. Rango adds similar, previously completed proofs step-by-step to the model context so the LLM learns and adapts to local proof strategy.
Main Contribution
Rango: an automated Coq proof synthesizer that retrieves both proofs and lemmas at each proof step.
CoqStoq: a new dataset of 196,929 theorems and 2,225,515 proof steps mined from 2,226 GitHub repos, split into training, validation, and a curated benchmark.
Empirical evaluation and ablations showing Rango proves 32.0% of benchmark theorems and that proof retrieval yields a 47% relative gain.
Public release of Rango, CoqStoq, trained models, and code for reproduction.
Key Findings
Rango proves a larger share of benchmark theorems than prior tools.
Rango improves over the prior best tool Tactician by a relative margin.
Adding in-project proofs to context is a major driver of success.
Sparse retrieval (BM25) outperformed dense CodeBERT embeddings for proof retrieval.
Results
Theorems proven (CoqStoq benchmark)
Ablation: effect of proof retriever
Retrieval algorithm comparison
Searcher comparison (ablation)
Who Should Care
What To Try In 7 Days
Run Rango on a small Coq module to measure baseline theorem coverage and time-to-proof.
Compare BM25 vs TF-IDF retrieval on your project; start with BM25.
If you already fine-tune models, apply LoRA to a 1B–1.5B code model and reuse Rango's prompt structure (proofs+lemmas+state).
Agent Features
Memory
- proof bank (in-project completed proofs)
- lemma bank (in-project lemma statements)
Planning
- rollout search (sampled tactic rollouts)
Tool Use
- Coq proof assistant
- CoqPyt proof extractor
Frameworks
- LoRA
- FSDP (distributed fine-tuning)
Architectures
- decoder-only LLM (fine-tuned DeepSeek-Coder 1.3B)
- sparse retriever (BM25/TF-IDF)
- proof bank + lemma bank storage
Collaboration
- combines in-project retrieval with LLM generation
Optimization Features
Token Efficiency
- allocates fixed token budgets (e.g., 1024 proofs, 512 lemmas, 512 theorem+script, 1024 proof state,
Infra Optimization
- training on 4x A100s; inference on a single RTX 2080
Model Optimization
- LoRA
System Optimization
- rollout search avoids heavy bookkeeping of candidates
Training Optimization
- FSDP for memory-efficient multi-GPU training
- Adam optimizer with lr 1e-3
Inference Optimization
- token budget allocations per input to manage context length
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Possible overlap between LLM pretraining data and GitHub benchmark (test contamination risk).
- Evaluations limited to Coq 8.18; behavior on other proof assistants or Coq versions unknown.
- Performance drops on files with many external dependencies and on long human-written proofs.
- Inference required GPU for reasonable throughput; other tools were CPU-bound in comparisons.
When Not To Use
- Projects incompatible with Coq 8.18 or not compiling with CoqPyt extraction.
- Tiny projects with no helpful in-project proofs—Rango-PRE (local prefix) may be simpler.
- When strict reproducibility is required and LLM pretraining contamination is unacceptable.
Failure Modes
- Retrieval misses: needed lemmas/proofs not retrieved and model stalls or times out.
- Model hallucination when context lacks relevant facts, producing invalid tactics.
- Timeouts on hard proofs due to rollout search sampling limits.
- Degraded performance on files with many dependencies or very long proofs.
Core Entities
Models
- DeepSeek-Coder 1.3B
- CodeBERT 125M
Metrics
- Theorems proven (%)
- Proof length (# tactics)
- Edit distance to human proof
Datasets
- CoqStoq
Benchmarks
- CoqStoq benchmark
- CoqGym (prior benchmark)
Context Entities
Models
- Graph2Tac
- Tactician
- Proverbot9001
Metrics
- Timeout-limited proof success
- Ablation comparisons
Datasets
- CompCert (included in benchmark)
Benchmarks
- Projects compiled for Coq 8.18

