Overview
The method is practical: it builds verified training signals from existing verifier traces, fine-tunes small models, and reports concrete wall-clock gains on a hard benchmark split; ablations show gains come from curation rather than scale.
Citations0
Evidence Strength0.80
Confidence0.85
Risk Signals9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Curating verifier outputs yields compact, verified training data so small models can match large-model verification performance, lowering inference cost and enabling faster, parallel verification pipelines.
Who Should Care
Summary TLDR
Invariant synthesis (finding loop properties that prove correctness) is a major verification bottleneck. WONDA is a two-stage curation pipeline that normalizes verifier-supplied invariants (AST rewrites) and uses an LLM to produce compact candidate invariants which are then formally re-checked. Fine-tuning small language models on WONDA data (V2, 7,283 samples) roughly doubles practical correctness and speedup rates on hard InvBench cases and yields a 4B SLM that matches GPT-OSS-120B and approaches GPT-5.2 on Virtual Best Performance (VBP) on evaluated benchmarks.
Problem Statement
Automated verifiers struggle on hard loop invariants. Off-the-shelf LLMs can propose invariants but often generate noisy, unhelpful or invalid candidates. Training LLMs on raw solver outputs propagates verifier artifacts and limits practical gains. The open problem: how to curate verifier outputs into high-quality training data that leads small models to produce invariants that actually speed up verification.
Main Contribution
Formalize properties of high-quality training invariants: non-degenerate, correct (inductive), useful (speeds up verification), compact.
WONDA: a pipeline that applies AST-based normalization, LLM-driven semantic simplification, and formal re-verification to produce verified, compact invariants.
Key Findings
WONDA-curated dataset size and makeup
Fine-tuning a 4B SLM on V2 doubles practical correctness and speedup on hard cases
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Qwen3-4B-V2 correctness (Hard) | 44.4% | Qwen3-4B base 22.8% | +21.6pp | InvBench Hard (n=123) | Table 3 (Qwen3-4B base vs V2) | Table 3 |
| Qwen3-4B-V2 speedup rate (Hard) | 24.7% | Qwen3-4B base 11.1% | +13.6pp | InvBench Hard (n=123) | Table 3 (speedup rates) | Table 3 |
What To Try In 7 Days
Run AST normalization on existing verifier invariants to remove syntactic noise.
Use an LLM to propose compact invariant rewrites and formally re-check each candidate before using for training.
Fine-tune a 4B SLM on the curated (V2) subset and measure VBP against your verifier in a parallel portfolio setup.
Optimization Features
Token Efficiency
Training Optimization
Reproducibility
Code URLs
Data URLs
Risks & Boundaries
Limitations
Evaluation used a single backend verifier (UAutomizer); results may vary with other verifiers.
Pipeline assumes C programs and SV-COMP-style tasks; cross-language generality is untested.
When Not To Use
If your verifier already produces compact, high-quality invariants, curation may give little benefit.
When verification tasks are not loop-invariant dominated or are outside the C/SV-COMP domain.
Failure Modes
LLM simplifier may produce candidates that are stronger/weaker or incomparable; must be filtered by verifier.
Fine-tuning on uncurated or syntactically noisy outputs can reduce syntactic validity (observed in V0 experiments).

