Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Curating verifier outputs yields compact, verified training data so small models can match large-model verification performance, lowering inference cost and enabling faster, parallel verification pipelines.
Summary TLDR
Invariant synthesis (finding loop properties that prove correctness) is a major verification bottleneck. WONDA is a two-stage curation pipeline that normalizes verifier-supplied invariants (AST rewrites) and uses an LLM to produce compact candidate invariants which are then formally re-checked. Fine-tuning small language models on WONDA data (V2, 7,283 samples) roughly doubles practical correctness and speedup rates on hard InvBench cases and yields a 4B SLM that matches GPT-OSS-120B and approaches GPT-5.2 on Virtual Best Performance (VBP) on evaluated benchmarks.
Problem Statement
Automated verifiers struggle on hard loop invariants. Off-the-shelf LLMs can propose invariants but often generate noisy, unhelpful or invalid candidates. Training LLMs on raw solver outputs propagates verifier artifacts and limits practical gains. The open problem: how to curate verifier outputs into high-quality training data that leads small models to produce invariants that actually speed up verification.
Main Contribution
Formalize properties of high-quality training invariants: non-degenerate, correct (inductive), useful (speeds up verification), compact.
WONDA: a pipeline that applies AST-based normalization, LLM-driven semantic simplification, and formal re-verification to produce verified, compact invariants.
Empirical show: fine-tuning SLMs on WONDA (V2) yields consistent, substantial verification gains; a fine-tuned 4B model matches GPT-OSS-120B VBP and approaches GPT-5.2 on evaluated hard cases.
Key Findings
WONDA-curated dataset size and makeup
Fine-tuning a 4B SLM on V2 doubles practical correctness and speedup on hard cases
Small model (4B) matches large-model VBP on evaluated benchmarks
Pipeline stages matter: normalization then simplification yields largest gains
Verified simplifications offer measurable solver time gains
Results
Qwen3-4B-V2 correctness (Hard)
Qwen3-4B-V2 speedup rate (Hard)
Qwen3-4B-V2 VBP
Dataset V2 size and grade split
Who Should Care
What To Try In 7 Days
Run AST normalization on existing verifier invariants to remove syntactic noise.
Use an LLM to propose compact invariant rewrites and formally re-check each candidate before using for training.
Fine-tune a 4B SLM on the curated (V2) subset and measure VBP against your verifier in a parallel portfolio setup.
Optimization Features
Token Efficiency
- Compact invariants reduce token length of targets (mean 15.8 tokens)
Training Optimization
- LoRA
Reproducibility
Code Urls
- GitHub: WONDA codebase (authors state release)
- Hugging Face: WONDA collection (trained models and curated datasets)
Data Urls
- Hugging Face: WONDA curated dataset (authors state release)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation used a single backend verifier (UAutomizer); results may vary with other verifiers.
- Pipeline assumes C programs and SV-COMP-style tasks; cross-language generality is untested.
- Timeouts and baseline-unsolved instances remain hard; WONDA improves but does not eliminate them.
When Not To Use
- If your verifier already produces compact, high-quality invariants, curation may give little benefit.
- When verification tasks are not loop-invariant dominated or are outside the C/SV-COMP domain.
- If you cannot afford formal re-checking of simplified candidates (costly verifier runs).
Failure Modes
- LLM simplifier may produce candidates that are stronger/weaker or incomparable; must be filtered by verifier.
- Fine-tuning on uncurated or syntactically noisy outputs can reduce syntactic validity (observed in V0 experiments).
- Curated data may still encode biases from the original verifier if backend diversity is not introduced.
Core Entities
Models
- Qwen3-0.6B
- Qwen3-4B
- Qwen3-8B
- GPT-OSS-120B
- GPT-5.2
- Kimi K2
Metrics
- VBP (Virtual Best Performance)
- VALID / CORRECT / SPEEDUP rates
- mean speedup factor (¯S>1)
- VBP E2E (includes inference latency)
Datasets
- V0 (raw UAutomizer invariants)
- V1 (AST-normalized)
- V2 (WONDA-curated, 7,283 samples)
- InvBench programs (training/eval)
- SV-COMP derived evaluation set (hard/easy splits)
Benchmarks
- InvBench (evaluation suite)
- SV-COMP (program verification cases)

