WONDA: turn noisy verifier invariants into compact, verified training data that makes small models match big LLMs for loop-invariant tasks

March 16, 20267 min

Overview

Decision SnapshotReady For Pilot

The method is practical: it builds verified training signals from existing verifier traces, fine-tunes small models, and reports concrete wall-clock gains on a hard benchmark split; ablations show gains come from curation rather than scale.

Citations0

Evidence Strength0.80

Confidence0.85

Risk Signals9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Ido Pinto, Yizhak Yisrael Elboher, Haoze Wu, Nina Narodytska, Guy Katz

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Curating verifier outputs yields compact, verified training data so small models can match large-model verification performance, lowering inference cost and enabling faster, parallel verification pipelines.

Who Should Care

Summary TLDR

Invariant synthesis (finding loop properties that prove correctness) is a major verification bottleneck. WONDA is a two-stage curation pipeline that normalizes verifier-supplied invariants (AST rewrites) and uses an LLM to produce compact candidate invariants which are then formally re-checked. Fine-tuning small language models on WONDA data (V2, 7,283 samples) roughly doubles practical correctness and speedup rates on hard InvBench cases and yields a 4B SLM that matches GPT-OSS-120B and approaches GPT-5.2 on Virtual Best Performance (VBP) on evaluated benchmarks.

Problem Statement

Automated verifiers struggle on hard loop invariants. Off-the-shelf LLMs can propose invariants but often generate noisy, unhelpful or invalid candidates. Training LLMs on raw solver outputs propagates verifier artifacts and limits practical gains. The open problem: how to curate verifier outputs into high-quality training data that leads small models to produce invariants that actually speed up verification.

Main Contribution

Formalize properties of high-quality training invariants: non-degenerate, correct (inductive), useful (speeds up verification), compact.

WONDA: a pipeline that applies AST-based normalization, LLM-driven semantic simplification, and formal re-verification to produce verified, compact invariants.

Key Findings

WONDA-curated dataset size and makeup

Numbers7,283 samples; Grade2=4,516; Grade3=2,767

Practical UseYou can build a compact, labeled dataset (~7k verified samples) that separates 'correct' vs 'provides speedup' examples for fine-tuning.

Evidence RefSection 5, Appendix D (V2 dataset stats)

Fine-tuning a 4B SLM on V2 doubles practical correctness and speedup on hard cases

NumbersCorrect: 22.8%44.4%; Speedup: 11.1%24.7%

Practical UseFine-tune small models on curated invariants to get ~2× more correct and practically useful invariants on hard benchmarks.

Evidence RefTable 3 (Qwen3-4B base vs V2)

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Qwen3-4B-V2 correctness (Hard)44.4%Qwen3-4B base 22.8%+21.6ppInvBench Hard (n=123)Table 3 (Qwen3-4B base vs V2)Table 3
Qwen3-4B-V2 speedup rate (Hard)24.7%Qwen3-4B base 11.1%+13.6ppInvBench Hard (n=123)Table 3 (speedup rates)Table 3

What To Try In 7 Days

Run AST normalization on existing verifier invariants to remove syntactic noise.

Use an LLM to propose compact invariant rewrites and formally re-check each candidate before using for training.

Fine-tune a 4B SLM on the curated (V2) subset and measure VBP against your verifier in a parallel portfolio setup.

Optimization Features

Token Efficiency
Compact invariants reduce token length of targets (mean 15.8 tokens)
Training Optimization
LoRA

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Code URLs

GitHub: WONDA codebase (authors state release)Hugging Face: WONDA collection (trained models and curated datasets)

Data URLs

Hugging Face: WONDA curated dataset (authors state release)

Risks & Boundaries

Limitations

Evaluation used a single backend verifier (UAutomizer); results may vary with other verifiers.

Pipeline assumes C programs and SV-COMP-style tasks; cross-language generality is untested.

When Not To Use

If your verifier already produces compact, high-quality invariants, curation may give little benefit.

When verification tasks are not loop-invariant dominated or are outside the C/SV-COMP domain.

Failure Modes

LLM simplifier may produce candidates that are stronger/weaker or incomparable; must be filtered by verifier.

Fine-tuning on uncurated or syntactically noisy outputs can reduce syntactic validity (observed in V0 experiments).

Core Entities

Models

Qwen3-0.6BQwen3-4BQwen3-8BGPT-OSS-120BGPT-5.2Kimi K2

Metrics

VBP (Virtual Best Performance)VALID / CORRECT / SPEEDUP ratesmean speedup factor (¯S>1)VBP E2E (includes inference latency)

Datasets

V0 (raw UAutomizer invariants)V1 (AST-normalized)V2 (WONDA-curated, 7,283 samples)InvBench programs (training/eval)SV-COMP derived evaluation set (hard/easy splits)

Benchmarks

InvBench (evaluation suite)SV-COMP (program verification cases)