WONDA: turn noisy verifier invariants into compact, verified training data that makes small models match big LLMs for loop-invariant tasks

March 16, 20267 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Ido Pinto, Yizhak Yisrael Elboher, Haoze Wu, Nina Narodytska, Guy Katz

Links

Abstract / PDF

Why It Matters For Business

Curating verifier outputs yields compact, verified training data so small models can match large-model verification performance, lowering inference cost and enabling faster, parallel verification pipelines.

Summary TLDR

Invariant synthesis (finding loop properties that prove correctness) is a major verification bottleneck. WONDA is a two-stage curation pipeline that normalizes verifier-supplied invariants (AST rewrites) and uses an LLM to produce compact candidate invariants which are then formally re-checked. Fine-tuning small language models on WONDA data (V2, 7,283 samples) roughly doubles practical correctness and speedup rates on hard InvBench cases and yields a 4B SLM that matches GPT-OSS-120B and approaches GPT-5.2 on Virtual Best Performance (VBP) on evaluated benchmarks.

Problem Statement

Automated verifiers struggle on hard loop invariants. Off-the-shelf LLMs can propose invariants but often generate noisy, unhelpful or invalid candidates. Training LLMs on raw solver outputs propagates verifier artifacts and limits practical gains. The open problem: how to curate verifier outputs into high-quality training data that leads small models to produce invariants that actually speed up verification.

Main Contribution

Formalize properties of high-quality training invariants: non-degenerate, correct (inductive), useful (speeds up verification), compact.

WONDA: a pipeline that applies AST-based normalization, LLM-driven semantic simplification, and formal re-verification to produce verified, compact invariants.

Empirical show: fine-tuning SLMs on WONDA (V2) yields consistent, substantial verification gains; a fine-tuned 4B model matches GPT-OSS-120B VBP and approaches GPT-5.2 on evaluated hard cases.

Key Findings

WONDA-curated dataset size and makeup

Numbers7,283 samples; Grade2=4,516; Grade3=2,767

Fine-tuning a 4B SLM on V2 doubles practical correctness and speedup on hard cases

NumbersCorrect: 22.8% → 44.4%; Speedup: 11.1% → 24.7%

Small model (4B) matches large-model VBP on evaluated benchmarks

NumbersQwen3-4B-V2 VBP 165.5s ≈ GPT-OSS-120B VBP 165.8s; GPT-5.2 VBP 155.6s

Pipeline stages matter: normalization then simplification yields largest gains

Numbers8B correctness: base 23.9% → V1 30.1% → V2 42.8%; speedup grows from 10.8% → 13.0% → 21.7%

Verified simplifications offer measurable solver time gains

NumbersVBP reductions of ~12–20s; Grade3 mean speedup 2.13×, peaks up to 41.39×

Results

Qwen3-4B-V2 correctness (Hard)

Value44.4%

BaselineQwen3-4B base 22.8%

Qwen3-4B-V2 speedup rate (Hard)

Value24.7%

BaselineQwen3-4B base 11.1%

Qwen3-4B-V2 VBP

Value165.5s

BaselineGPT-5.2 VBP 155.6s; GPT-OSS-120B 165.8s

Dataset V2 size and grade split

Value7,283 samples (Grade2: 4,516; Grade3: 2,767)

Who Should Care

What To Try In 7 Days

Run AST normalization on existing verifier invariants to remove syntactic noise.

Use an LLM to propose compact invariant rewrites and formally re-check each candidate before using for training.

Fine-tune a 4B SLM on the curated (V2) subset and measure VBP against your verifier in a parallel portfolio setup.

Optimization Features

Token Efficiency

  • Compact invariants reduce token length of targets (mean 15.8 tokens)

Training Optimization

  • LoRA

Reproducibility

Code Urls

  • GitHub: WONDA codebase (authors state release)
  • Hugging Face: WONDA collection (trained models and curated datasets)

Data Urls

  • Hugging Face: WONDA curated dataset (authors state release)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation used a single backend verifier (UAutomizer); results may vary with other verifiers.
  • Pipeline assumes C programs and SV-COMP-style tasks; cross-language generality is untested.
  • Timeouts and baseline-unsolved instances remain hard; WONDA improves but does not eliminate them.

When Not To Use

  • If your verifier already produces compact, high-quality invariants, curation may give little benefit.
  • When verification tasks are not loop-invariant dominated or are outside the C/SV-COMP domain.
  • If you cannot afford formal re-checking of simplified candidates (costly verifier runs).

Failure Modes

  • LLM simplifier may produce candidates that are stronger/weaker or incomparable; must be filtered by verifier.
  • Fine-tuning on uncurated or syntactically noisy outputs can reduce syntactic validity (observed in V0 experiments).
  • Curated data may still encode biases from the original verifier if backend diversity is not introduced.

Core Entities

Models

  • Qwen3-0.6B
  • Qwen3-4B
  • Qwen3-8B
  • GPT-OSS-120B
  • GPT-5.2
  • Kimi K2

Metrics

  • VBP (Virtual Best Performance)
  • VALID / CORRECT / SPEEDUP rates
  • mean speedup factor (¯S>1)
  • VBP E2E (includes inference latency)

Datasets

  • V0 (raw UAutomizer invariants)
  • V1 (AST-normalized)
  • V2 (WONDA-curated, 7,283 samples)
  • InvBench programs (training/eval)
  • SV-COMP derived evaluation set (hard/easy splits)

Benchmarks

  • InvBench (evaluation suite)
  • SV-COMP (program verification cases)