Small prompt formatting changes can swing LLM accuracy by tens of points

Overview

Decision Snapshot: Ready For Pilot

The paper demonstrates large, reproducible format sensitivity across many tasks and models, and presents an efficient, practical method (FORMATSPREAD) for estimating expected performance ranges without access to model weights.

Citations: 40

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, Alane Suhr

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Small, innocuous prompt formatting choices can produce large and unpredictable swings in LLM performance, which can mislead model selection, harm user experience, or produce fragile products unless you test multiple formats.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

Prompt formatting choices such as separators, casing, spacing, and minor punctuation can change few-shot LLM accuracy dramatically. Across 53 classification tasks and multiple models, the authors find a median spread of about 7.5 accuracy points and extremes of up to 76 points for LLaMA-2-13B. They introduce FORMATSPREAD, a Bayesian bandit method that samples plausible, semantically equivalent prompt formats and cheaply estimates the expected performance interval; because it needs no model weights, it also works with API-only models. Practical takeaway: report a spread over formats, or run FORMATSPREAD, when benchmarking or comparing models.

Problem Statement

LLM benchmarks and deployments often rely on a single prompt format. The paper asks: how much do small, meaning-preserving formatting choices (spaces, separators, casing, item numbering) change model performance, and how can that variability be estimated cheaply without access to model weights?

Main Contribution

Systematic study showing large accuracy variance from meaning-preserving prompt formatting across 53 tasks and several LLMs.

FORMATSPREAD: a grammar + Bayesian bandit (Thompson sampling) procedure that samples plausible formats and estimates the performance spread under a compute budget and without model weights.
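The bandit idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain Beta-Bernoulli Thompson sampler in which each arm is one prompt format and each pull scores a single example as correct or incorrect. The function names, the simulated `true_acc` values, and the budget are all hypothetical; in practice `simulated_pull` would be replaced by a call that evaluates one example under the chosen format.

```python
import random

def thompson_extreme(pull, n_formats, budget, rng, maximize=True):
    """Beta-Bernoulli Thompson sampling over prompt formats.

    Returns (index, posterior mean accuracy) of the format estimated to be
    the best (maximize=True) or the worst (maximize=False).
    """
    pick = max if maximize else min
    successes = [1.0] * n_formats  # Beta(1, 1) uniform prior per format
    failures = [1.0] * n_formats
    for _ in range(budget):
        # Draw one plausible accuracy per format from its posterior.
        draws = [rng.betavariate(successes[i], failures[i])
                 for i in range(n_formats)]
        arm = pick(range(n_formats), key=draws.__getitem__)
        # Spend one evaluation on the format that currently looks extreme.
        if pull(arm):
            successes[arm] += 1
        else:
            failures[arm] += 1
    means = [s / (s + f) for s, f in zip(successes, failures)]
    idx = pick(range(n_formats), key=means.__getitem__)
    return idx, means[idx]

# Simulated stand-in for "score one example under format `arm`".
# These accuracies are made up for illustration.
true_acc = [0.15, 0.50, 0.85]
rng = random.Random(0)

def simulated_pull(arm):
    return rng.random() < true_acc[arm]

best, hi = thompson_extreme(simulated_pull, 3, 3000, rng, maximize=True)
worst, lo = thompson_extreme(simulated_pull, 3, 3000, rng, maximize=False)
print(f"estimated spread: {hi - lo:.2f} (format {worst} vs. format {best})")
```

Running the sampler twice, once for the maximum and once for the minimum, concentrates the evaluation budget on the formats that matter for the spread instead of scoring every format on every example.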

Key Findings

Formatting can change accuracy by very large amounts.

Numbers: Max spread 76 accuracy points (LLaMA-2-13B)

Practical Use: Don't trust a single-format score for a model; test multiple plausible formats or report a range when benchmarking.

Evidence Ref: Abstract; Section 4.2; Table 2

Typical variability is non-trivial across tasks and models.

Numbers: Median spread ≈ 7.5 accuracy points across sampled formats and models (53 tasks)

Practical Use: Expect single-digit accuracy swings commonly; include format spread in evaluations to avoid misleading comparisons.

Evidence Ref: Section 4.2
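As a small worked example of how the two statistics above are defined, the sketch below computes a per-task spread (best format minus worst format) and then the median across tasks. The task names and scores are placeholders invented for illustration, not results from the paper.

```python
from statistics import median

def task_spread(format_scores):
    """Spread for one task: best-format score minus worst-format score."""
    return max(format_scores) - min(format_scores)

# Placeholder accuracy points (0-100), one entry per sampled format.
scores_by_task = {
    "task_a": [52, 61, 58, 49],
    "task_b": [80, 81, 79],
    "task_c": [30, 72, 55],
}

spreads = {task: task_spread(s) for task, s in scores_by_task.items()}
median_spread = median(spreads.values())
max_spread = max(spreads.values())
print(spreads, median_spread, max_spread)
```

Reporting both numbers matters because the median describes the typical task while the maximum exposes the worst case.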

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Maximum observed spread	76 accuracy points	—	—	LLaMA-2-13B on Super-NaturalInstructions examples	Section 4.2; Abstract	Abstract; Section 4.2
Median spread across tasks (sampled formats)	7.5 accuracy points	—	—	53 tasks, multiple models, 10 formats sampled each	Section 4.2	Section 4.2; Figure 3

What To Try In 7 Days

Run FORMATSPREAD or sample 10–50 plausible prompt formats for your task and report the range of performance.

If comparing models, evaluate each model across the same set of formats or report spread to avoid biased comparisons.

Add a simple format-robustness test to CI: include a few variants of separators/casing and check performance drop.
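The steps above can be sketched as follows. This is an assumption-laden illustration, not the FORMATSPREAD grammar: the separator, casing, and field-join choices are hypothetical examples of meaning-preserving variations, and `within_spread_budget` is a made-up helper name for a CI-style check.

```python
from itertools import product

# Hypothetical meaning-preserving choices; the paper's grammar is richer.
SEPARATORS = [": ", " - ", ":\n", "\t"]
CASINGS = [str.title, str.upper, str.lower]
FIELD_JOINS = ["\n", " || ", "\n\n"]

# Every combination is one candidate prompt format.
FORMATS = list(product(SEPARATORS, CASINGS, FIELD_JOINS))

def render(fmt, fields):
    """Render (name, value) pairs as a prompt under one format."""
    sep, casing, join = fmt
    return join.join(f"{casing(name)}{sep}{value}" for name, value in fields)

def within_spread_budget(scores, max_spread):
    """CI-style check: fail if scores across formats differ too much."""
    return (max(scores) - min(scores)) <= max_spread

example = render((": ", str.upper, "\n"),
                 [("passage", "abc"), ("answer", "")])
print(repr(example))
```

In a CI job, the scores passed to `within_spread_budget` would come from evaluating the same examples under a handful of entries from `FORMATS`; the threshold is a project-specific choice.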

Optimization Features

Inference Optimization

Used 4-bit quantization to run large model experiments

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/msclar/formatspread

Data URLs

https://github.com/allenai/natural-instructions (Super-NaturalInstructions)

Risks & Boundaries

Limitations

Grammar of plausible formats is manually designed and may include unnatural but semantically equivalent strings.

Experiments focus on tasks with shorter instructions and classification/short-generation tasks; very long inputs were excluded.

When Not To Use

If you control the whole system and a single prompt format reliably works in production, single-format evaluation may be sufficient.

When compute budget is extremely tight and you cannot spare any extra API or inference calls to sample formats.

Failure Modes

Degeneration: some formats cause models to produce no valid answer, skewing exact-match metrics.

Non-monotonic format space: local search can fail because small changes do not move smoothly toward better formats.
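One way to keep the degeneration failure mode visible is to report the share of invalid outputs alongside accuracy. The sketch below is a minimal illustration under the assumption of a classification task with a fixed label set; the function name and returned keys are hypothetical.

```python
def score_outputs(outputs, golds, valid_labels):
    """Exact-match accuracy plus the rate of outputs that are not valid labels."""
    n = len(golds)
    cleaned = [o.strip() for o in outputs]
    n_valid = sum(c in valid_labels for c in cleaned)
    n_correct = sum(c == g for c, g in zip(cleaned, golds))
    return {
        "accuracy": n_correct / n,
        # Outputs that are empty or off-label count as degenerate.
        "degenerate_rate": (n - n_valid) / n,
        # Accuracy restricted to outputs that were valid labels at all.
        "accuracy_on_valid": n_correct / n_valid if n_valid else None,
    }

report = score_outputs(
    outputs=["yes", "no", "", "maybe so", "yes "],
    golds=["yes", "yes", "no", "no", "yes"],
    valid_labels={"yes", "no"},
)
print(report)
```

A format with low accuracy and a high degenerate rate is failing differently from one that answers validly but incorrectly, and the two cases call for different fixes.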

Core Entities

Models

LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B, Falcon-7B, Falcon-7B-Instruct, GPT-3.5-Turbo

Metrics

Accuracy, ROUGE-L, BERTScore

Datasets

Super-NaturalInstructions (subset of 53 tasks), Instruction Induction (10 generation tasks)

Benchmarks

Super-NaturalInstructions
