Small prompt formatting changes can swing LLM accuracy by tens of points

October 17, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Paper demonstrates large, reproducible format sensitivity across many tasks and models, and shows an efficient, practical method (FORMATSPREAD) to estimate expected performance ranges without model weights.

Citations: 40

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, Alane Suhr

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Small, innocuous prompt formatting choices can produce large and unpredictable swings in LLM performance, which can mislead model selection, harm user experience, or produce fragile products unless you test multiple formats.

Who Should Care

Summary TLDR

Prompt formatting choices such as separators, casing, spacing, and minor punctuation can change few-shot LLM accuracy dramatically. Across 53 classification tasks and multiple models, the authors find a median spread of about 7.5 accuracy points, with extremes of up to 76 points for LLaMA-2-13B. They introduce FORMATSPREAD, a Bayesian bandit method that samples plausible, semantically equivalent prompt formats and cheaply estimates the expected performance interval. It needs no model weights, so it also works on API-only models. Practical takeaway: report a spread over formats or run FORMATSPREAD when benchmarking or comparing models.

Problem Statement

LLM benchmarks and applications often rely on a single prompt format. The paper asks: how much do small, meaning-preserving formatting choices (spaces, separators, casing, item numbering) change model performance, and how can that variability be estimated cheaply without access to model weights?
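To make "meaning-preserving formatting choices" concrete, the sketch below enumerates surface variants of one few-shot example. The option sets (`SEPARATORS`, `SPACES`, `CASINGS`) and the helper `enumerate_formats` are illustrative assumptions, not the grammar defined in the paper.

```python
from itertools import product

# Illustrative option sets (assumptions, not the paper's grammar).
SEPARATORS = [":", " -", "::"]
SPACES = [" ", "\n", "\t"]
CASINGS = [str.title, str.upper, str.lower]


def enumerate_formats(field_names):
    """Yield functions that render the same fields in different surface formats."""
    for sep, space, casing in product(SEPARATORS, SPACES, CASINGS):
        # Bind loop variables as defaults so each renderer keeps its own choices.
        def render(values, sep=sep, space=space, casing=casing):
            parts = [f"{casing(name)}{sep}{space}{values[name]}" for name in field_names]
            return "\n".join(parts)
        yield render


formats = list(enumerate_formats(["question", "answer"]))
example = {"question": "Is 7 prime?", "answer": "yes"}
prompts = [fmt(example) for fmt in formats]
print(len(prompts))  # 3 separators x 3 spacings x 3 casings = 27
print(prompts[0])
```

Every rendered prompt carries the same content; only the field decoration differs, which is the kind of variation the paper studies.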

Main Contribution

Systematic study showing large accuracy variance from meaning-preserving prompt formatting across 53 tasks and several LLMs.

FORMATSPREAD: a grammar + Bayesian bandit (Thompson sampling) procedure that samples plausible formats and estimates the performance spread under a compute budget and without model weights.
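The bandit idea can be sketched with Thompson sampling over a fixed set of formats. This is a minimal illustration under stated assumptions, not the authors' implementation: each format is an arm with a Beta(1, 1) prior, `evaluate` is a hypothetical callback that scores one example under one format, and a simulated model stands in for real API calls. Running the search once for the best format and once for the worst is also an assumption of this sketch.

```python
import random


def thompson_extreme(evaluate, n_formats, budget, find_max=True, seed=0):
    """Spend `budget` single-example evaluations searching for the best
    (or worst) format. `evaluate(i)` returns True/False for format i."""
    rng = random.Random(seed)
    pick = max if find_max else min
    wins = [1] * n_formats    # Beta prior alpha per format
    losses = [1] * n_formats  # Beta prior beta per format
    for _ in range(budget):
        # Draw one plausible accuracy per format from its posterior.
        draws = [rng.betavariate(wins[i], losses[i]) for i in range(n_formats)]
        arm = pick(range(n_formats), key=draws.__getitem__)
        if evaluate(arm):
            wins[arm] += 1
        else:
            losses[arm] += 1
    # Recommend the most-evaluated format and report its posterior mean.
    pulls = [wins[i] + losses[i] for i in range(n_formats)]
    chosen = max(range(n_formats), key=pulls.__getitem__)
    return chosen, wins[chosen] / pulls[chosen]


# Simulated stand-in for a model: each format has a hidden true accuracy.
true_acc = [0.20, 0.50, 0.55, 0.60, 0.90]
sim = random.Random(42)
evaluate = lambda i: sim.random() < true_acc[i]

hi_arm, hi_est = thompson_extreme(evaluate, len(true_acc), budget=2000, find_max=True)
lo_arm, lo_est = thompson_extreme(evaluate, len(true_acc), budget=2000, find_max=False)
print(f"best format {hi_arm}, worst format {lo_arm}, estimated spread {hi_est - lo_est:.2f}")
```

Because the sampler concentrates evaluations on formats that look extreme, most of the budget goes to the arms that determine the spread rather than being split evenly across all formats.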

Key Findings

Formatting can change accuracy by very large amounts.

Numbers: Max spread 76 accuracy points (LLaMA-2-13B)

Practical Use: Don't trust a single-format score for a model; test multiple plausible formats or report a range when benchmarking.

Evidence Ref: Abstract; Section 4.2; Table 2

Typical variability is non-trivial across tasks and models.

Numbers: Median spread ≈ 7.5 accuracy points across sampled formats and models (53 tasks)

Practical Use: Expect single-digit accuracy swings to be common; include format spread in evaluations to avoid misleading comparisons.

Evidence Ref: Section 4.2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Maximum observed spread | 76 accuracy points | - | - | LLaMA-2-13B on Super-NaturalInstructions examples | Section 4.2; Abstract | Abstract; Section 4.2 |
| Median spread across tasks (sampled formats) | 7.5 accuracy points | - | - | 53 tasks, multiple models, 10 formats sampled each | Section 4.2 | Section 4.2; Figure 3 |
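For reference, the spread metric in these results is the gap between the best and worst accuracy across formats for a task. The short sketch below computes per-task spreads and their median; the task names and accuracy values are hypothetical placeholders, not numbers from the paper.

```python
from statistics import median


def spread(acc_by_format):
    """Spread = best minus worst accuracy across formats, in accuracy points."""
    values = list(acc_by_format.values())
    return max(values) - min(values)


# Hypothetical per-task accuracies (percent) under three formats.
results = {
    "task_a": {"fmt1": 62.0, "fmt2": 70.5, "fmt3": 66.0},
    "task_b": {"fmt1": 41.0, "fmt2": 44.0, "fmt3": 39.5},
    "task_c": {"fmt1": 80.0, "fmt2": 55.0, "fmt3": 78.0},
}
spreads = {task: spread(accs) for task, accs in results.items()}
median_spread = median(spreads.values())
print(spreads)
print("median spread:", median_spread)
```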

What To Try In 7 Days

Run FORMATSPREAD or sample 10–50 plausible prompt formats for your task and report the range of performance.

If comparing models, evaluate each model across the same set of formats or report spread to avoid biased comparisons.

Add a simple format-robustness test to CI: include a few variants of separators/casing and check performance drop.
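One possible shape for the CI check in the last item is sketched below. The function name `check_format_robustness`, the 5-point threshold, and the stubbed scores are assumptions for illustration; in a real pipeline `score_fn` would run your evaluation set under the named format.

```python
def check_format_robustness(score_fn, formats, max_drop=5.0):
    """Return (passed, report). The check fails if any format scores more
    than `max_drop` accuracy points below the best-scoring format."""
    scores = {name: score_fn(name) for name in formats}
    best = max(scores.values())
    failing = {name: s for name, s in scores.items() if best - s > max_drop}
    return (not failing), {"scores": scores, "failing": failing}


# Stubbed accuracies (percent) standing in for real evaluation runs.
stub_scores = {"colon_title": 71.0, "colon_upper": 69.5, "dash_lower": 58.0}
passed, report = check_format_robustness(stub_scores.__getitem__, stub_scores)
print("passed:", passed)
print("failing formats:", report["failing"])
```

In CI, a failed check could block the release or simply flag the failing formats for review, depending on how strict the team wants the gate to be.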

Optimization Features

Inference Optimization
Used 4-bit quantization to run large model experiments

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Grammar of plausible formats is manually designed and may include unnatural but semantically equivalent strings.

Experiments focus on tasks with shorter instructions and classification/short-generation tasks; very long inputs were excluded.

When Not To Use

If you control the whole system and a single prompt format reliably works in production, single-format evaluation may be sufficient.

When compute budget is extremely tight and you cannot spare any extra API or inference calls to sample formats.

Failure Modes

Degeneration: some formats cause models to produce no valid answer, skewing exact-match metrics.

Non-monotonic format space: local search can fail because small changes do not move smoothly toward better formats.
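To keep degeneration from silently distorting exact-match scores, one option is to report the invalid-answer rate alongside accuracy for each format. The sketch below is a hypothetical helper, not part of the paper's code; the normalization (strip and lowercase) and the label set are assumptions.

```python
def score_with_validity(outputs, golds, valid_labels):
    """Return (accuracy, invalid_rate) for one format.

    An output is invalid if, after normalization, it is not one of the
    allowed labels (for example, an empty or off-topic generation)."""
    normalized = [o.strip().lower() for o in outputs]
    invalid = sum(o not in valid_labels for o in normalized)
    correct = sum(o == g for o, g in zip(normalized, golds))
    n = len(golds)
    return correct / n, invalid / n


# Hypothetical model outputs for a yes/no task under one format.
outputs = ["Yes", " no", "", "maybe?", "yes"]
golds = ["yes", "no", "no", "yes", "yes"]
accuracy, invalid_rate = score_with_validity(outputs, golds, {"yes", "no"})
print(f"accuracy={accuracy:.2f}, invalid_rate={invalid_rate:.2f}")
```

A format with low accuracy and a high invalid rate is failing to elicit answers at all, which is a different problem from answering incorrectly.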

Core Entities

Models

LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B, Falcon-7B, Falcon-7B-Instruct, GPT-3.5-Turbo

Metrics

Accuracy, ROUGE-L, BERTScore

Datasets

Super-NaturalInstructions (subset of 53 tasks), Instruction Induction (10 generation tasks)

Benchmarks

Super-NaturalInstructions