Small prompt formatting changes can swing LLM accuracy by tens of points

October 17, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

40

Authors

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, Alane Suhr

Why It Matters For Business

Small, innocuous prompt formatting choices can produce large and unpredictable swings in LLM performance, which can mislead model selection, harm user experience, or produce fragile products unless you test multiple formats.

Summary TLDR

Prompt formatting choices (separators, casing, spacing, and minor punctuation) can change few-shot LLM accuracy dramatically. Across 53 classification tasks and multiple models, the authors find a median spread of about 7.5 accuracy points and extremes of up to 76 points for LLaMA-2-13B. They introduce FORMATSPREAD, a Bayesian bandit method that samples plausible, semantically equivalent prompt formats and cheaply estimates the expected performance interval. It needs no access to model weights, so it also works on API-only models. Practical takeaway: report a spread over formats, or run FORMATSPREAD, when benchmarking or comparing models.

Problem Statement

LLM benchmarks and deployments often report results for a single prompt format. The paper asks two questions: how much do small, meaning-preserving formatting choices (spaces, separators, casing, item numbering) change model performance, and how can that variability be estimated cheaply without access to model weights?
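To make "meaning-preserving formatting choices" concrete, here is a minimal sketch of a toy format grammar. The separators, joiners, casings, and field names below are illustrative assumptions, not the grammar used in the paper, which is richer.

```python
from itertools import product

# Hypothetical atomic choices, chosen only for illustration.
SEPARATORS = [": ", " - ", "\n", ":: "]
FIELD_JOINERS = ["\n", " ", " || "]
CASINGS = [str.title, str.upper, str.lower]


def build_formats():
    """Enumerate every combination of separator, joiner, and casing."""
    return list(product(SEPARATORS, FIELD_JOINERS, CASINGS))


def render(fmt, fields):
    """Render (descriptor, text) pairs under one format.

    Only the descriptor casing and the punctuation/whitespace around it
    change; the field text itself is left untouched.
    """
    separator, joiner, casing = fmt
    return joiner.join(f"{casing(name)}{separator}{text}" for name, text in fields)


FORMATS = build_formats()
EXAMPLE = [("passage", "The sky is blue."), ("answer", "")]
PROMPTS = [render(fmt, EXAMPLE) for fmt in FORMATS]
```

Every string in `PROMPTS` carries the same content, so any accuracy difference between them would come from formatting alone.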

Main Contribution

Systematic study showing large accuracy variance from meaning-preserving prompt formatting across 53 tasks and several LLMs.

FORMATSPREAD: a grammar + Bayesian bandit (Thompson sampling) procedure that samples plausible formats and estimates the performance spread under a compute budget and without model weights.

Analyses of which atomic formatting choices matter, evidence that format choices are identifiable in model embeddings, and empirical guidance on when single-format reporting is misleading.
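The bandit idea behind FORMATSPREAD can be sketched as follows. This is not the authors' implementation: it is a plain Beta-Bernoulli Thompson sampler, run once to look for the best format and once for the worst, with the budget split evenly. The function names, the even split, and the simulated accuracies in `TRUE_ACC` are all assumptions made for illustration; in practice `evaluate` would score one model output on one example.

```python
import random


def thompson_sample(pull, n_arms, budget, rng):
    """Beta-Bernoulli Thompson sampling.

    pull(arm) must return 1 (success) or 0 (failure). Returns the most-pulled
    arm and its posterior mean success rate.
    """
    wins = [1.0] * n_arms    # Beta(1, 1) uniform prior on each arm
    losses = [1.0] * n_arms
    for _ in range(budget):
        draws = [rng.betavariate(wins[a], losses[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=draws.__getitem__)
        if pull(arm):
            wins[arm] += 1.0
        else:
            losses[arm] += 1.0
    pulls = [w + l for w, l in zip(wins, losses)]
    top = max(range(n_arms), key=pulls.__getitem__)
    return top, wins[top] / pulls[top]


def estimate_spread(evaluate, n_formats, budget, seed=0):
    """Estimate best-minus-worst accuracy over formats under a call budget."""
    rng = random.Random(seed)
    half = budget // 2
    # Pass 1: reward a correct answer, so pulls concentrate on the best format.
    best_fmt, best_acc = thompson_sample(evaluate, n_formats, half, rng)
    # Pass 2: reward an incorrect answer, so pulls concentrate on the worst.
    worst_fmt, error_rate = thompson_sample(
        lambda f: 1 - evaluate(f), n_formats, half, rng
    )
    return {"best": best_fmt, "worst": worst_fmt,
            "spread": best_acc - (1.0 - error_rate)}


# Simulated per-format accuracies (synthetic, not results from the paper).
TRUE_ACC = [0.30, 0.55, 0.60, 0.62, 0.85]
_sim_rng = random.Random(1)


def simulated_evaluate(fmt):
    """Stand-in for scoring one model answer under format `fmt`."""
    return 1 if _sim_rng.random() < TRUE_ACC[fmt] else 0


RESULT = estimate_spread(simulated_evaluate, len(TRUE_ACC), budget=4000)
```

The point of the bandit is that most of the budget goes to the formats that determine the spread, the extremes, rather than being spread uniformly over all candidates.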

Key Findings

Formatting can change accuracy by very large amounts.

Numbers: Max spread 76 accuracy points (LLaMA-2-13B)

Typical variability is non-trivial across tasks and models.

Numbers: Median spread ≈ 7.5 accuracy points across sampled formats and models (53 tasks)

FORMATSPREAD finds the true spread efficiently with Bayesian sampling.

Numbers: Thompson sampling within 1 pt of true spread (E=51,200); naive within 4 pts; UCB within 11 pts

Prompt format is visible in model internal embeddings and correlates with spread.

Numbers: Format classifier ≥0.98 accuracy with top 100 PCs; correlation with spread r=0.424 (p=8e-6)

Results

Maximum observed spread

Value: 76 accuracy points

Median spread across tasks (sampled formats)

Value: 7.5 accuracy points

FORMATSPREAD cost example (GPT-3.5)

Value: median spread 6.4 points across 320 formats and 53 tasks; average cost < $10/task

Accuracy of the spread estimate, by search strategy

Value: Thompson sampling within 1 pt of true spread; naive sampling within 4 pts; UCB within 11 pts

LLaMA-2-70B spread (1-shot, 320 formats)

Value: median 0.171 (17.1%), mean 0.221, max 0.876

Who Should Care

What To Try In 7 Days

Run FORMATSPREAD or sample 10–50 plausible prompt formats for your task and report the range of performance.

If comparing models, evaluate each model across the same set of formats or report spread to avoid biased comparisons.

Add a simple format-robustness test to CI: include a few variants of separators/casing and check performance drop.
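A format-robustness check of the kind suggested above might look like the sketch below. Everything here is hypothetical: `toy_predict` is a stand-in for a real model call, the three renderers are arbitrary variants, and the `MAX_SPREAD` threshold is a value you would choose for your own task.

```python
def accuracy(predict, render, examples):
    """Fraction of examples answered correctly under one prompt format."""
    correct = sum(predict(render(text)) == label for text, label in examples)
    return correct / len(examples)


def format_report(predict, renderers, examples):
    """Score every format and summarize the min, max, and spread."""
    scores = {name: accuracy(predict, render, examples)
              for name, render in renderers.items()}
    low, high = min(scores.values()), max(scores.values())
    return {"scores": scores, "min": low, "max": high, "spread": high - low}


def toy_predict(prompt):
    """Toy model that only reads the input when it sees the exact marker."""
    if "Review:" not in prompt:
        return "positive"  # falls back to a constant guess
    return "positive" if "good" in prompt else "negative"


RENDERERS = {
    "colon": lambda t: f"Review: {t}\nLabel:",
    "upper": lambda t: f"REVIEW: {t}\nLABEL:",
    "dash": lambda t: f"Review - {t}\nLabel -",
}
EXAMPLES = [
    ("good film", "positive"),
    ("bad film", "negative"),
    ("good food", "positive"),
    ("awful service", "negative"),
]

MAX_SPREAD = 0.10  # hypothetical tolerance for a CI gate
REPORT = format_report(toy_predict, RENDERERS, EXAMPLES)
PASSES_GATE = REPORT["spread"] <= MAX_SPREAD
```

In CI, a failing `PASSES_GATE` would flag that the reported accuracy depends on one particular format; the same `REPORT` can be computed per model so that comparisons use an identical set of formats.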

Optimization Features

Inference Optimization

  • Used 4-bit quantization to run large model experiments

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Grammar of plausible formats is manually designed and may include unnatural but semantically equivalent strings.
  • Experiments focus on tasks with shorter instructions and classification/short-generation tasks; very long inputs were excluded.
  • FORMATSPREAD relies on sampled formats; reported spread is an estimate and depends on the chosen grammar and sample budget.

When Not To Use

  • If you control the whole system and a single prompt format reliably works in production, single-format evaluation may be sufficient.
  • When compute budget is extremely tight and you cannot spare any extra API or inference calls to sample formats.

Failure Modes

  • Degeneration: some formats cause models to produce no valid answer, skewing exact-match metrics.
  • Non-monotonic format space: local search can fail because small changes do not move smoothly toward better formats.
  • Human-unlikely formats in the grammar can inflate measured spread compared to real-world usage.
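One way to keep degeneration from silently distorting exact-match numbers is to report the invalid-answer rate alongside accuracy. The sketch below assumes a closed label set and simple lowercase/strip normalization; both are illustrative choices, not the paper's evaluation protocol.

```python
def score_with_validity(outputs, golds, valid_labels):
    """Exact-match accuracy plus the share of outputs outside the label set."""
    normalized = [o.strip().lower() for o in outputs]
    n = len(normalized)
    n_valid = sum(o in valid_labels for o in normalized)
    n_correct = sum(o == g for o, g in zip(normalized, golds))
    return {
        "accuracy": n_correct / n,
        "invalid_rate": (n - n_valid) / n,
        # Accuracy restricted to outputs that were valid labels at all.
        "accuracy_on_valid": n_correct / n_valid if n_valid else 0.0,
    }


OUTPUTS = ["Yes", "no", "", "maybe", " yes "]
GOLDS = ["yes", "yes", "no", "no", "yes"]
SCORES = score_with_validity(OUTPUTS, GOLDS, {"yes", "no"})
```

A format with low accuracy and a high `invalid_rate` is failing for a different reason than one that answers validly but incorrectly, and the two cases are worth separating when reading a spread.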

Core Entities

Models

  • LLaMA-2-7B
  • LLaMA-2-13B
  • LLaMA-2-70B
  • Falcon-7B
  • Falcon-7B-Instruct
  • GPT-3.5-Turbo

Metrics

  • Accuracy
  • ROUGE-L
  • BERTScore

Datasets

  • Super-NaturalInstructions (subset of 53 tasks)
  • Instruction Induction (10 generation tasks)

Benchmarks

  • Super-NaturalInstructions