Overview
The paper demonstrates large, reproducible format sensitivity across many tasks and models, and presents FORMATSPREAD, an efficient, practical method for estimating the expected performance range without access to model weights.
Citations: 40
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Small, innocuous prompt formatting choices can produce large and unpredictable swings in LLM performance, which can mislead model selection, harm user experience, or produce fragile products unless you test multiple formats.
Summary TLDR
Prompt formatting (separators, casing, spacing, and minor punctuation) can change few-shot LLM accuracy dramatically. Across 53 classification tasks and multiple models, the authors find a median spread of 7.5 accuracy points over 10 sampled formats, with extremes of up to 76 points for LLaMA-2-13B. They introduce FORMATSPREAD, a Bayesian bandit method that samples plausible, semantically equivalent prompt formats and cheaply estimates the expected performance interval; because it needs no model weights, it also works with API-only models. Practical takeaway: report a spread over formats, or run FORMATSPREAD, when benchmarking or comparing models.
Problem Statement
LLM benchmarks and applications often report results for a single prompt format. The paper asks two questions: how much do small, meaning-preserving formatting choices (spaces, separators, casing, item numbering) change model performance, and how can that variability be estimated cheaply without access to model weights?
Main Contribution
Systematic study showing large accuracy variance from meaning-preserving prompt formatting across 53 tasks and several LLMs.
FORMATSPREAD: a grammar + Bayesian bandit (Thompson sampling) procedure that samples plausible formats and estimates the performance spread under a compute budget and without model weights.
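The bandit idea behind the second contribution can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each format's accuracy is modeled with a Beta(1, 1) prior over per-example correctness, that the budget is split evenly between a search for the best format and a search for the worst one, and that `evaluate` is a hypothetical callback that runs one format on one example and reports whether the answer was correct.

```python
import random
from typing import Callable, Optional, Tuple


def thompson_extreme(
    n_formats: int,
    evaluate: Callable[[int], bool],
    budget: int,
    find_max: bool,
    rng: random.Random,
) -> Tuple[int, float]:
    """Search for the best (or worst) format with Beta-Bernoulli Thompson sampling.

    evaluate(i) is a hypothetical callback: it scores format i on one example
    and returns True when the model's answer is correct.
    """
    wins = [1.0] * n_formats    # alpha parameters of the Beta(1, 1) priors
    losses = [1.0] * n_formats  # beta parameters of the Beta(1, 1) priors
    pick = max if find_max else min

    for _ in range(budget):
        # Draw one plausible accuracy per format from its current posterior,
        # then spend the next evaluation on the most extreme draw.
        draws = [rng.betavariate(wins[i], losses[i]) for i in range(n_formats)]
        arm = pick(range(n_formats), key=draws.__getitem__)
        if evaluate(arm):
            wins[arm] += 1
        else:
            losses[arm] += 1

    means = [w / (w + l) for w, l in zip(wins, losses)]
    chosen = pick(range(n_formats), key=means.__getitem__)
    return chosen, means[chosen]


def estimate_spread(
    n_formats: int,
    evaluate: Callable[[int], bool],
    budget: int,
    rng: Optional[random.Random] = None,
) -> Tuple[int, int, float]:
    """Return (best format, worst format, estimated accuracy spread)."""
    if rng is None:
        rng = random.Random(0)
    half = budget // 2  # assumption: even split between the two searches
    best, high = thompson_extreme(n_formats, evaluate, half, True, rng)
    worst, low = thompson_extreme(n_formats, evaluate, budget - half, False, rng)
    return best, worst, high - low
```

Because the sampler concentrates evaluations on formats that look extreme, most of the budget goes to the candidates that determine the spread rather than being spread uniformly over every format.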
Key Findings
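Formatting can change accuracy by very large amounts: the largest observed spread is 76 accuracy points (LLaMA-2-13B on Super-NaturalInstructions).
Typical variability is non-trivial: the median spread is 7.5 accuracy points across 53 tasks and multiple models, with 10 formats sampled per task.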
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Maximum observed spread | 76 accuracy points | — | — | LLaMA-2-13B on Super-NaturalInstructions examples | Section 4.2; Abstract | Abstract; Section 4.2 |
| Median spread across tasks (sampled formats) | 7.5 accuracy points | — | — | 53 tasks, multiple models, 10 formats sampled each | Section 4.2 | Section 4.2; Figure 3 |
What To Try In 7 Days
Run FORMATSPREAD or sample 10–50 plausible prompt formats for your task and report the range of performance.
If comparing models, evaluate each model across the same set of formats or report spread to avoid biased comparisons.
Add a simple format-robustness test to CI: include a few variants of separators/casing and check performance drop.
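The steps above can be sketched as a small helper that enumerates format variants and checks the resulting spread. The separator, casing, and spacing lists are illustrative assumptions rather than the paper's grammar, and `max_spread` is a hypothetical threshold you would tune for your task; the per-format accuracies are expected to come from your own evaluation harness.

```python
import itertools
from typing import Callable, Dict, List, Sequence

# Illustrative, hand-picked variations (assumptions, not the paper's grammar).
SEPARATORS: List[str] = [": ", " - ", ":\n", " :: "]
CASINGS: List[Callable[[str], str]] = [str.title, str.upper, str.lower]
FIELD_SPACERS: List[str] = ["\n", " ", "\n\n"]


def render(fields: Dict[str, str], separator: str,
           casing: Callable[[str], str], spacer: str) -> str:
    """Render one prompt; only field names are re-cased, values stay untouched."""
    return spacer.join(
        f"{casing(name)}{separator}{value}" for name, value in fields.items()
    )


def format_variants(fields: Dict[str, str]) -> List[str]:
    """Enumerate every combination of separator, casing, and field spacing."""
    return [
        render(fields, sep, casing, spacer)
        for sep, casing, spacer in itertools.product(
            SEPARATORS, CASINGS, FIELD_SPACERS
        )
    ]


def spread(accuracies: Sequence[float]) -> float:
    """Spread = best accuracy minus worst accuracy over the tested formats."""
    return max(accuracies) - min(accuracies)


def check_robustness(accuracies: Sequence[float], max_spread: float = 0.05) -> bool:
    """CI-style gate: pass only if the spread stays within the threshold."""
    return spread(accuracies) <= max_spread
```

In a CI job, you would render the same evaluation examples with a handful of these variants, score each variant with your existing metric, and fail the build when `check_robustness` returns `False`.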
Risks & Boundaries
Limitations
Grammar of plausible formats is manually designed and may include unnatural but semantically equivalent strings.
Experiments focus on tasks with shorter instructions and classification/short-generation tasks; very long inputs were excluded.
When Not To Use
If you control the whole system and a single prompt format reliably works in production, single-format evaluation may be sufficient.
When compute budget is extremely tight and you cannot spare any extra API or inference calls to sample formats.
Failure Modes
Degeneration: some formats cause models to produce no valid answer, skewing exact-match metrics.
Non-monotonic format space: local search can fail because small changes do not move smoothly toward better formats.
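One way to keep degeneration from silently distorting exact-match scores is to report the invalid-answer rate next to accuracy for each format. The sketch below is an assumption about how such bookkeeping might look, not something described in the paper; `valid_labels` is a hypothetical set of acceptable answers for a classification task, and normalization is limited to trimming whitespace and lowercasing.

```python
from typing import Dict, Sequence, Set


def score_with_validity(
    predictions: Sequence[str],
    golds: Sequence[str],
    valid_labels: Set[str],
) -> Dict[str, float]:
    """Exact-match accuracy plus the share of outputs that are not a valid label."""
    if not predictions or len(predictions) != len(golds):
        raise ValueError("predictions and golds must be non-empty and equal length")

    normalized = [p.strip().lower() for p in predictions]
    labels = {label.strip().lower() for label in valid_labels}

    n_valid = sum(1 for p in normalized if p in labels)
    n_correct = sum(
        1 for p, g in zip(normalized, golds) if p == g.strip().lower()
    )
    total = len(normalized)

    return {
        "accuracy": n_correct / total,
        "invalid_rate": (total - n_valid) / total,
        # Accuracy restricted to valid outputs; 0.0 when no output was valid.
        "accuracy_on_valid": n_correct / n_valid if n_valid else 0.0,
    }
```

A format whose low accuracy comes with a high `invalid_rate` is failing differently from one that answers validly but incorrectly, and the two cases may call for different fixes.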

