Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
40
Why It Matters For Business
Small, seemingly innocuous prompt formatting choices can produce large and unpredictable swings in LLM performance. Unless you test multiple formats, these swings can mislead model selection, harm user experience, or produce fragile products.
Summary TLDR
Prompt formatting (separators, casing, spacing, and small punctuation choices) can change few-shot LLM accuracy dramatically. Across 53 classification tasks and multiple models, the authors find median spreads of several accuracy points and extremes of up to 76 points for LLaMA-2-13B. They introduce FORMATSPREAD, a Bayesian bandit method that samples plausible, semantically equivalent prompt formats and cheaply estimates the expected performance interval. It needs no access to model weights, so it also works with API-only models. Practical takeaway: report a spread over formats, or run FORMATSPREAD, when benchmarking or comparing models.
Problem Statement
LLM benchmarks and applications often report results for a single prompt format. The paper asks: how much do small, meaning-preserving formatting choices (spaces, separators, casing, item numbering) change model performance, and how can that variability be estimated cheaply without access to model weights?
Main Contribution
Systematic study showing large accuracy variance from meaning-preserving prompt formatting across 53 tasks and several LLMs.
FORMATSPREAD: a grammar of plausible formats combined with a Bayesian bandit (Thompson sampling) procedure that samples formats and estimates the performance spread under a fixed compute budget, without access to model weights.
Analyses of which atomic formatting choices matter, evidence that format choices are identifiable in model embeddings, and empirical guidance on when single-format reporting is misleading.
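The bandit idea behind FORMATSPREAD can be illustrated with a minimal sketch. It is a simplified Beta-Bernoulli Thompson sampler, not the authors' implementation. The function names, the uniform Beta(1, 1) prior, the one-example-per-pull reward, and the even budget split are all assumptions made for illustration. The hypothetical `pull(arm)` callable stands in for "evaluate format `arm` on one example and return whether the model was correct".

```python
import random

def thompson_extreme(n_formats, pull, budget, find_max=True, seed=0):
    """Search for the best (or worst) format with Beta-Bernoulli Thompson sampling.

    pull(arm) -> bool is a hypothetical callable supplied by the caller.
    Returns (arm_index, posterior_mean_accuracy) of the selected format.
    """
    rng = random.Random(seed)
    pick = max if find_max else min
    wins = [1.0] * n_formats    # Beta alpha parameters (uniform prior, assumed)
    losses = [1.0] * n_formats  # Beta beta parameters
    for _ in range(budget):
        # Draw one plausible accuracy per format from its posterior.
        draws = [rng.betavariate(wins[i], losses[i]) for i in range(n_formats)]
        arm = pick(range(n_formats), key=draws.__getitem__)
        if pull(arm):
            wins[arm] += 1
        else:
            losses[arm] += 1
    means = [w / (w + l) for w, l in zip(wins, losses)]
    chosen = pick(range(n_formats), key=means.__getitem__)
    return chosen, means[chosen]

def estimate_spread(n_formats, pull, budget):
    """Split the budget between a max-seeking and a min-seeking run."""
    half = budget // 2
    best, hi = thompson_extreme(n_formats, pull, half, find_max=True, seed=1)
    worst, lo = thompson_extreme(n_formats, pull, half, find_max=False, seed=2)
    return best, worst, hi - lo
```

Because most of the budget goes to formats that look extreme, the estimate is cheaper than scoring every format on the full evaluation set. It remains an estimate that depends on the format set and the budget.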
Key Findings
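Formatting can change accuracy by very large amounts: up to 76 accuracy points for LLaMA-2-13B.
Typical variability is non-trivial: the median spread is several accuracy points across tasks and models.
FORMATSPREAD approximates the spread efficiently using Bayesian (Thompson) sampling over formats.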
Prompt format is visible in model internal embeddings and correlates with spread.
Results
Maximum observed spread
Median spread across tasks (sampled formats)
FORMATSPREAD cost example (GPT-3.5)
Accuracy
LLaMA-2-70B spread (1-shot, 320 formats)
Who Should Care
What To Try In 7 Days
Run FORMATSPREAD or sample 10–50 plausible prompt formats for your task and report the range of performance.
If comparing models, evaluate each model across the same set of formats or report spread to avoid biased comparisons.
Add a simple format-robustness test to CI: include a few variants of separators and casing, and check for a performance drop relative to your default format.
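The steps above can be sketched as a small harness. This is an illustration under stated assumptions, not the paper's grammar. The three axes (`SEPARATORS`, `CASINGS`, `JOINERS`) are a hypothetical mini-grammar, and `predict` is a placeholder for whatever function sends a prompt to your model and returns a label.

```python
import itertools

# Hypothetical mini-grammar: each axis is one meaning-preserving choice.
SEPARATORS = [": ", " - ", ":: "]
CASINGS = [str.title, str.upper, str.lower]
JOINERS = ["\n", " ", " || "]

def make_formats():
    """Enumerate every (separator, casing, joiner) combination."""
    return list(itertools.product(SEPARATORS, CASINGS, JOINERS))

def render(fmt, fields):
    """Render ordered (name, value) pairs as one prompt string."""
    sep, casing, joiner = fmt
    return joiner.join(f"{casing(name)}{sep}{value}" for name, value in fields)

def accuracy_by_format(formats, examples, predict):
    """Score each format; examples is a list of (fields, gold_label) pairs.

    predict(prompt) -> label is a hypothetical callable wrapping your model.
    """
    results = []
    for fmt in formats:
        correct = sum(predict(render(fmt, fields)) == gold
                      for fields, gold in examples)
        results.append(correct / len(examples))
    return results

def spread(accuracies):
    """Spread = best accuracy minus worst accuracy over the formats."""
    return max(accuracies) - min(accuracies)
```

In CI, one option is to fail the build when `spread(...)` on a small fixed evaluation set exceeds a threshold you choose. When comparing models, pass the same `formats` list to each model so the comparison is not biased by format choice.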
Optimization Features
Inference Optimization
- Used 4-bit quantization to run large model experiments
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Grammar of plausible formats is manually designed and may include unnatural but semantically equivalent strings.
- Experiments focus on tasks with shorter instructions and classification/short-generation tasks; very long inputs were excluded.
- FORMATSPREAD relies on sampled formats; reported spread is an estimate and depends on the chosen grammar and sample budget.
When Not To Use
- If you control the whole system and a single prompt format reliably works in production, single-format evaluation may be sufficient.
- When compute budget is extremely tight and you cannot spare any extra API or inference calls to sample formats.
Failure Modes
- Degeneration: some formats cause models to produce no valid answer, skewing exact-match metrics.
- Non-monotonic format space: local search can fail because small changes do not move smoothly toward better formats.
- Human-unlikely formats in the grammar can inflate measured spread compared to real-world usage.
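One way to keep the degeneration failure mode visible is to report the share of invalid outputs alongside accuracy, so a format that yields no valid answer is not mistaken for one that yields wrong answers. The sketch below is an assumption-laden illustration: the function name, the lowercase-and-strip normalization, and the closed label set are choices made here, not taken from the paper.

```python
def score_with_degeneration(predictions, golds, valid_labels):
    """Exact-match accuracy plus the rate of outputs outside the label set.

    Assumes every gold label belongs to valid_labels.
    """
    n = len(golds)
    normalized = [p.strip().lower() for p in predictions]
    valid = {v.strip().lower() for v in valid_labels}
    invalid = sum(p not in valid for p in normalized)
    correct = sum(p == g.strip().lower() for p, g in zip(normalized, golds))
    return {
        "accuracy": correct / n,
        "invalid_rate": invalid / n,
        # Accuracy restricted to outputs that were valid labels at all.
        "accuracy_on_valid": correct / (n - invalid) if n > invalid else None,
    }
```

A format with a high `invalid_rate` is likely degenerate, and its contribution to the measured spread may reflect output parsing rather than task ability.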
Core Entities
Models
- LLaMA-2-7B
- LLaMA-2-13B
- LLaMA-2-70B
- Falcon-7B
- Falcon-7B-Instruct
- GPT-3.5-Turbo
Metrics
- Accuracy
- ROUGE-L
- BERTScore
Datasets
- Super-NaturalInstructions (subset of 53 tasks)
- Instruction Induction (10 generation tasks)
Benchmarks
- Super-NaturalInstructions

