Overview
SWiM is ready as a practical diagnostic for single-document QA. Medoid voting is low-cost and effective in tested settings. Results are limited to the datasets and models evaluated and need wider validation for multi-document or citation-heavy tasks.
Citations: 2
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 1/3
Findings with evidence refs: 3/3
Results with explicit delta: 1/1
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
SWiM finds real-world failure modes that synthetic tests miss; medoid voting is a cheap, production-friendly fix that can raise QA accuracy without retraining.
Summary TLDR
The authors introduce SWiM, a practical evaluation pipeline that benchmarks long-context LLMs on realistic document QA tasks and document-position effects. SWiM finds many long-context models lose accuracy when the answer is placed in the middle of the context window ('lost-in-the-middle'). They propose medoid voting: generate several outputs by randomly permuting document order, embed outputs, and pick the medoid. Medoid voting is training-free and improves single-document QA accuracy by up to 24.2 percentage points on evaluated models. Code and data pointers are provided.
Problem Statement
Standard long-context tests (like needle-in-a-haystack) use synthetic needles or toy tasks and miss practical failure modes. Practitioners need an end-to-end way to test models on their real documents, measure position and noise effects, and apply low-cost fixes when answers get 'lost in the middle'.
Main Contribution
SWiM: an end-to-end, customizable evaluation pipeline for long-context QA on user documents.
Empirical finding: many long-context models drop accuracy when the answer sits midway in the context (lost-in-the-middle).
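A position-sensitivity check of this kind can be sketched as follows. This is a minimal illustration and not SWiM's actual API: `place_at_depth`, `accuracy_by_depth`, `ask_model`, and `is_correct` are hypothetical names, and the model call and the correctness judge are supplied by the caller.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def place_at_depth(answer_doc: str, distractors: Sequence[str], depth: float) -> List[str]:
    """Insert the answer document at a relative depth (0.0 = first, 1.0 = last)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    idx = round(depth * len(distractors))
    return list(distractors[:idx]) + [answer_doc] + list(distractors[idx:])


def accuracy_by_depth(
    examples: Sequence[Tuple[str, str, str]],  # (question, answer_doc, reference answer)
    distractors: Sequence[str],
    depths: Sequence[float],
    ask_model: Callable[[str, List[str]], str],  # hypothetical: caller-supplied LLM call
    is_correct: Callable[[str, str], bool],      # hypothetical: caller-supplied judge
) -> Dict[float, float]:
    """Return accuracy for each answer-document depth."""
    if not examples:
        raise ValueError("examples must not be empty")
    results: Dict[float, float] = {}
    for depth in depths:
        hits = 0
        for question, answer_doc, reference in examples:
            context = place_at_depth(answer_doc, distractors, depth)
            prediction = ask_model(question, context)
            hits += bool(is_correct(prediction, reference))
        results[depth] = hits / len(examples)
    return results
```

A dip in the returned accuracies around depth 0.5 relative to depths 0.0 and 1.0 would be the lost-in-the-middle pattern described above.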
Key Findings
Long-context models commonly perform worse when the answer document appears in the middle of the context window.
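Medoid voting improves single-document QA accuracy by 17.3 percentage points for GPT-4-Turbo and 24.2 percentage points for GPT-3.5-Turbo-16k, relative to each model's baseline at 25% document depth.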
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | baseline -> medoid voting | model performance at 25% doc depth | GPT-4-Turbo +17.3 pp; GPT-3.5-Turbo-16k +24.2 pp | Cosmopedia story_forum (GPT-4 generated QA) | Section 4.3; Figure 4 | Section 4.3 |
What To Try In 7 Days
Run SWiM on a representative set of your documents to measure position sensitivity.
Implement medoid voting: 3–5 random permutations, embed outputs, pick medoid.
Filter 'no information' responses before medoid selection to avoid averaging away correct answers.
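The last two steps can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' implementation: `ask_model` and `embed` are hypothetical caller-supplied functions, cosine distance is an assumed choice of metric, and the 'no information' filter is a simple substring check.

```python
import math
import random
from typing import Callable, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; treats a zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def medoid_index(vectors: Sequence[Sequence[float]]) -> int:
    """Index of the vector with the smallest total distance to all others."""
    if not vectors:
        raise ValueError("vectors must not be empty")
    totals = [sum(cosine_distance(v, w) for w in vectors) for v in vectors]
    return min(range(len(vectors)), key=totals.__getitem__)


def medoid_vote(
    question: str,
    documents: Sequence[str],
    ask_model: Callable[[str, List[str]], str],  # hypothetical: caller-supplied LLM call
    embed: Callable[[str], Sequence[float]],     # hypothetical: caller-supplied embedder
    n_permutations: int = 5,
    is_no_info: Callable[[str], bool] = lambda a: "no information" in a.lower(),
    seed: int = 0,
) -> str:
    """Answer under several random document orders and return the medoid answer."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n_permutations):
        shuffled = list(documents)
        rng.shuffle(shuffled)
        answers.append(ask_model(question, shuffled))
    # Drop 'no information' responses before selection, as suggested above.
    kept = [a for a in answers if not is_no_info(a)]
    if not kept:
        return answers[0]  # every run abstained; return the abstention
    vectors = [embed(a) for a in kept]
    return kept[medoid_index(vectors)]
```

Each call costs `n_permutations` model requests plus one embedding per kept answer, so the permutation count is the main cost lever.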
Risks & Boundaries
Limitations
SWiM focuses on single-document QA and may not cover multi-document reasoning or citation tasks.
Evaluation uses GPT-generated QA and Cosmopedia stories; results may shift on other domains.
When Not To Use
When your task requires multi-document reasoning or precise citation linking.
When outputs must be deterministic and averaging across permutations is unacceptable.
Failure Modes
Medoid voting can discard a rare but correct answer: the medoid is the most central output, so a correct minority answer loses to a consistent wrong majority.
Automatic LLM judges can mislabel correct answers, skewing measured accuracy.

