Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
2
Why It Matters For Business
SWiM finds real-world failure modes that synthetic tests miss; medoid voting is a cheap, production-friendly fix that can raise QA accuracy without retraining.
Summary TLDR
The authors introduce SWiM, a practical evaluation pipeline that benchmarks long-context LLMs on realistic document QA tasks and measures how the position of the answer document affects accuracy. SWiM finds that many long-context models lose accuracy when the answer is placed in the middle of the context window (the 'lost-in-the-middle' effect). They propose medoid voting: generate several outputs while randomly permuting the document order, embed the outputs, and pick the medoid. Medoid voting is training-free and improves single-document QA accuracy by up to 24.2 percentage points on the evaluated models. Code and data pointers are provided.
Problem Statement
Standard long-context tests (like needle-in-a-haystack) use synthetic needles or toy tasks and miss practical failure modes. Practitioners need an end-to-end way to test models on their real documents, measure position and noise effects, and apply low-cost fixes when answers get 'lost in the middle'.
Main Contribution
SWiM: an end-to-end, customizable evaluation pipeline for long-context QA on user documents.
Empirical finding: many long-context models drop accuracy when the answer sits midway in the context (lost-in-the-middle).
Medoid voting: a training-free inference-time correction that selects a consensus output across random document permutations.
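The medoid selection step can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the output embeddings are already computed as plain lists of floats, uses cosine distance as the dissimilarity measure (an assumption), and the names `cosine_distance` and `medoid_index` are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)

def medoid_index(embeddings):
    """Index of the vector with the smallest total distance to all others."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    totals = [
        sum(cosine_distance(e, other) for other in embeddings)
        for e in embeddings
    ]
    return min(range(len(embeddings)), key=totals.__getitem__)

# Toy embeddings (made up for illustration): three outputs that roughly
# agree and one outlier at index 3.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],
]
chosen = medoid_index(toy_embeddings)  # selects index 2, never the outlier
```

The medoid is always one of the actual outputs, so the selected answer is a real model response and not a blend of several.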
Key Findings
Long-context models commonly perform worse when the answer document appears in the middle of the context window.
Medoid voting improves single-document QA accuracy by up to 24.2 percentage points on the evaluated models.
NIAH-style tests can give misleadingly optimistic results compared to SWiM on realistic QA tasks.
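A position-sensitivity check of the kind described above could be sketched as follows. This is not SWiM's API: the helper names `place_answer_doc` and `accuracy_by_depth` are hypothetical, depth is assumed to be a fraction between 0.0 (start of context) and 1.0 (end), and the records in the example are made up.

```python
from collections import defaultdict

def place_answer_doc(distractors, answer_doc, depth):
    """Insert answer_doc among distractors at a relative depth in [0.0, 1.0]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(distractors))
    return distractors[:index] + [answer_doc] + distractors[index:]

def accuracy_by_depth(records):
    """Aggregate (depth, is_correct) pairs into {depth: accuracy}."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for depth, is_correct in records:
        counts[depth] += 1
        hits[depth] += int(is_correct)
    return {d: hits[d] / counts[d] for d in sorted(counts)}

# Build one context with the answer document halfway through.
context = place_answer_doc(["d1", "d2", "d3", "d4"], "ANSWER", 0.5)

# Made-up judgments, only to show the aggregation shape.
records = [(0.0, True), (0.5, False), (0.5, True), (1.0, True)]
per_depth = accuracy_by_depth(records)
```

A dip in the middle depths of `per_depth` relative to the ends would be the lost-in-the-middle pattern.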
Results
Accuracy
Who Should Care
What To Try In 7 Days
Run SWiM on a representative set of your documents to measure position sensitivity.
Implement medoid voting: query the model on 3–5 random permutations of the document order, embed the outputs, and pick the medoid.
Filter 'no information' responses before medoid selection so that they cannot crowd out a correct answer.
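The last two steps might be wired together as in the sketch below. Several parts are assumptions made for illustration: `ask_model` is a hypothetical callable standing in for your LLM client, the 'no information' check is a simple substring match, and `difflib` string similarity replaces a real embedding model so the example runs with the standard library only.

```python
import random
from difflib import SequenceMatcher

def text_distance(a, b):
    """Stand-in for embedding distance: 1 - difflib similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def medoid_vote(question, documents, ask_model, n_permutations=5,
                is_no_info=lambda s: "no information" in s.lower(),
                distance=text_distance, seed=0):
    """Query the model on shuffled document orders and return the medoid output."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_permutations):
        shuffled = documents[:]  # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        outputs.append(ask_model(question, shuffled))
    # Drop 'no information' responses before choosing the medoid.
    candidates = [o for o in outputs if not is_no_info(o)]
    if not candidates:
        return outputs[0]  # every run came back empty-handed
    return min(candidates,
               key=lambda c: sum(distance(c, o) for o in candidates))

# Usage with a canned stand-in model (hypothetical replies, not real results).
canned = iter(["Paris", "No information found.", "Paris.",
               "Paris", "No information found."])
answer = medoid_vote("What is the capital?", ["doc A", "doc B", "doc C"],
                     ask_model=lambda q, docs: next(canned))
```

Passing `distance` and `is_no_info` as parameters keeps the sketch easy to adapt: swap in an embedding-based distance and a stricter refusal detector once those are available.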
Agent Features
Memory
- short-term (context window testing)
Frameworks
- SWiM
Optimization Features
Inference Optimization
- medoid voting (inference-time aggregation)
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- SWiM focuses on single-document QA and may not cover multi-document reasoning or citation tasks.
- Evaluation uses GPT-generated QA and Cosmopedia stories; results may shift on other domains.
- LLM-as-a-judge can be inconsistent; the framework requires human validation to ensure label quality.
- Medoid voting can reduce best-case performance, because the consensus pick may not be the best individual output unless 'no information' responses are filtered first.
When Not To Use
- When your task requires multi-document reasoning or precise citation linking.
- When outputs must be deterministic and aggregating over random permutations is unacceptable.
Failure Modes
- Medoid voting may discard a rare but correct answer when most permutations produce a different output (consensus effect).
- Automatic LLM judges can mislabel correct answers, skewing measured accuracy.
- Model behavior may differ outside the tested dataset and document formats.
Core Entities
Models
- gpt-4
- gpt-4-turbo
- gpt-4-0125-preview
- gpt-3.5-turbo
- gpt-3.5-turbo-16k
- claude-2.1
- claude-3-opus-20240229
- gemini-1.5-pro-preview-0409
- mixtral-8x7B-instruct-v0.1
- mistral-8x7b-instruct
Metrics
- Accuracy
- effective-context-usage
- position-depth performance
Datasets
- HuggingFace Cosmopedia (story_forum)
Benchmarks
- SWiM
- NIAH (needle-in-a-haystack)
- RULER
- LongBench
- Long Range Arena

