SWiM: a working-memory test exposes 'lost-in-the-middle' and fixes it with cheap medoid voting

July 4, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

SWiM is ready as a practical diagnostic for single-document QA. Medoid voting is low-cost and effective in tested settings. Results are limited to the datasets and models evaluated and need wider validation for multi-document or citation-heavy tasks.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 1/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/1

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Amanda Dsouza, Christopher Glaze, Changho Shin, Frederic Sala

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SWiM finds real-world failure modes that synthetic tests miss; medoid voting is a cheap, production-friendly fix that can raise QA accuracy without retraining.

Who Should Care

Summary TLDR

The authors introduce SWiM, a practical evaluation pipeline that tests long-context LLMs on realistic document QA and measures how accuracy changes with the position of the answer document. SWiM finds that many long-context models lose accuracy when the answer is placed in the middle of the context window ('lost-in-the-middle'). To counter this, the authors propose medoid voting: generate several outputs, each with the documents in a different random order, embed the outputs, and return the medoid (the output with the smallest total distance to all the others). Medoid voting is training-free and improves single-document QA accuracy by up to 24.2 percentage points on the evaluated models. Code and data pointers are provided.
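The medoid selection step can be written compactly. This is a sketch under assumptions, not the authors' notation: $y_1, \dots, y_k$ stand for the outputs from $k$ random document orders, $e(\cdot)$ is whatever embedding model is used, and $d$ is a distance on embeddings (cosine distance is a common choice, but the summary does not specify one).

```latex
\hat{y} = y_{i^*}, \qquad
i^* = \arg\min_{i \in \{1,\dots,k\}} \sum_{j=1}^{k} d\big(e(y_i),\, e(y_j)\big)
```

Unlike a centroid, the medoid is always one of the actual outputs, so the returned answer is text the model really produced.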

Problem Statement

Standard long-context tests (like needle-in-a-haystack) use synthetic needles or toy tasks and miss practical failure modes. Practitioners need an end-to-end way to test models on their real documents, measure position and noise effects, and apply low-cost fixes when answers get 'lost in the middle'.

Main Contribution

SWiM: an end-to-end, customizable evaluation pipeline for long-context QA on user documents.

Empirical finding: many long-context models drop accuracy when the answer sits midway in the context (lost-in-the-middle).

Key Findings

Long-context models commonly perform worse when the answer document appears in the middle of the context window.

Practical Use: When using LLMs on long documents, expect weaker retrieval if key info sits mid-window; test document positions on your data.

Evidence Ref: Section 4.2; Figure 3 (position tests show lower accuracy at 25–75% depths)
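A position test of this kind can be sketched as follows. This is a minimal illustration, not SWiM's actual API: `place_at_depth`, `accuracy_by_depth`, the `"doc"` key, and the `is_correct` callable (which would call your model and grade its response) are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Sequence


def place_at_depth(answer_doc: str, distractors: Sequence[str], depth: float) -> List[str]:
    """Return the documents in order, with answer_doc inserted at a relative depth.

    depth=0.0 puts the answer document first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(distractors))
    docs = list(distractors)
    docs.insert(index, answer_doc)
    return docs


def accuracy_by_depth(
    examples: Sequence[dict],
    distractors: Sequence[str],
    is_correct: Callable[[List[str], dict], bool],
    depths: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> Dict[float, float]:
    """Fraction of examples answered correctly at each depth.

    is_correct is a hypothetical callable: it should query your model with the
    ordered documents and grade the response for the given example.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    results = {}
    for depth in depths:
        hits = sum(
            is_correct(place_at_depth(ex["doc"], distractors, depth), ex)
            for ex in examples
        )
        results[depth] = hits / len(examples)
    return results
```

A dip in the returned accuracies around the middle depths, relative to 0.0 and 1.0, is the pattern the finding describes.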

Medoid voting improves single-document QA accuracy by double-digit points on evaluated models.

Numbers: GPT-4-Turbo +17.3 pts; GPT-3.5-Turbo-16k +24.2 pts

Practical Use: Run 3–5 random-document permutations and pick the medoid output to boost accuracy without model changes.

Evidence Ref: Section 4.3; Figure 4
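The permutation-and-medoid procedure might look like the sketch below. It is an illustration under stated assumptions rather than the authors' implementation: `ask_model` is a hypothetical callable wrapping your LLM, `toy_embed` is a bag-of-words stand-in for a real sentence embedder, and cosine distance is an assumed choice of metric.

```python
import math
import random
from collections import Counter
from typing import Callable, List, Sequence


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts. Swap in a real sentence embedder."""
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is empty."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def medoid_index(vectors: Sequence[Counter]) -> int:
    """Index of the vector with the smallest total distance to all others."""
    totals = [sum(cosine_distance(v, w) for w in vectors) for v in vectors]
    return totals.index(min(totals))


def medoid_vote(
    question: str,
    docs: Sequence[str],
    ask_model: Callable[[str, List[str]], str],
    runs: int = 5,
    seed: int = 0,
) -> str:
    """Query the model over several random document orders and return the medoid answer."""
    rng = random.Random(seed)
    answers = []
    for _ in range(runs):
        shuffled = list(docs)
        rng.shuffle(shuffled)  # a fresh random document order for each run
        answers.append(ask_model(question, shuffled))
    return answers[medoid_index([toy_embed(a) for a in answers])]
```

Each call costs `runs` model queries plus one embedding per answer, so the 3–5 range quoted above keeps the overhead to a small constant factor.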

Results

Metric: Accuracy
Value: baseline -> medoid voting
Baseline: model performance at 25% doc depth
Delta: GPT-4-Turbo +17.3 pp; GPT-3.5-Turbo-16k +24.2 pp
Split / Dataset: Cosmopedia story_forum (GPT-4 generated QA)
Evidence: Section 4.3; Figure 4
Evidence Ref: Section 4.3

What To Try In 7 Days

Run SWiM on a representative set of your documents to measure position sensitivity.

Implement medoid voting: 3–5 random permutations, embed outputs, pick medoid.

Filter 'no information' responses before medoid selection so that refusals cannot outvote a correct answer.
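The filtering step in the last item could be as simple as the sketch below. The marker phrases and the `drop_no_info` helper are hypothetical; the source does not say how 'no information' responses are detected, so adapt the list to what your model actually emits.

```python
from typing import List, Sequence

# Hypothetical refusal phrases; adjust to what your model actually emits.
NO_INFO_MARKERS = ("no information", "not mentioned", "cannot find", "does not contain")


def drop_no_info(responses: Sequence[str], markers: Sequence[str] = NO_INFO_MARKERS) -> List[str]:
    """Remove responses that look like 'no information' refusals.

    Falls back to the unfiltered list if every response is a refusal, so the
    downstream medoid step always has at least one candidate.
    """
    kept = [r for r in responses if not any(m in r.lower() for m in markers)]
    return kept if kept else list(responses)
```

Substring matching is deliberately crude here; a stricter detector (or an LLM-based check) may be needed if valid answers can contain these phrases.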

Agent Features

Memory: short-term (context window testing)
Frameworks: SWiM

Optimization Features

Inference Optimization: medoid voting (inference-time aggregation)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

SWiM focuses on single-document QA and may not cover multi-document reasoning or citation tasks.

Evaluation uses GPT-generated QA and Cosmopedia stories; results may shift on other domains.

When Not To Use

When your task requires multi-document reasoning or precise citation linking.

When outputs must be deterministic and aggregating over random permutations is unacceptable.

Failure Modes

Medoid voting may discard a rare but correct answer, because it favors the output most similar to the rest (an averaging effect).

Automatic LLM judges can mislabel correct answers, skewing measured accuracy.

Core Entities

Models

gpt-4, gpt-4-turbo, gpt-4-0125-preview, gpt-3.5-turbo, gpt-3.5-turbo-16k, claude-2.1, claude-3-opus-20240229, gemini-1.5-pro-preview-0409, mixtral-8x7B-instruct-v0.1, mistral-8x7b-instruct

Metrics

Accuracy, effective-context-usage, position-depth performance

Datasets

HuggingFace Cosmopedia (story_forum)

Benchmarks

SWiM, NIAH (needle-in-a-haystack), RULER, LongBench, Long Range Arena