SWiM: a working-memory test exposes 'lost-in-the-middle' and fixes it with cheap medoid voting

Overview

Decision Snapshot: Needs Validation

SWiM is ready as a practical diagnostic for single-document QA. Medoid voting is low-cost and effective in tested settings. Results are limited to the datasets and models evaluated and need wider validation for multi-document or citation-heavy tasks.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 1/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/1

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Amanda Dsouza, Christopher Glaze, Changho Shin, Frederic Sala

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SWiM finds real-world failure modes that synthetic tests miss; medoid voting is a cheap, production-friendly fix that can raise QA accuracy without retraining.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

The authors introduce SWiM, a practical evaluation pipeline that benchmarks long-context LLMs on realistic document QA tasks and measures how document position affects accuracy. SWiM finds that many long-context models lose accuracy when the answer is placed in the middle of the context window ('lost-in-the-middle'). They propose medoid voting: generate several outputs while randomly permuting document order, embed the outputs, and pick the medoid, that is, the output closest on average to all the others. Medoid voting is training-free and improves single-document QA accuracy by up to 24.2 percentage points on the evaluated models. Links to code and data are provided.
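The medoid selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the outputs have already been embedded as plain lists of floats, and it assumes cosine distance as the dissimilarity measure. The function names `cosine_distance` and `medoid_index` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity); assumed metric, not confirmed by the source."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat a zero vector as maximally dissimilar to avoid division by zero.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def medoid_index(vectors: List[Sequence[float]]) -> int:
    """Return the index of the vector with the smallest total distance to all vectors."""
    if not vectors:
        raise ValueError("need at least one embedded output")
    totals = [sum(cosine_distance(u, v) for v in vectors) for u in vectors]
    return min(range(len(vectors)), key=totals.__getitem__)


# Toy vectors standing in for embedded model outputs (illustrative values only).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
chosen = medoid_index(toy_embeddings)
print(chosen)
```

Unlike a centroid, the medoid is always one of the actual outputs, so the selected answer is text the model really produced rather than a blend.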

Problem Statement

Standard long-context tests (like needle-in-a-haystack) use synthetic needles or toy tasks and miss practical failure modes. Practitioners need an end-to-end way to test models on their real documents, measure position and noise effects, and apply low-cost fixes when answers get 'lost in the middle'.

Main Contribution

SWiM: an end-to-end, customizable evaluation pipeline for long-context QA on user documents.

Empirical finding: many long-context models drop accuracy when the answer sits midway in the context (lost-in-the-middle).

Key Findings

Long-context models commonly perform worse when the answer document appears in the middle of the context window.

Practical Use: When using LLMs on long documents, expect weaker retrieval if key information sits mid-window; test document positions on your own data.

Evidence Ref: Section 4.2; Figure 3 (position tests show lower accuracy at 25–75% depths)
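A position test of this kind needs a way to place the answer document at a chosen depth among distractor documents. The sketch below is a hypothetical helper, not part of the SWiM codebase, and it measures depth by document count; a real pipeline may measure depth in tokens instead.

```python
from typing import List, Sequence


def place_at_depth(answer_doc: str, distractors: Sequence[str], depth: float) -> List[str]:
    """Insert answer_doc among distractors at a relative depth in [0.0, 1.0].

    depth=0.0 puts the answer first, depth=1.0 puts it last.
    Depth is counted in documents here (an assumption of this sketch).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(distractors))
    ordered = list(distractors)
    ordered.insert(index, answer_doc)
    return ordered


# Build one context per depth to compare accuracy across positions.
distractor_docs = ["d1", "d2", "d3", "d4"]
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(depth, place_at_depth("ANSWER", distractor_docs, depth))
```

Running the same question set at each depth and comparing accuracy gives a simple position-sensitivity profile for your own documents.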

Medoid voting improves single-document QA accuracy by double-digit points on evaluated models.

Numbers: GPT-4-Turbo +17.3 pts; GPT-3.5-Turbo-16k +24.2 pts

Practical Use: Run 3–5 random document permutations and pick the medoid output to boost accuracy without model changes.

Evidence Ref: Section 4.3; Figure 4
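The permutation step could be organized as follows. This is a sketch under stated assumptions: `ask_model` is a hypothetical callable that you supply (it takes a question and an ordered list of documents and returns the model's answer), and the default of five runs simply follows the 3–5 range suggested above.

```python
import random
from typing import Callable, List, Sequence


def answers_over_permutations(
    question: str,
    documents: Sequence[str],
    ask_model: Callable[[str, List[str]], str],  # hypothetical: wraps your LLM call
    n_runs: int = 5,
    seed: int = 0,
) -> List[str]:
    """Query the model n_runs times, shuffling document order before each call."""
    rng = random.Random(seed)  # seeded so an evaluation run can be repeated
    answers = []
    for _ in range(n_runs):
        shuffled = list(documents)
        rng.shuffle(shuffled)
        answers.append(ask_model(question, shuffled))
    return answers


# Stand-in model for demonstration: it just echoes the first document it sees.
def echo_first(question: str, docs: List[str]) -> str:
    return docs[0]


candidates = answers_over_permutations("Which doc is first?", ["a", "b", "c"], echo_first, n_runs=3)
print(candidates)
```

The returned candidates are then embedded and reduced to a single answer by medoid selection. Note that each run is a separate model call, so inference cost scales linearly with `n_runs`.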

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	baseline -> medoid voting	model performance at 25% doc depth	GPT-4-Turbo +17.3 pp; GPT-3.5-Turbo-16k +24.2 pp	Cosmopedia story_forum (GPT-4 generated QA)	Section 4.3; Figure 4	Section 4.3

What To Try In 7 Days

Run SWiM on a representative set of your documents to measure position sensitivity.

Implement medoid voting: 3–5 random permutations, embed outputs, pick medoid.

Filter 'no information' responses before medoid selection so that several similar non-answers cannot outweigh a correct answer.
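The filtering step in the last item might look like the sketch below. The marker phrase and the function name `drop_non_answers` are assumptions made for illustration; the exact wording of a model's non-answers depends on the prompt and model, so the marker list should be adapted to what your model actually emits.

```python
from typing import List, Sequence


def drop_non_answers(
    answers: Sequence[str],
    markers: Sequence[str] = ("no information",),  # assumed phrase; adjust to your model
) -> List[str]:
    """Remove answers containing a non-answer marker (case-insensitive).

    If every answer is a non-answer, return them all unchanged so that
    downstream selection still has something to choose from.
    """
    kept = [
        answer
        for answer in answers
        if not any(marker.lower() in answer.lower() for marker in markers)
    ]
    return kept if kept else list(answers)


raw = ["Paris", "No information available.", "Paris is the capital"]
print(drop_non_answers(raw))
```

Substring matching is deliberately crude; it can also discard a real answer that happens to contain the marker phrase, so it is worth spot-checking what gets filtered.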

Agent Features

Memory

short-term (context window testing)

Frameworks

SWiM

Optimization Features

Inference Optimization

medoid voting (inference-time aggregation)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/snorkel-ai/long-context-eval

Data URLs

https://huggingface.co/datasets/HuggingFaceTB/cosmopedia

Risks & Boundaries

Limitations

SWiM focuses on single-document QA and may not cover multi-document reasoning or citation tasks.

Evaluation uses GPT-generated QA and Cosmopedia stories; results may shift on other domains.

When Not To Use

When your task requires multi-document reasoning or precise citation linking.

When outputs must be deterministic and selecting among randomly permuted runs is unacceptable.

Failure Modes

Medoid voting may discard a rare but correct answer: the medoid is the most central output, so a correct answer that appears in only a minority of runs can be outvoted.

Automatic LLM judges can mislabel correct answers, skewing measured accuracy.

Core Entities

Models

gpt-4, gpt-4-turbo, gpt-4-0125-preview, gpt-3.5-turbo, gpt-3.5-turbo-16k, claude-2.1, claude-3-opus-20240229, gemini-1.5-pro-preview-0409, mixtral-8x7B-instruct-v0.1, mistral-8x7b-instruct

Metrics

Accuracy, effective-context-usage, position-depth performance

Datasets

HuggingFace Cosmopedia (story_forum)

Benchmarks

SWiM, NIAH (needle-in-a-haystack), RULER, LongBench, Long Range Arena

