SWiM: a working-memory test exposes 'lost-in-the-middle' and fixes it with cheap medoid voting

July 4, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

2

Authors

Amanda Dsouza, Christopher Glaze, Changho Shin, Frederic Sala

Links

Abstract / PDF

Why It Matters For Business

SWiM finds real-world failure modes that synthetic tests miss; medoid voting is a cheap, production-friendly fix that can raise QA accuracy without retraining.

Summary TLDR

The authors introduce SWiM, a practical evaluation pipeline that benchmarks long-context LLMs on realistic document QA tasks and measures how the position of the answer document affects accuracy. SWiM finds that many long-context models lose accuracy when the answer is placed in the middle of the context window ('lost-in-the-middle'). To counter this, they propose medoid voting: generate several outputs, each with the document order randomly permuted, embed the outputs, and pick the medoid. Medoid voting is training-free and improves single-document QA accuracy by up to 24.2 percentage points on the evaluated models. Code and data pointers are provided.

Problem Statement

Standard long-context tests (like needle-in-a-haystack) use synthetic needles or toy tasks and miss practical failure modes. Practitioners need an end-to-end way to test models on their real documents, measure position and noise effects, and apply low-cost fixes when answers get 'lost in the middle'.

Main Contribution

SWiM: an end-to-end, customizable evaluation pipeline for long-context QA on user documents.

Empirical finding: many long-context models drop accuracy when the answer sits midway in the context (lost-in-the-middle).

Medoid voting: a training-free inference-time correction that selects a consensus output across random document permutations.
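For readers unfamiliar with the term, the medoid can be written down compactly. The notation below is an assumption for illustration and does not come from this summary: y_1, ..., y_k are the k responses generated under different document permutations, e(.) is an embedding function, and d(.,.) is a distance between embeddings (for example cosine distance).

```latex
% Medoid of k responses: the response whose embedding has the smallest
% total distance to the embeddings of all responses.
% e(.) and d(.,.) are illustrative assumptions, not specified in this summary.
\[
  y^{\star} \;=\; \operatorname*{arg\,min}_{y_i \in \{y_1,\dots,y_k\}}
  \sum_{j=1}^{k} d\bigl(e(y_i),\, e(y_j)\bigr)
\]
```

Unlike a mean, the medoid is always one of the actual responses, so the selected output is text the model really produced.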

Key Findings

Long-context models commonly perform worse when the answer document appears in the middle of the context window.

Medoid voting improves single-document QA accuracy by double-digit percentage points on the evaluated models.

Numbers: GPT-4-Turbo +17.3 pts; GPT-3.5-Turbo-16k +24.2 pts

NIAH-style tests can give misleadingly optimistic results compared to SWiM on realistic QA tasks.
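The position effect in the first finding can be probed with a small harness that places the answer document at a chosen depth among distractor documents and then aggregates correctness by depth. The sketch below is a minimal illustration and is not SWiM's API: `place_at_depth` and `accuracy_by_depth` are hypothetical helper names, and depth is measured in documents rather than tokens, which is only an approximation of token depth.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def place_at_depth(distractors: List[str], answer_doc: str, depth: float) -> List[str]:
    """Insert answer_doc at a fractional depth among distractors.

    depth = 0.0 puts the answer document first, 1.0 puts it last.
    Hypothetical helper; depth is counted in documents, not tokens.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    idx = round(depth * len(distractors))
    return distractors[:idx] + [answer_doc] + distractors[idx:]


def accuracy_by_depth(results: Iterable[Tuple[float, bool]]) -> Dict[float, float]:
    """Aggregate (depth, is_correct) pairs into per-depth accuracy."""
    hits: Dict[float, int] = defaultdict(int)
    totals: Dict[float, int] = defaultdict(int)
    for depth, is_correct in results:
        totals[depth] += 1
        hits[depth] += int(is_correct)
    return {d: hits[d] / totals[d] for d in sorted(totals)}
```

A dip in the per-depth accuracy around the middle depths, relative to the start and end, is the pattern the summary calls 'lost-in-the-middle'.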

Results

Accuracy

Value: baseline -> medoid voting

Baseline: model performance at 25% doc depth

Who Should Care

What To Try In 7 Days

Run SWiM on a representative set of your documents to measure position sensitivity.

Implement medoid voting: 3–5 random permutations, embed outputs, pick medoid.

Filter 'no information' responses before medoid selection to avoid averaging away correct answers.
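The last two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ask_model` is a hypothetical callable that sends a question plus an ordered list of documents to your LLM, `toy_embed` is a bag-of-words stand-in for a real embedding model, cosine distance is an assumed choice of metric, and the `"no information"` marker is an assumed way to detect refusals.

```python
import math
import random
from collections import Counter
from typing import Callable, List, Sequence


def toy_embed(text: str) -> Counter:
    """Stand-in embedding (bag of words). Swap in a real encoder in practice."""
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is empty."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def medoid(responses: Sequence[str], embed: Callable[[str], Counter] = toy_embed) -> str:
    """Return the response with the smallest total distance to all responses."""
    vecs = [embed(r) for r in responses]
    totals = [sum(cosine_distance(v, w) for w in vecs) for v in vecs]
    return responses[totals.index(min(totals))]


def medoid_vote(
    question: str,
    docs: Sequence[str],
    ask_model: Callable[[str, List[str]], str],  # hypothetical LLM wrapper
    n_runs: int = 5,
    seed: int = 0,
    no_info_marker: str = "no information",  # assumed refusal phrasing
) -> str:
    """Query the model over random document orders and pick the medoid answer."""
    rng = random.Random(seed)
    responses = []
    for _ in range(n_runs):
        shuffled = list(docs)
        rng.shuffle(shuffled)  # new random document order for each run
        responses.append(ask_model(question, shuffled))
    # Drop refusals first so they cannot dominate the consensus;
    # fall back to all responses if every run refused.
    informative = [r for r in responses if no_info_marker not in r.lower()]
    return medoid(informative or responses)
```

Each call costs `n_runs` model queries plus the embedding work, so the 3–5 permutations suggested above keep the overhead to a small constant multiple of a single query.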

Agent Features

Memory

  • short-term (context window testing)

Frameworks

  • SWiM

Optimization Features

Inference Optimization

  • medoid voting (inference-time aggregation)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • SWiM focuses on single-document QA and may not cover multi-document reasoning or citation tasks.
  • Evaluation uses GPT-generated QA and Cosmopedia stories; results may shift on other domains.
  • LLM-as-a-judge can be inconsistent; the framework requires human validation to ensure label quality.
  • Medoid voting can reduce best-case performance because the consensus pick may discard a correct minority answer, unless 'no information' responses are filtered out first.

When Not To Use

  • When your task requires multi-document reasoning or precise citation linking.
  • When outputs must be deterministic and averaging across permutations is unacceptable.

Failure Modes

  • Medoid voting may discard a rare but correct answer when most permutations produce a different response (consensus effect).
  • Automatic LLM judges can mislabel correct answers, skewing measured accuracy.
  • Model behavior may differ outside the tested dataset and document formats.

Core Entities

Models

  • gpt-4
  • gpt-4-turbo
  • gpt-4-0125-preview
  • gpt-3.5-turbo
  • gpt-3.5-turbo-16k
  • claude-2.1
  • claude-3-opus-20240229
  • gemini-1.5-pro-preview-0409
  • mixtral-8x7B-instruct-v0.1
  • mistral-8x7b-instruct

Metrics

  • Accuracy
  • effective-context-usage
  • position-depth performance

Datasets

  • HuggingFace Cosmopedia (story_forum)

Benchmarks

  • SWiM
  • NIAH (needle-in-a-haystack)
  • RULER
  • LongBench
  • Long Range Arena