A small GPT‑2 with recurrent memory reads 11 million tokens and finds facts big LLMs miss

February 16, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides clear experimental evidence on synthetic long‑context tasks and a public benchmark; the core claim (RMT handles millions of tokens) is supported, but real‑world domain shifts and parallelism/latency tradeoffs remain to be tested.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

Why It Matters For Business

When you must locate rare facts across very long documents, memory‑augmented models scale better and are cheaper than relying on huge LLM windows or naive RAG, so consider memory models for long‑document search and auditing.

Summary TLDR

The authors introduce BABILong, a 'needle-in-a-haystack' benchmark that hides simple facts inside millions of tokens of book text. Off‑the‑shelf LLMs (GPT‑4, Mistral) and standard RAG struggle as context noise grows. Augmenting a small GPT‑2 (137M) with recurrent memory (RMT), optionally combined with trainable self‑retrieval (RMT‑R), and fine‑tuning it with a curriculum lets it reliably retrieve and reason about facts in inputs of up to ~11 million tokens, which the authors report as a new scaling record on this task.

Problem Statement

Modern transformers struggle to find and use a few task‑relevant facts buried inside very long, noisy documents: self‑attention cost grows quadratically with input length, and attention alone loses focus as the amount of irrelevant text grows.

Main Contribution

BABILong: a benchmark that embeds bAbI-style tasks inside arbitrarily long book text to test long-context fact retrieval and reasoning.

Systematic evaluation showing GPT‑4, Mistral, and RAG degrade as distracting text increases, especially past tens of thousands of tokens.
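
The benchmark construction described above can be illustrated with a minimal sketch. It assumes whitespace-separated words stand in for tokens and that task facts are dropped at random positions while keeping their original order. The function name `make_haystack_sample` and the toy sentences are hypothetical and are not the authors' generator.

```python
import random

def make_haystack_sample(facts, question, background_sentences, target_words, seed=0):
    """Hide ordered task facts inside background text of roughly target_words words."""
    rng = random.Random(seed)
    # Take background sentences until the word budget is reached.
    haystack, n_words = [], 0
    for sentence in background_sentences:
        if n_words >= target_words:
            break
        haystack.append(sentence)
        n_words += len(sentence.split())
    # Sorted insertion points keep the facts in their original (temporal) order.
    positions = sorted(rng.randint(0, len(haystack)) for _ in facts)
    # Insert from the back so earlier insertion indices are not shifted.
    for pos, fact in reversed(list(zip(positions, facts))):
        haystack.insert(pos, fact)
    return {"context": " ".join(haystack), "question": question}

# Toy data: two bAbI-style facts and synthetic filler instead of real book text.
facts = ["Mary went to the kitchen.", "Mary travelled to the garden."]
background = [f"Filler sentence number {i} from a long book." for i in range(1000)]
sample = make_haystack_sample(facts, "Where is Mary?", background, target_words=2000)
```

Raising `target_words` lengthens the haystack without changing the task itself, which is the property the benchmark relies on.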

Key Findings

Recurrent memory model processes record-length inputs.

Numbers: Processed up to ~11,000,000 tokens (paper claims 11M)

Practical Use: If you need to scan extremely long documents for a few facts, train a recurrent‑memory model; it scales linearly and can handle millions of tokens.

Evidence Ref: Abstract, Conclusions, Sec.5
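
The linear-scaling claim can be checked with rough arithmetic. The sketch below counts token-pair interactions for one full-attention pass versus segment-by-segment reading with a small memory. The segment length and memory size are assumed illustrative values, not figures from this summary, and treating every segment as full-length makes the segmented count a slight upper bound.

```python
def full_attention_pairs(n_tokens):
    """Token-pair interactions for one attention pass over the whole input."""
    return n_tokens ** 2

def segmented_attention_pairs(n_tokens, segment_len, memory_tokens):
    """Token-pair interactions when the input is read segment by segment."""
    n_segments = -(-n_tokens // segment_len)  # ceiling division
    # Each segment attends over its own tokens plus the memory tokens.
    return n_segments * (segment_len + memory_tokens) ** 2

# Assumed illustrative sizes (hypothetical, chosen only for the arithmetic).
SEGMENT_LEN, MEMORY_TOKENS = 512, 16

for n in (128_000, 11_000_000):
    ratio = full_attention_pairs(n) / segmented_attention_pairs(n, SEGMENT_LEN, MEMORY_TOKENS)
    print(f"{n:>12,} tokens: full attention needs about {ratio:,.0f}x more pair interactions")
```

Doubling the input doubles the segmented count but quadruples the full-attention count, which is why the gap widens as documents grow.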

Large LLMs' accuracy falls as context noise grows.

Numbers: GPT‑4 accuracy drops across tasks when context grows to 128k; fails on ~75% of the available window

Practical Use: Don't rely on off‑the‑shelf LLM context windows alone for high‑noise long‑document QA; they mainly use the first ~25% of the input.

Evidence Ref: Fig.3, Sec.3, Appendix C

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Max input length processed | ≈11,000,000 tokens | GPT‑4 tested up to 128k tokens | >> baseline | BABILong (needle-in-a-haystack) | Paper reports RMT/RMT-R inference up to 11M tokens | Abstract, Sec.5
Accuracy | Accuracy declines as context → 128k; GPT‑4 fails on most cases when facts are far | No-noise accuracy near 100% for some tasks | Drop to low accuracy across tasks as noise increases (Fig.3) | BABILong qa1–qa5 | Fig.3 and Sec.3 | Fig.3

What To Try In 7 Days

Run BABILong‑style stress tests on your document QA pipeline to measure sensitivity to noisy context.

Prototype RMT on a small GPT‑2 backbone for a narrow retrieval task using curriculum training.

Compare sentence vs fixed‑token chunking in your RAG pipeline; watch for temporal order sensitivity.
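
A stress test of the kind suggested in the first item could look like the sketch below. It places one fact at several depths inside filler of growing length and records accuracy per length. `answer_fn` is a placeholder for your own QA pipeline; `toy_answer_fn` is a deliberately weak stand-in that only reads the first 100 words, used here so the example runs on its own. All names and data are hypothetical.

```python
def build_context(needle, filler, n_words, depth):
    """Place `needle` at relative position `depth` (0.0 to 1.0) in n_words of filler."""
    filler_words = filler.split()
    words = (filler_words * (n_words // len(filler_words) + 1))[:n_words]
    cut = int(depth * n_words)
    return " ".join(words[:cut] + needle.split() + words[cut:])

def stress_test(answer_fn, needle, question, expected, filler, lengths, depths):
    """Return {context_length: accuracy}, averaged over needle depths."""
    results = {}
    for n_words in lengths:
        hits = 0
        for depth in depths:
            context = build_context(needle, filler, n_words, depth)
            hits += int(expected in answer_fn(context, question))
        results[n_words] = hits / len(depths)
    return results

def toy_answer_fn(context, question):
    """Stand-in for a real QA pipeline: it only 'reads' the first 100 words."""
    window = context.split()[:100]
    return "garden" if "garden." in window else "unknown"

report = stress_test(
    toy_answer_fn,
    needle="Mary travelled to the garden.",
    question="Where is Mary?",
    expected="garden",
    filler="The quick brown fox jumps over the lazy dog.",
    lengths=[50, 1000],
    depths=[0.0, 0.5, 1.0],
)
```

Swapping `toy_answer_fn` for a call into a real pipeline gives a per-length accuracy table; a drop at longer lengths points to the noise sensitivity described in the findings.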

Agent Features

Memory: recurrent memory tokens (fixed-size per segment); self-retrieval of past memory states; concatenated read/write memory tokens
Tool Use: FAISS; LangChain; OpenAI fine-tune API
Architectures: Recurrent Memory Transformer (RMT); RMT with self-retrieval (RMT-R)
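
The data flow behind these two architectures can be sketched as a toy, assuming a memory vector that is carried from segment to segment and, for the retrieval variant, a softmax-weighted lookup over all earlier memory states. `process_segment` is a hypothetical stand-in for the transformer call and its mixing weights are arbitrary; nothing here reproduces the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve(query, past_states):
    """Self-retrieval: softmax-weighted average of past memory states."""
    if not past_states:
        return [0.0] * len(query)
    scores = [sum(q * k for q, k in zip(query, state)) for state in past_states]
    weights = softmax(scores)
    return [sum(w * state[i] for w, state in zip(weights, past_states))
            for i in range(len(query))]

def process_segment(segment, memory, retrieved):
    """Toy stand-in for the transformer: mixes segment features into the memory."""
    seg_mean = [sum(col) / len(segment) for col in zip(*segment)]
    return [0.5 * m + 0.4 * s + 0.1 * r for m, s, r in zip(memory, seg_mean, retrieved)]

def run_recurrent_memory(segments, dim):
    """Read segments one at a time, carrying memory forward and storing each state."""
    memory, past_states = [0.0] * dim, []
    for segment in segments:
        retrieved = retrieve(memory, past_states)
        memory = process_segment(segment, memory, retrieved)
        past_states.append(memory)  # one stored state per segment
    return memory, past_states

# Three toy segments of 2-dimensional token features.
segments = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]], [[0.0, 0.0], [2.0, 2.0]]]
final_memory, states = run_recurrent_memory(segments, dim=2)
```

Each segment costs the same amount of work regardless of how many came before, apart from the retrieval step, whose stored states grow by one per segment.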

Optimization Features

Token Efficiency: linear compute and memory scaling with number of segments
Model Optimization: memory tokens to compress past segments (m ≪ L)
System Optimization: uses A100 80GB GPUs; fits RMT-R up to 10M tokens without hitting GPU memory limits in experiments
Training Optimization: curriculum training from 1 to 32 segments; AdamW with linear LR scheduling and warmup
Inference Optimization: segment-by-segment recurrent processing gives linear compute scaling
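
The curriculum listed under Training Optimization can be sketched as a loop over increasing segment counts. The summary only gives the range (1 to 32 segments), so the doubling schedule, the accuracy threshold, and the epoch cap below are assumptions, and `train_stage` and `evaluate` are hypothetical callbacks you would supply.

```python
def curriculum_stages(start=1, end=32):
    """Yield segment counts from start to end, doubling each stage (assumed schedule)."""
    n = start
    while n < end:
        yield n
        n *= 2
    yield end

def train_with_curriculum(train_stage, evaluate, threshold=0.9, max_epochs=10):
    """Move to longer inputs once accuracy at the current length passes threshold."""
    history = []
    for n_segments in curriculum_stages():
        for epoch in range(1, max_epochs + 1):
            train_stage(n_segments)
            if evaluate(n_segments) >= threshold:
                break
        history.append((n_segments, epoch))
    return history

# Dummy callbacks so the sketch runs: accuracy rises by 0.25 per epoch at each stage.
progress = {}

def fake_train(n_segments):
    progress[n_segments] = progress.get(n_segments, 0) + 1

def fake_eval(n_segments):
    return 0.5 + 0.25 * progress[n_segments]

history = train_with_curriculum(fake_train, fake_eval)
```

Starting on short inputs lets the model learn to write useful memory before it has to carry that memory across many segments.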

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Background text limited to PG19 and Wikipedia embeddings; other corpora may change difficulty.

RAG component not heavily optimized; prompts and retriever tuning were minimal.

When Not To Use

When you need low-latency parallel inference across many requests.

If storage for past memory states is constrained and you cannot afford linear growth.
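
To judge whether that linear growth is a problem on your hardware, a back-of-the-envelope estimate helps. The sketch assumes one stored memory state per segment, with 512-token segments, 16 memory tokens, a hidden size of 768, and 2-byte values. These sizes are assumptions for illustration and are not taken from this summary.

```python
def stored_state_bytes(n_tokens, segment_len, memory_tokens, hidden_size, bytes_per_value):
    """Bytes needed to keep one memory state per processed segment."""
    n_segments = -(-n_tokens // segment_len)  # ceiling division
    return n_segments * memory_tokens * hidden_size * bytes_per_value

# Assumed sizes: 512-token segments, 16 memory tokens, hidden size 768, fp16 values.
for n in (1_000_000, 10_000_000):
    gib = stored_state_bytes(n, 512, 16, 768, 2) / 2**30
    print(f"{n:>11,} tokens -> {gib:.2f} GiB of stored memory states")
```

The estimate scales in direct proportion to input length, so a tenfold longer document needs roughly ten times the storage.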

Failure Modes

RAG misses temporally dependent supporting facts when retrieval ignores order.

RMT-R can hit memory limits if all past states are kept for extremely long sequences in constrained hardware.
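
One way to reduce the first failure mode is to restore document order after retrieval, so that later facts still follow earlier ones in the prompt. The sketch below uses a toy word-overlap score in place of an embedding index; the function names are hypothetical and this is not the RAG setup evaluated in the paper.

```python
def split_sentences(text):
    """Naive sentence chunking on periods (an assumption for this toy)."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def overlap_score(query, chunk):
    """Toy relevance score: number of shared lowercase words."""
    q = set(query.lower().replace("?", "").split())
    c = set(chunk.lower().replace(".", "").split())
    return len(q & c)

def retrieve_in_order(query, chunks, k):
    """Pick the top-k chunks by score, then restore their original document order."""
    ranked = sorted(range(len(chunks)),
                    key=lambda i: overlap_score(query, chunks[i]),
                    reverse=True)
    return [chunks[i] for i in sorted(ranked[:k])]

text = "Mary went to the kitchen. The sky was grey. Mary travelled to the garden. Birds sang."
chunks = split_sentences(text)
context = retrieve_in_order("Where is Mary?", chunks, k=2)
```

With order restored, a reader model sees "kitchen" before "garden" and can resolve the latest location; ranking by score alone gives no such guarantee.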

Core Entities

Models

Recurrent Memory Transformer (RMT); RMT-R (RMT with self-retrieval); GPT-2 (137M backbone); GPT-4‑Turbo; Mistral (medium); GPT-3.5 (fine-tuned); text-embedding-ada-002

Metrics

Accuracy; top-5 retrieval recall; processing time (minutes per 1000 samples)

Datasets

BABILong (this paper); bAbI (algorithmic tasks); PG19 (background books); Wikipedia embeddings (Supabase/wikipedia-en-embeddings)

Benchmarks

BABILong; LongBench (comparison context); Long Range Arena (LRA) (comparison)