A small GPT‑2 with recurrent memory reads 11 million tokens and finds facts big LLMs miss

Overview

Decision Snapshot: Needs Validation

The paper provides clear experimental evidence on synthetic long‑context tasks and a public benchmark; the core claim (RMT handles millions of tokens) is supported, but real‑world domain shifts and parallelism/latency tradeoffs remain to be tested.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

Links

Abstract / PDF / Code / Data

Why It Matters For Business

When you must locate rare facts across very long documents, memory‑augmented models scale better and are cheaper than relying on huge LLM windows or naive RAG, so consider memory models for long‑document search and auditing.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Data Scientist

Summary TLDR

The authors introduce BABILong, a 'needle-in-a-haystack' benchmark that hides simple facts inside millions of tokens of book text. Off‑the‑shelf LLMs (GPT‑4, Mistral) and standard RAG struggle as context noise grows. Augmenting a small GPT‑2 (137M) with recurrent memory and trainable self‑retrieval (RMT / RMT‑R) and curriculum fine‑tuning lets it reliably retrieve and reason about facts up to ~11 million tokens — a new scaling record on this task.

Problem Statement

Modern transformers struggle to find and use a few task-relevant facts buried inside very long, noisy documents: self-attention cost grows quadratically with input length, and attention alone loses focus as irrelevant text grows.

Main Contribution

BABILong: a benchmark that embeds bAbI-style tasks inside arbitrarily long book text to test long-context fact retrieval and reasoning.

Systematic evaluation showing GPT‑4, Mistral, and RAG degrade as distracting text increases, especially past tens of thousands of tokens.
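The benchmark construction described above can be sketched as follows. This is a minimal illustration, not the authors' generator: the function name `build_haystack_sample`, its parameters, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for the example.

```python
import random

def build_haystack_sample(facts, question, answer, filler_sentences,
                          target_words, seed=0):
    """Scatter task facts, in their original order, among filler sentences.

    Hypothetical helper: words approximate tokens, and filler is sampled
    with replacement from `filler_sentences`.
    """
    rng = random.Random(seed)
    # Draw filler sentences until the word budget is reached.
    filler, n_words = [], 0
    while n_words < target_words:
        sentence = rng.choice(filler_sentences)
        filler.append(sentence)
        n_words += len(sentence.split())
    # Sorted insertion points keep the facts in their relative order,
    # which matters for tasks whose answer depends on event sequence.
    positions = sorted(rng.randint(0, len(filler)) for _ in facts)
    pending = iter(zip(positions, facts))
    next_pos, next_fact = next(pending, (None, None))
    context = []
    for i in range(len(filler) + 1):
        while next_pos == i:
            context.append(next_fact)
            next_pos, next_fact = next(pending, (None, None))
        if i < len(filler):
            context.append(filler[i])
    return {"context": " ".join(context), "question": question, "answer": answer}
```

Raising `target_words` while holding the facts fixed gives a series of samples that differ only in the amount of distracting text, which is the variable the benchmark is designed to isolate.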

Key Findings

Recurrent memory model processes record-length inputs.

Numbers: Processed up to ~11,000,000 tokens (paper claims 11M)

Practical Use: If you need to scan extremely long documents for a few facts, train a recurrent‑memory model; it scales linearly and can handle millions of tokens.

Evidence Ref: Abstract, Conclusions, Sec.5

Large LLMs' accuracy falls as context noise grows.

Numbers: GPT‑4 accuracy drops across tasks as context grows to 128k tokens; it fails when the facts sit outside roughly the first 25% of the input, that is, across ~75% of the available window

Practical Use: Don't rely on off‑the‑shelf LLM context windows alone for high‑noise long‑document QA; they mainly use the first ~25% of the input.

Evidence Ref: Fig.3, Sec.3, Appendix C

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Max input length processed	≈11,000,000 tokens	GPT‑4 tested up to 128k tokens	>> baseline	BABILong (needle-in-a-haystack)	Paper reports RMT/RMT-R inference up to 11M tokens	Abstract, Sec.5
Accuracy	Accuracy declines as context → 128k; GPT‑4 fails on most cases when facts are far	No-noise accuracy near 100% for some tasks	Drop to low accuracy across tasks as noise increases (Fig.3)	BABILong qa1–qa5	Fig.3 and Sec.3	Fig.3

What To Try In 7 Days

Run BABILong‑style stress tests on your document QA pipeline to measure sensitivity to noisy context.

Prototype RMT on a small GPT‑2 backbone for a narrow retrieval task using curriculum training.

Compare sentence vs fixed‑token chunking in your RAG pipeline; watch for temporal order sensitivity.
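The chunking comparison in the last item can be prototyped without any retrieval library. The sketch below is an assumption-laden stand-in: word overlap replaces embedding similarity, whitespace-separated words replace tokens, and the function names are hypothetical. It keeps each chunk's position so that retrieved chunks can be returned in document order, which is the temporal-order issue the item warns about.

```python
import re

def sentence_chunks(text):
    """One chunk per sentence, tagged with its position in the document."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(i, p) for i, p in enumerate(parts) if p]

def fixed_chunks(text, size):
    """Windows of `size` words (a stand-in for tokens), tagged with position."""
    words = text.split()
    return [(i // size, " ".join(words[i:i + size]))
            for i in range(0, len(words), size)]

def retrieve_in_order(chunks, query_words, k):
    """Score chunks by word overlap, keep the top k, then restore document order.

    Word overlap is a toy replacement for embedding similarity.
    """
    def score(chunk):
        words = set(re.findall(r"\w+", chunk[1].lower()))
        return len(words & query_words)
    top = sorted(chunks, key=score, reverse=True)[:k]
    return sorted(top, key=lambda c: c[0])

text = ("Mary went to the kitchen. The sky was grey. "
        "Mary went to the garden. Birds sang.")
by_sentence = retrieve_in_order(sentence_chunks(text), {"mary", "went"}, k=2)
by_window = fixed_chunks(text, 4)  # fixed windows can cut a fact in half
```

Here the fixed four-word windows split "Mary went to the kitchen." across two chunks, while sentence chunking keeps each fact intact; sorting the retrieved chunks by position keeps "kitchen" before "garden" for questions that depend on which event came last.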

Agent Features

Memory

recurrent memory tokens (fixed-size per segment)

self-retrieval of past memory states

concatenated read/write memory tokens

Tool Use

FAISS

LangChain

OpenAI fine-tune API

Architectures

Recurrent Memory Transformer (RMT)

RMT with self-retrieval (RMT-R)

Optimization Features

Token Efficiency

linear compute and memory scaling with number of segments

Model Optimization

memory tokens to compress past segments (m ≪ L)
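A back-of-the-envelope comparison shows why a small memory (m ≪ L) matters. The sketch below counts attention pairs only and assumes each segment is processed together with m read and m write memory tokens, so its effective length is L + 2m; that layout, the function names, and the example sizes (L = 512, m = 16) are illustrative assumptions rather than figures from the paper.

```python
import math

def full_attention_pairs(n_tokens):
    """Token pairs scored by full self-attention over the whole input."""
    return n_tokens ** 2

def segmented_attention_pairs(n_tokens, segment_len, n_memory):
    """Token pairs scored when the input is processed segment by segment.

    Assumes each segment carries n_memory read and n_memory write tokens.
    """
    n_segments = math.ceil(n_tokens / segment_len)
    per_segment = (segment_len + 2 * n_memory) ** 2
    return n_segments * per_segment

n = 512 * 2000  # about one million tokens, chosen as a multiple of L
ratio = full_attention_pairs(n) / segmented_attention_pairs(n, 512, 16)
```

Doubling the input doubles the segmented count but quadruples the full-attention count, so the gap between the two widens as the document grows.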

System Optimization

Experiments ran on A100 80GB GPUs; RMT-R handled inputs up to 10M tokens without running out of GPU memory.

Training Optimization

curriculum training from 1 to 32 segments

AdamW with linear LR scheduling and warmup
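One way to picture the training setup is a staged schedule plus a warmup-then-decay learning rate. The summary only gives the endpoints (1 to 32 segments) and says the schedule is linear with warmup; the doubling stages, the decay to zero, and both function names below are assumptions for illustration.

```python
def curriculum_stages(max_segments=32):
    """Segment counts per stage: 1, 2, 4, ... (doubling is an assumption)."""
    stages, n = [], 1
    while n <= max_segments:
        stages.append(n)
        n *= 2
    return stages

def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)

# Each stage would fine-tune on inputs of that many segments before moving on.
for n_segments in curriculum_stages():
    lr_at_start = linear_warmup_decay(0, 1000, 100, 1e-4)
```

Starting from single-segment inputs lets the model learn the task before it has to learn to carry facts across segment boundaries.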

Inference Optimization

segment-by-segment recurrent processing gives linear compute scaling
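The control flow behind segment-by-segment processing is a simple loop that carries a fixed-size state. In the sketch below, `toy_step` is a deliberately trivial stand-in for the transformer forward pass (it just remembers upper-case tokens), and all names are hypothetical; only the loop structure reflects the idea described above.

```python
def split_segments(tokens, segment_len):
    """Cut the input into consecutive segments of at most segment_len tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]

def run_recurrent(tokens, segment_len, step_fn, memory):
    """Feed segments one at a time; only `memory` is carried between steps."""
    for segment in split_segments(tokens, segment_len):
        memory = step_fn(memory, segment)
    return memory

def toy_step(memory, segment, memory_size=4):
    """Toy stand-in for the model: keep the most recent upper-case tokens."""
    kept = memory + [t for t in segment if t.isupper()]
    return kept[-memory_size:]

tokens = ["a", "B", "c", "d", "E", "f", "G", "h", "I", "j", "K"]
final_memory = run_recurrent(tokens, 3, toy_step, [])
```

Because each step sees one segment plus a bounded memory, work per step is constant and total work grows with the number of segments; the same sequential dependency is why the loop cannot be parallelised across segments.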

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/booydar/babilong

Data URLs

PG19 dataset (used as background)

https://huggingface.co/datasets/Supabase/wikipedia-en-embeddings (wiki embeddings used)

Risks & Boundaries

Limitations

Background text limited to PG19 and Wikipedia embeddings; other corpora may change difficulty.

RAG component not heavily optimized; prompts and retriever tuning were minimal.

When Not To Use

When you need low-latency parallel inference across many requests.

If storage for past memory states is constrained and you cannot afford linear growth.

Failure Modes

RAG misses temporally dependent supporting facts when retrieval ignores order.

RMT-R can hit memory limits if all past states are kept for extremely long sequences in constrained hardware.
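The storage concern for RMT-R can be estimated with simple arithmetic: if every segment's memory state is kept for later retrieval, storage grows linearly with the number of segments. The sizes used below (512-token segments, 16 memory tokens, hidden size 768, 4-byte floats) are assumptions for illustration, not values reported in this summary, and the function name is hypothetical.

```python
import math

def stored_state_bytes(n_tokens, segment_len, n_memory, d_model,
                       bytes_per_value=4):
    """Bytes needed to keep one memory state per segment.

    Assumes each state is n_memory vectors of size d_model.
    """
    n_segments = math.ceil(n_tokens / segment_len)
    return n_segments * n_memory * d_model * bytes_per_value

# Illustrative sizes only; substitute your own model's configuration.
bytes_for_10m = stored_state_bytes(10_000_000, 512, 16, 768)
gib_for_10m = bytes_for_10m / 2**30
```

Running the same function with your own segment length and memory size gives a quick check of whether keeping all past states fits the target hardware, or whether older states need to be dropped or offloaded.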

Core Entities

Models

Recurrent Memory Transformer (RMT)

RMT-R (RMT with self-retrieval)

GPT-2 (137M backbone)

GPT-4‑Turbo

Mistral (medium)

GPT-3.5 (fine-tuned)

text-embedding-ada-002

Metrics

Accuracy

top-5 retrieval recall

processing time (minutes per 1000 samples)

Datasets

BABILong (this paper)

bAbI (algorithmic tasks)

PG19 (background books)

Wikipedia embeddings (Supabase/wikipedia-en-embeddings)

Benchmarks

BABILong

LongBench (comparison context)

Long Range Arena (LRA) (comparison)

