A small GPT‑2 with recurrent memory reads 11 million tokens and finds facts big LLMs miss

February 16, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

7

Authors

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

Why It Matters For Business

When you must locate rare facts across very long documents, memory‑augmented models scale better and cost less than relying on huge LLM context windows or naive RAG. Consider them for long‑document search and auditing.

Summary TLDR

The authors introduce BABILong, a 'needle-in-a-haystack' benchmark that hides simple facts inside millions of tokens of book text. Off‑the‑shelf LLMs (GPT‑4, Mistral) and standard RAG struggle as context noise grows. A small GPT‑2 (137M), augmented with recurrent memory and trainable self‑retrieval (RMT / RMT‑R) and fine‑tuned with a curriculum, reliably retrieves and reasons about facts in inputs of up to ~11 million tokens, a new scaling record on this task.

Problem Statement

Modern transformers struggle to find and use a few task-relevant facts buried inside very long, noisy documents: the cost of self‑attention grows quadratically with input length, and attention alone loses focus as the amount of irrelevant text grows.

Main Contribution

BABILong: a benchmark that embeds bAbI-style tasks inside arbitrarily long book text to test long-context fact retrieval and reasoning.

Systematic evaluation showing GPT‑4, Mistral, and RAG degrade as distracting text increases, especially past tens of thousands of tokens.

Extension of Recurrent Memory Transformer with self‑retrieval (RMT‑R) and demonstration that a fine‑tuned GPT‑2 (137M) with this memory reads up to ~11M tokens and outperforms much larger LLMs on these tasks.
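The benchmark construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' generator: the function name `build_sample`, the whitespace token count, and the random slot selection are all assumptions made for the example. The one property it tries to preserve is that the task facts keep their original order inside the background text.

```python
import random

def build_sample(facts, question, answer, background_sentences, target_tokens, seed=0):
    """Scatter task facts, in their original order, among background sentences.

    Hypothetical helper: token counts use whitespace splitting, and the
    facts themselves are not counted against the budget.
    """
    rng = random.Random(seed)
    # Take background sentences until the token budget is reached.
    haystack, used = [], 0
    for sent in background_sentences:
        n = len(sent.split())
        if used + n > target_tokens:
            break
        haystack.append(sent)
        used += n
    # Pick one insertion slot per fact; sorting keeps the facts in order.
    slots = sorted(rng.randint(0, len(haystack)) for _ in facts)
    pieces, prev = [], 0
    for slot, fact in zip(slots, facts):
        pieces.extend(haystack[prev:slot])
        pieces.append(fact)
        prev = slot
    pieces.extend(haystack[prev:])
    return {"context": " ".join(pieces), "question": question, "answer": answer}
```

Raising `target_tokens` while keeping the same facts and question gives progressively noisier versions of one task, which is the axis the benchmark varies.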

Key Findings

Recurrent memory model processes record-length inputs.

Numbers: Processed up to ~11,000,000 tokens (paper claims 11M)

Large LLMs' accuracy falls as context noise grows.

Numbers: GPT‑4 accuracy drops across tasks as context grows to 128k tokens; it fails over roughly 75% of the available window

Standard RAG retrieval is weak on temporally distributed facts.

Numbers: Sentence-chunk retrieval degrades noticeably at 10M tokens; 512-token chunking performs worse earlier

Small model with memory generalizes far beyond its training length.

Numbers: RMT trained on ~16k tokens (32 segments) retains quality up to 128k tokens and outperforms LLM+RAG at 1M–10M tokens

Processing time scales linearly with context length.

Numbers: 1000 samples: 4K→4 min; 32K→30 min; 128K→80 min; 1M→315 min on one A100 80GB
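A quick way to sanity-check the scaling claim is to convert the reported timings into minutes per 1K tokens. Under strictly linear scaling that rate would stay constant. The sketch below assumes "K" and "M" mean 1,000 and 1,000,000 tokens, which the summary does not state.

```python
# Reported timings: context length in tokens -> minutes per 1000 samples.
# Assumption: 4K = 4,000 tokens, 1M = 1,000,000 tokens.
timings = {4_000: 4, 32_000: 30, 128_000: 80, 1_000_000: 315}

def minutes_per_1k_tokens(timings):
    """Normalise each timing by its context length (in thousands of tokens)."""
    return {n: minutes / (n / 1000) for n, minutes in timings.items()}

rates = minutes_per_1k_tokens(timings)
for n, rate in rates.items():
    print(f"{n:>9} tokens: {rate:.3f} min per 1K tokens")
```

On these four points the per-token rate does not increase with length, so the reported times grow no faster than linearly in context length.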

Results

Max input length processed

Value: ≈11,000,000 tokens

Baseline: GPT‑4 tested up to 128k tokens

Accuracy

Value: Accuracy declines as context grows to 128k tokens; GPT‑4 fails in most cases where the supporting facts are far away in the context

Baseline: No-noise accuracy near 100% for some tasks

Retrieval top-5 recall

Value: High for sentence chunks until ~10M tokens, then drops; 512-token chunks worse earlier

Baseline: Sentence chunks vs 512-token chunks

Processing time scaling

Value: 4K→4 min; 32K→30 min; 128K→80 min; 1M→315 min per 1000 samples (A100 80GB)

Who Should Care

What To Try In 7 Days

Run BABILong‑style stress tests on your document QA pipeline to measure sensitivity to noisy context.

Prototype RMT on a small GPT‑2 backbone for a narrow retrieval task using curriculum training.

Compare sentence vs fixed‑token chunking in your RAG pipeline; watch for temporal order sensitivity.
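For the chunking comparison, a starting point might look like the sketch below. It is only an illustration: the regex sentence splitter, the whitespace tokenisation, and the helper names are assumptions, and the relevance scores are expected to come from whatever retriever you already use. The last helper re-sorts the retrieved chunks by document position, which is one simple way to keep temporal order visible to the reader model.

```python
import re

def sentence_chunks(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def fixed_token_chunks(text, size=512):
    """Split into runs of `size` whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def retrieve_in_order(chunks, scores, k=5):
    """Return the k best-scoring chunks, re-sorted by position in the document.

    `scores[i]` is the relevance of `chunks[i]`, supplied by the caller.
    """
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in sorted(top)]
```

Running the same questions through both chunkers, with and without the position re-sort, gives a small grid of results that shows which choice your pipeline is most sensitive to.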

Agent Features

Memory

  • recurrent memory tokens (fixed-size per segment)
  • self-retrieval of past memory states
  • concatenated read/write memory tokens
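The control flow behind these memory features can be sketched without any model code. In the toy version below, `toy_step` is a hand-written stand-in for the transformer and is purely hypothetical: the real model learns what to write into its memory tokens, whereas this stub just keeps the most recent marked tokens. What the sketch does show is that only a fixed-size memory is carried from one segment to the next, and that past memory states can be kept for later retrieval.

```python
def split_segments(tokens, segment_len):
    """Cut the input into consecutive segments of at most `segment_len` tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]

def run_recurrent(tokens, segment_len, memory, step):
    """Process segments in order; only `memory` carries information forward."""
    history = []
    for segment in split_segments(tokens, segment_len):
        memory = step(memory, segment)  # read old memory, write new memory
        history.append(list(memory))    # keep past states for later retrieval
    return memory, history

def toy_step(memory, segment, size=4):
    """Hypothetical stand-in for the model: keep the last `size` marked tokens."""
    kept = memory + [t for t in segment if t.startswith("FACT:")]
    return kept[-size:]
```

For example, `run_recurrent(tokens, 8, [], toy_step)` walks a token list eight tokens at a time and returns the final memory together with one stored state per segment.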

Tool Use

  • FAISS
  • LangChain
  • OpenAI fine-tune API

Architectures

  • Recurrent Memory Transformer (RMT)
  • RMT with self-retrieval (RMT-R)
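Self-retrieval can be pictured as attention from the current memory state over all stored past memory states. The sketch below is a simplified, single-head version with no learned projections, written in plain Python for readability; it is an assumption-laden illustration of the idea, not the RMT-R implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_retrieve(query, past_states):
    """Softmax-weighted average of past memory vectors, scored by dot product.

    `query` is the current memory vector; `past_states` is a non-empty list
    of earlier memory vectors of the same dimension.
    """
    scores = [dot(query, s) for s in past_states]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    mixed = [sum(w * s[d] for w, s in zip(weights, past_states))
             for d in range(len(query))]
    return mixed, weights
```

Because every past state is scored at each step, the stored states and the cost of this lookup both grow with the number of segments, which is the storage trade-off noted under Limitations.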

Optimization Features

Token Efficiency

  • linear compute and memory scaling with number of segments
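To see why segment-wise processing scales linearly, it helps to count attention pairs. The sketch below compares full attention over N tokens with per-segment attention. The segment length of 512 follows from the reported 32 segments for ~16k tokens; the memory size of 16 and the assumption that each segment attends over its tokens plus read and write memory (L + 2m) are illustrative choices, not figures taken from this summary.

```python
import math

def full_attention_pairs(n_tokens):
    """Query-key pairs for one full self-attention pass over the whole input."""
    return n_tokens ** 2

def segmented_pairs(n_tokens, segment_len=512, memory_tokens=16):
    """Query-key pairs when attention is confined to one segment at a time.

    Assumes each segment sees its own tokens plus read and write memory tokens.
    """
    n_segments = math.ceil(n_tokens / segment_len)
    per_segment = (segment_len + 2 * memory_tokens) ** 2
    return n_segments * per_segment
```

Doubling the input doubles `segmented_pairs` but quadruples `full_attention_pairs`, so the gap widens quickly at million-token lengths.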

Model Optimization

  • memory tokens to compress past segments (m ≪ L)

System Optimization

  • uses A100 80GB GPUs; RMT-R fit inputs of up to 10M tokens without hitting GPU memory limits in the experiments

Training Optimization

  • curriculum training from 1 to 32 segments
  • AdamW with linear LR scheduling and warmup
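The two training ingredients above can be sketched as plain functions. The stage list `(1, 2, 4, 8, 16, 32)` is an assumed doubling schedule between the stated endpoints of 1 and 32 segments, and the warmup and step counts are placeholders; none of these values are taken from the paper.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to `peak_lr`, then linear decay to zero at `total_steps`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)

def curriculum(stages=(1, 2, 4, 8, 16, 32), segment_len=512):
    """Yield (n_segments, max_tokens) per stage, shortest inputs first.

    The stage list is an assumption; only the 1 and 32 endpoints are given.
    """
    for n in stages:
        yield n, n * segment_len
```

A training loop would iterate over `curriculum()`, train on inputs up to `max_tokens` at each stage, and move to the next stage once the current one is solved, with the optimizer's learning rate set from `linear_warmup_decay`.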

Inference Optimization

  • segment-by-segment recurrent processing gives linear compute scaling

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Background text limited to PG19 and Wikipedia embeddings; other corpora may change difficulty.
  • RAG component not heavily optimized; prompts and retriever tuning were minimal.
  • RMT‑R stores all past memory states, so space scales linearly with segments and may hit limits in other setups.
  • Recurrent processing reduces parallelism and can increase wall‑clock latency compared to full‑attention models.

When Not To Use

  • When you need low-latency parallel inference across many requests.
  • If storage for past memory states is constrained and you cannot afford linear growth.
  • For tasks highly sensitive to domain shift from book-like background text without extra tuning.

Failure Modes

  • RAG misses temporally dependent supporting facts when retrieval ignores order.
  • RMT-R can hit memory limits on constrained hardware if all past states are kept for extremely long sequences.
  • Sequential nature of recurrence increases latency for single long‑document queries.
  • Models trained on synthetic bAbI-style facts may fail on richer semantic real‑world facts.

Core Entities

Models

  • Recurrent Memory Transformer (RMT)
  • RMT-R (RMT with self-retrieval)
  • GPT-2 (137M backbone)
  • GPT-4‑Turbo
  • Mistral (medium)
  • GPT-3.5 (fine-tuned)
  • text-embedding-ada-002

Metrics

  • Accuracy
  • top-5 retrieval recall
  • processing time (minutes per 1000 samples)
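For the retrieval metric, one common definition of top-k recall is the fraction of relevant items that appear among the first k retrieved. The sketch below uses that definition as an assumption; the paper may compute its top-5 recall differently, for example per supporting fact or per question.

```python
def top_k_recall(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant items found among the first k retrieved ids."""
    if not relevant_ids:
        return 0.0
    top = set(retrieved_ids[:k])
    return sum(1 for r in relevant_ids if r in top) / len(relevant_ids)
```

With `retrieved_ids` ranked best first, averaging this value over all questions at each context length gives a curve comparable to the recall trend described in the Results.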

Datasets

  • BABILong (this paper)
  • bAbI (algorithmic tasks)
  • PG19 (background books)
  • Wikipedia embeddings (Supabase/wikipedia-en-embeddings)

Benchmarks

  • BABILong
  • LongBench (comparison context)
  • Long Range Arena (LRA) (comparison)