A tiny synthetic benchmark shows Transformers make sporadic, hard-to-fix memory errors

June 1, 2023 · 8 min

Overview

Production Readiness

0.35

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

8

Authors

Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang

Links

Abstract / PDF

Why It Matters For Business

Sporadic, rare reasoning failures in Transformers can surface as hard-to-detect errors in production; fixing them needs better data coverage or architecture changes, not only hyperparameter tuning.

Summary TLDR

The authors introduce FFLM, a small synthetic benchmark that tests whether sequence models can reliably copy a single bit across long spans. Transformers repeatedly show a long tail of sporadic "attention glitches" (random read errors) on rare sequences, while small LSTMs extrapolate perfectly. Remedies such as training on longer-tailed data and attention-sharpening reduce errors by orders of magnitude but do not fully remove them. The paper releases data and pinpoints architectural failure modes that matter for closed-domain hallucinations.

Problem Statement

Why do Transformer-based language models sometimes output deterministic-but-wrong results? The paper isolates one minimal memory task, copying a single bit across a sequence, and asks whether modern Transformers learn a robust, perfectly reliable retrieval operation or instead make sporadic "attention glitches".
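To make the task concrete, here is a minimal sketch of a flip-flop string sampler. It assumes a format this summary does not spell out: each step is an (instruction, bit) pair, where "w" writes a bit, "i" carries a distractor bit to ignore, and "r" must be followed by the most recently written bit. The function names, the even write/read split, and counting length in pairs (`num_pairs`) are assumptions of this sketch, not the authors' code.

```python
import random

def sample_flip_flop(num_pairs, p_ignore, rng):
    """Sample one flip-flop string as a list of (instruction, bit) pairs.

    Assumption: probability mass not spent on 'i' is split evenly
    between 'w' and 'r'. Requires num_pairs >= 2.
    """
    p_write = (1.0 - p_ignore) / 2.0
    memory = rng.randint(0, 1)
    pairs = [("w", memory)]              # start with a write so memory is defined
    for _ in range(num_pairs - 2):
        u = rng.random()
        if u < p_ignore:
            pairs.append(("i", rng.randint(0, 1)))   # distractor bit
        elif u < p_ignore + p_write:
            memory = rng.randint(0, 1)
            pairs.append(("w", memory))              # overwrite the stored bit
        else:
            pairs.append(("r", memory))              # read must echo the stored bit
    pairs.append(("r", memory))          # end with a read
    return pairs

def is_valid(pairs):
    """Check that every read echoes the most recently written bit."""
    memory = None
    for op, bit in pairs:
        if op == "w":
            memory = bit
        elif op == "r" and bit != memory:
            return False
    return True

rng = random.Random(0)
seq = sample_flip_flop(256, p_ignore=0.8, rng=rng)
print(" ".join(f"{op}{bit}" for op, bit in seq[:12]))
```

Raising `p_ignore` toward 0.98 makes writes and reads sparse, so each read depends on a write far back in the context; lowering it toward 0.1 makes them dense.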

Main Contribution

FFLM: a parametric synthetic benchmark (flip-flop language) that isolates one-bit memory retrieval over long contexts.

Empirical finding that Transformers show a long tail of sporadic read errors on FFLM, while small LSTMs achieve perfect extrapolation.

A large intervention study showing data diversity, scale, and attention-sharpening reduce but do not eliminate glitches.

Preliminary mechanistic analyses identifying attention dilution and fragile positional tiebreaking as failure modes.

Public release of the FFLM datasets for reproducibility.

Key Findings

Transformers exhibit a long, irregular tail of sporadic read errors (attention glitches) on FFLM.

Numbers: Observed across 10,625 Transformer runs; many nonzero o.o.d. glitch rates

Small recurrent models (LSTM) extrapolate perfectly on the same task.

Numbers: 1-layer LSTM got 0% o.o.d. error in 100/100 runs with 20× less data/steps

Training on long-tailed (rare) sequences nearly eliminates glitches.

Numbers: Mixture training (p_i={0.9,0.98,0.1}): 6/25 runs had 0 errors

Attention-sharpening regularizers and some dropout settings reduce glitch rates by orders of magnitude.

Numbers: Attention sharpening reduces errors roughly 10× on sparse sequences in best runs

No tested method fully removed glitches for Transformers across sparse and dense tails simultaneously.

Numbers: Extensive hyperparameter search (>10k runs) yielded nonzero glitch rates except with tail training or recurrence

Two mechanistic failure modes identified: attention dilution and brittle positional tiebreaking.

Numbers: Theory (Propositions 3 and 4) and empirical attention drift in long tests (Figures 14 and 16)

Results

o.o.d. read error (Transformer baseline)

Value: long tail of nonzero errors across seeds

Baseline: 19M-parameter 6-layer 8-head Transformer

o.o.d. read error (LSTM)

Value: 0% error

Baseline: 1-layer LSTM (133K params), 500 steps

error after training on mixed tail data

Value: substantially reduced; 6/25 runs reached 0 errors

Baseline: Baseline Transformer

effect of attention-sharpening regularizer

Value: ≈10× reduction on sparse sequences in best configs

Baseline: Unregularized Transformer

Who Should Care

What To Try In 7 Days

Run FFLM (public dataset) on your models to measure tail glitch rates.

Retrain or finetune with explicit rare-case (long-tail) examples from FFLM-like distributions.

Add attention-sparsity penalties and increase embedding dropout to reduce glitch frequency quickly.
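One plausible form of an attention-sparsity penalty is an entropy term on each attention row, added to the training loss. The sketch below uses plain Python to show the quantity being penalized; the entropy form, the `coeff` value, and the function names are assumptions for illustration, since this summary does not specify which regularizer the authors used.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy of one attention row; lower means sharper attention."""
    return -sum(w * math.log(w + eps) for w in weights)

def sharpening_penalty(attention_rows, coeff=0.01):
    """Mean row entropy scaled by a hypothetical coefficient.

    In training this term would be added to the language-modeling loss.
    """
    rows = list(attention_rows)
    return coeff * sum(attention_entropy(r) for r in rows) / len(rows)

diffuse = softmax([0.0] * 8)          # weight spread evenly over 8 positions
sharp = softmax([10.0] + [0.0] * 7)   # weight concentrated on one position
print(sharpening_penalty([diffuse]), sharpening_penalty([sharp]))
```

The diffuse row incurs the larger penalty, so minimizing this term pushes attention toward near-hard selection. As the Failure Modes section notes, such regularizers can also push a model into a wrong hard-attention solution, so the coefficient needs tuning.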

Optimization Features

Training Optimization

  • train on long-tailed mixture of sequences
  • attention-sharpening regularizers

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • FFLM is synthetic: conclusions about natural-language hallucinations are suggestive but not conclusive.
  • Mechanistic attribution to attention glitches in large pretrained LLMs is hypothesized but not empirically proven for those models.
  • Some mitigations were evaluated only at moderate model scale (19M baseline); behavior can change at larger scales.

When Not To Use

  • Do not use FFLM as the sole test for open-domain factuality or world-knowledge hallucinations.
  • Do not assume attention-sharpening fixes all failure modes across arbitrary downstream tasks.

Failure Modes

  • Attention dilution: softmax spreads weight across many positions as length increases.
  • Non-commutative tiebreaking: brittle positional margins cause confident wrong selections.
  • Optimization/local minima: sparsity regularizers can push models into wrong hard-attention solutions.
  • Data-coverage dependence: rare sequences cause failure unless included in training.
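Attention dilution can be illustrated with a small calculation. Suppose one target position has a logit that exceeds every other position's logit by a fixed margin; the softmax weight on the target is then exp(margin) / (exp(margin) + n - 1) for a context of n positions. This setup (one target, equal distractor logits, fixed margin) is a simplifying assumption of the sketch, not the paper's formal statement.

```python
import math

def target_attention_weight(margin, context_length):
    """Softmax weight on one target whose logit beats all others by `margin`.

    Assumes every other position shares the same logit.
    """
    boost = math.exp(margin)
    return boost / (boost + (context_length - 1))

for n in (16, 512, 8192):
    print(n, round(target_attention_weight(5.0, n), 4))
```

With the margin held fixed, the target's weight shrinks as the context grows, so a retrieval head that works at training length can lose focus on longer inputs.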

Core Entities

Models

  • Transformer (self-attention)
  • LSTM (recurrent)

Metrics

  • glitch rate (o.o.d. read errors)
  • Accuracy
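The glitch rate can be sketched as the fraction of read positions where the predicted bit differs from the true one. The code below assumes sequences are lists of (instruction, bit) pairs with "w", "r", and "i" instructions, and uses `predict_bit` as a hypothetical stand-in for a model's prediction at a read position; neither detail comes from the paper's released code.

```python
def glitch_rate(sequences, predict_bit):
    """Fraction of read positions predicted incorrectly.

    sequences:   iterable of lists of (instruction, bit) pairs
    predict_bit: hypothetical callable mapping the prefix before a read
                 to a predicted bit (0 or 1)
    """
    errors = reads = 0
    for pairs in sequences:
        for t, (op, bit) in enumerate(pairs):
            if op != "r":
                continue
            reads += 1
            if predict_bit(pairs[:t]) != bit:
                errors += 1
    return errors / reads if reads else 0.0

def last_written(prefix):
    """Reference predictor: return the most recently written bit."""
    for op, bit in reversed(prefix):
        if op == "w":
            return bit
    return 0

def last_seen(prefix):
    """Distracted predictor: return the latest bit, even from an ignore step."""
    return prefix[-1][1] if prefix else 0

data = [
    [("w", 1), ("i", 0), ("r", 1), ("w", 0), ("i", 1), ("i", 1), ("r", 0)],
    [("w", 1), ("r", 1)],
]
print(glitch_rate(data, last_written), glitch_rate(data, last_seen))
```

The reference predictor makes no errors, while the distracted one fails whenever an ignore step sits between the last write and the read, which is the kind of error the benchmark is designed to expose.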

Datasets

  • FFL (flip-flop) synthetic dataset (T=512, p_i=0.8)
  • O.O.D. tails: FFL(p_i=0.98 sparse), FFL(p_i=0.1 dense)

Benchmarks

  • FFLM (flip-flop language modeling)

Context Entities

Models

  • GPT-2 family, GPT-4, Pythia-12B, GPT-NeoX-20B (few-shot FFLM tests)

Metrics

  • Accuracy
  • o.o.d. tail error amplification

Datasets

  • Public FFLM release: https://huggingface.co/datasets/synthseq/flipflop

Benchmarks

  • Long Range Arena (comparison mentioned)
  • BIG-Bench (related tests)