Use a CPU suffix-automaton to recall distant tokens and let windowed attention match global-attention quality

January 14, 20268 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Yunao Zheng, Xiaojie Wang, Lei Ren, Wei Chen

Links

Abstract / PDF

Why It Matters For Business

ROSA lets teams support very long inputs (documents, multi-turn chat history) with near-global-attention accuracy while avoiding large GPU memory and compute increases, lowering inference cost for long-context applications.

Summary TLDR

ROSA-Tuning adds a cheap CPU-side retrieval module (ROSA: suffix automata over binarized hidden bits) that finds relevant past positions and injects their values into a pretrained model. On Qwen3-Base-1.7B, this restores most long-context ability of windowed-attention models (close to global attention on LongBench and MQAR) while keeping GPU compute and memory similar to sliding-window attention by running retrieval on CPU and using an async CPU–GPU pipeline.

Problem Statement

Windowed and other efficient attention schemes cut GPU cost but often miss important past tokens because the model state does not cover all relevant history. Global attention recovers those tokens but costs O(T^2) compute and high GPU memory. The paper seeks a low-cost way to recall specific historical positions and feed them into the model without expanding GPU attention state.

Main Contribution

ROSA-Tuning: a retrieval-and-recall pathway using many small suffix automata (SAMs) over binarized per-route symbols to locate relevant past positions and inject their values into the model.

A binary discretization and counterfactual-gradient training method to backpropagate through discrete retrieval decisions.

An asynchronous CPU–GPU execution design that runs retrieval on CPU in parallel with GPU attention to hide most extra latency.

Systematic evaluation on Qwen3-Base-1.7B showing near-global-attention accuracy on long-context benchmarks while keeping GPU memory and attention complexity like windowed attention.

Key Findings

ROSA largely recovers long-context accuracy lost by windowed attention.

NumbersLongBench AVG: Global 59.21; Window 29.41; Window+ROSA 57.14

General capabilities remain effectively unchanged after ROSA-Tuning.

Numberslm-eval AVG: Global 0.71; Window+ROSA 0.705

CPU retrieval overhead is small and can be hidden under attention compute.

NumbersFlashAttention compute cost was 9.39x–18.23x higher than a single SAM (Figure 2)

ROSA greatly speeds up learning of match-based retrieval tasks.

NumbersMQAR: ROSA+Window-Attn reached ~100% validation accuracy by epoch 5 while Global-Attn reached 61.2 by epoch 7 and Window

Results

LongBench AVG (higher is better)

Value57.14 (Window+ROSA) vs 59.21 (Global) vs 29.41 (Window)

BaselineGlobal-Attn and Window-Attn

lm-eval average (higher is better)

Value0.705 (Window+ROSA) vs 0.71 (Global)

BaselineGlobal-Attn

Accuracy

Value100% by epoch 5 (ROSA+Window-Attn)

BaselineGlobal-Attn (61.2% at epoch 7), Window-Attn (~2–3%)

Compute overhead (single SAM vs FlashAttention kernel)

ValueFlashAttention compute cost 9.39x–18.23x higher than a single SAM

BaselineFlashAttention (GPU) vs SAM (CPU)

GPU memory footprint

ValueEssentially the same as windowed-attention

BaselineWindowed attention

Who Should Care

What To Try In 7 Days

Run the ROSA repo on a windowed-attention Qwen3 checkpoint and reproduce LongBench or a target long-input task.

Benchmark end-to-end latency and GPU memory with post-attn (async) pipeline to measure hidden CPU overhead.

Compare post-attn (async) vs pre-attn quality on a small validation set to trade throughput for best quality.

Optimization Features

Token Efficiency

  • binary discretization of hidden features
  • run-length encoding (RLE) to fold repeated symbols

Infra Optimization

  • keeps GPU memory like windowed attention (states on CPU)

System Optimization

  • per-route independent retrieval to scale across CPU cores

Training Optimization

  • counterfactual gradient through discrete retrieval
  • adapter warm-up then full fine-tuning

Inference Optimization

  • CPU-side parallel SAM retrieval
  • asynchronous CPU–GPU pipeline to overlap retrieval with attention

Reproducibility

Data Urls

  • LongBench (Bai et al., 2024)
  • MQAR (Arora et al., 2023)
  • PG19 (Rae et al., 2019)
  • lm-eval-harness (Biderman et al., 2024)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Authors did not train to full convergence; reported gains are on the presented checkpoints only.
  • Throughput and end-to-end latency depend on CPU cores, memory bandwidth, and implementation details.
  • Pre-attn fusion gives slightly better quality but requires retrieval to finish before attention, increasing overhead.
  • Binary discretization can cause collisions; recovery depends on chosen route width M and balanced bit distributions.

When Not To Use

  • When your deployment has no spare CPU cores or cannot tolerate any CPU–GPU transfers.
  • When you already run global attention and can afford its GPU cost for your sequence lengths.
  • When you need deterministic behavior unaffected by discretization-induced retrieval misses.

Failure Modes

  • Retrieval returns no valid destination (τ = -1) and yields zero injection for that route.
  • Symbol collisions or poor discretization can cause spurious matches and wrong recall.
  • Ignoring within-run boundary effects (RLE) may miss fine-grained positional info.
  • If counterfactual gradients are not implemented, training can be unstable or diverge.

Core Entities

Models

  • Qwen3-Base-1.7B
  • Qwen3-0.6B (ablation)

Metrics

  • LongBench AVG
  • lm-eval AVG
  • Accuracy
  • Perplexity (PPL)

Datasets

  • LongBench
  • MQAR
  • PG19
  • lm-eval-harness

Benchmarks

  • LongBench
  • MQAR
  • lm-eval-harness