Scale memory capacity without extra parameters using sparse high‑dimensional addresses

February 12, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

0

Authors

Kaicheng Xiao, Haotian Li, Liran Dong, Guoliang Xing

Why It Matters For Business

RAM‑Net cuts runtime memory traffic and per-token compute by activating far fewer memory entries, enabling longer contexts or cheaper inference without changing model size.

Summary TLDR

RAM-Net replaces fixed-size compressed state with a differentiable address decoder that maps low-dimensional keys/queries to high-dimensional, K-hot addresses. This gives a very large logical memory (M slots) while touching only K slots per token, reducing per-token active state from tens of millions to ~0.4M in experiments. The result: much better long-range retrieval (synthetic MQAR), comparable language-model quality, and large run-time memory savings. Key building blocks: Product Softmax (decomposed softmax), TopK truncation, Cyclic Address Positional Embedding (CAPE), and Power Decay Moving Average (PDMA).
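To make the addressing idea concrete, here is a minimal NumPy sketch of a decomposed softmax followed by TopK truncation. It is an illustration under stated assumptions, not the paper's implementation: it assumes the M logical slots factor into U equal groups of size n (so M = n**U), it materializes all M joint probabilities by brute force, and it renormalizes the kept weights. The function name `product_softmax_address` is hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def product_softmax_address(logits, k):
    """logits: (U, n) array, one row of scores per factor (assumed layout).
    Returns the ids and weights of the k most probable of the n**U slots."""
    U, n = logits.shape
    probs = softmax(logits)                  # U independent softmaxes of size n
    joint = probs[0]
    for u in range(1, U):
        joint = np.multiply.outer(joint, probs[u])   # shape (n,) * (u + 1)
    flat = joint.reshape(-1)                 # all n**U logical slots (brute force)
    top = np.argsort(flat)[::-1][:k]         # TopK truncation gives a K-hot address
    weights = flat[top] / flat[top].sum()    # renormalization is an assumption
    return top, weights

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 16))            # U=3, n=16, so M = 4096 logical slots
slots, weights = product_softmax_address(logits, k=8)
print(slots, weights.round(3))
```

The decoder here only produces U·n = 48 scores, yet the address ranges over 4096 slots, which is the sense in which capacity grows without extra parameters.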

Problem Statement

Linear attention keeps a constant-size state for efficiency, but it compresses all history into that limited state, causing interference and poor fine-grained long-range recall. Full attention keeps everything, but its memory grows linearly with sequence length. The paper asks: can we get high-fidelity retrieval like full attention while keeping constant, low per-step computation and no extra learnable parameters?

Main Contribution

A differentiable Address Decoder that maps dense keys/queries to high-dimensional, sparse K-hot addresses so memory capacity M scales independently of model parameters.

Product Softmax + TopK to create scalable, trainable sparse addresses with better gradient flow than global softmax.

Cyclic Address Positional Embedding (CAPE) to encode relative position into addresses and PDMA (Power Decay Moving Average) to decouple forgetting from write intensity.

Practical algorithms and kernels (beam-style TopK decoding, segment batching) to make sparse read/write efficient in training and inference.
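The beam-style TopK decoding mentioned above can be sketched as follows. This is one plausible reading, not the paper's kernel: it assumes the address factors into U groups of n choices and that a slot's log-probability is the sum of its per-factor log-probabilities. Under that assumption a beam of width K is exact (barring ties), because any top-K slot must extend a top-K prefix. The name `beam_topk` and the mixed-radix slot id layout are hypothetical.

```python
import numpy as np

def log_softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def beam_topk(logits, k):
    """logits: (U, n). Returns ids and log-probs of the k best of n**U slots
    without materializing all n**U joint scores."""
    U, n = logits.shape
    logp = log_softmax(logits)
    ids = np.zeros(1, dtype=np.int64)        # slot-id prefixes kept in the beam
    scores = np.zeros(1)                     # log-prob of each prefix
    for u in range(U):
        # Extend every kept prefix by every choice of factor u.
        cand_scores = (scores[:, None] + logp[u][None, :]).reshape(-1)
        cand_ids = (ids[:, None] * n + np.arange(n)[None, :]).reshape(-1)
        keep = np.argsort(cand_scores)[::-1][:k]
        ids, scores = cand_ids[keep], cand_scores[keep]
    return ids, scores

rng = np.random.default_rng(0)
ids, scores = beam_topk(rng.normal(size=(4, 64)), k=8)   # M = 64**4 slots
print(ids, scores.round(3))
```

Each step scores at most K·n candidates, so the example ranks a few thousand candidates in total instead of 64**4 (about 16.8 million) joint entries.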

Key Findings

RAM‑Net achieves dramatically lower per-token active state, cutting memory-bandwidth demand.

Numbers: Active state per token: RAM‑Net 0.4M vs Transformer++ 50.3M (Table 1/2)
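As a quick sanity check on the size of that gap, the ratio can be computed directly from the two reported figures. This assumes both numbers count active state in the same units, and since 0.4M is a rounded value the result is approximate.

```python
# Reported active state per token, in millions (figures from the summary above).
transformer_active_m = 50.3
ramnet_active_m = 0.4

ratio = transformer_active_m / ramnet_active_m
print(f"roughly {ratio:.0f}x less active state per token")
```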

RAM‑Net consistently outperforms linear and other efficient architectures on fine-grained long-range retrieval (MQAR).

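Language-modeling quality stays close to the baselines while saving runtime state: WikiText-103 perplexity is 32.33, somewhat higher than the best baseline's 28.79, and MMLU accuracy is 23.3%, in line with the other models.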

Numbers: WikiText ppl 32.33 (RAM‑Net) vs best baseline 28.79; MMLU acc 23.3 (Table 1)

Increasing the Product Softmax decomposition order U improves both optimization and retrieval.

Results

Active State per token

Value: 0.4M (RAM‑Net Top-8)

Baseline: Transformer++ 50.3M

WikiText-103 perplexity (lower is better)

Value: 32.33 (RAM‑Net)

Baseline: best baseline 28.79 (Gated DeltaNet)

Accuracy (MMLU)

Value: 23.3% (RAM‑Net)

Baseline: 23.2% (HGRN2) and others ~23%

Accuracy (MQAR retrieval)

Value: RAM‑Net highest across tested state sizes

Baseline: Full attention, linear attention, sliding window, GLA, Mamba2, etc.

Who Should Care

What To Try In 7 Days

Run a small RAM‑Net head in an existing linear-attention model: implement Product Softmax + TopK and compare active-state and retrieval on a synthetic MQAR.

Measure real GPU memory bandwidth impact: replace a layer with sparse read/write and track active-state reduction and latency.

Test CAPE on your autoregressive pipeline to see if cyclic positional addressing improves recall for time-sensitive retrieval tasks.
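For the first two experiments, a toy sparse memory is enough to get started. The sketch below assumes a dense (M, d_v) value table, K-hot addresses given as slot ids plus weights, and a plain additive write. The paper's PDMA decay is not reproduced here, and the class name `SparseMemory` is hypothetical.

```python
import numpy as np

class SparseMemory:
    """Toy K-hot memory: each read or write touches only the addressed rows."""

    def __init__(self, num_slots, d_v):
        self.values = np.zeros((num_slots, d_v))

    def write(self, slots, weights, v):
        # Additive write (an assumption). np.add.at accumulates correctly
        # even when the same slot id appears more than once.
        np.add.at(self.values, slots, weights[:, None] * v[None, :])

    def read(self, slots, weights):
        # Weighted sum over K rows: K * d_v multiply-adds, independent of M.
        return weights @ self.values[slots]

mem = SparseMemory(num_slots=1024, d_v=4)
slots = np.array([3, 700])
weights = np.array([0.75, 0.25])
v = np.array([1.0, 2.0, 3.0, 4.0])

mem.write(slots, weights, v)
out = mem.read(slots, weights)
touched = int(mem.values.any(axis=1).sum())
print(out, touched)
```

Counting the touched rows per token, as `touched` does here, is a simple way to track active state while comparing against a dense baseline.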

Agent Features

Memory

  • selectively_addressable_memory
  • large_M_scaling
  • sparse_read_write

Frameworks

  • Product Softmax
  • CAPE
  • PDMA

Architectures

  • linear_attention
  • sparse_memory
  • explicit_addressing

Optimization Features

Token Efficiency

  • activates only K slots per token (Top-8 used in experiments)
  • active-state per token lowered to 0.4M in reported config

Infra Optimization

  • specialized kernels for sparse read/write during training and inference

Model Optimization

  • memory capacity scales without extra parameters
  • decomposed product softmax improves gradient flow

System Optimization

  • segment-based batching for training to aggregate sparse ops
  • LRU-style caching of hot slots to keep full state off GPU
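The LRU-style caching idea can be prototyped in a few lines. This is a sketch under assumptions, not the paper's system: `backing` stands in for the full state held off the accelerator, the cache holds row copies with write-back on eviction, and the class name `HotSlotCache` is hypothetical.

```python
from collections import OrderedDict
import numpy as np

class HotSlotCache:
    """Keeps the most recently used memory rows in a small 'hot' store."""

    def __init__(self, backing, capacity):
        self.backing = backing          # full (M, d_v) state, e.g. in host memory
        self.capacity = capacity
        self.hot = OrderedDict()        # slot id -> row copy, least recent first
        self.misses = 0

    def get(self, slot):
        if slot in self.hot:
            self.hot.move_to_end(slot)  # mark as most recently used
        else:
            self.misses += 1
            if len(self.hot) >= self.capacity:
                old, row = self.hot.popitem(last=False)
                self.backing[old] = row # write the evicted row back
            self.hot[slot] = self.backing[slot].copy()
        return self.hot[slot]           # callers may update this row in place

backing = np.zeros((6, 2))
cache = HotSlotCache(backing, capacity=2)
cache.get(0)[:] = 9.0                   # update slot 0 while it is hot
cache.get(1)
cache.get(2)                            # evicts slot 0 and writes it back
print(backing[0], list(cache.hot), cache.misses)
```

Tracking `misses` against total accesses gives a rough hit rate, which indicates how much of the full state actually needs to stay resident.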

Training Optimization

  • per-head scalar re-parameterization (dynamic softmax temperature) to speed convergence
  • proxy gradients for PDMA to stabilize backprop

Inference Optimization

  • sparse TopK read/write reduces per-step compute to O(K·d_v)
  • log-domain beam search for TopK decoding reduces decoding complexity

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires specialized TopK decoding and sparse kernels; not plug-and-play on all runtimes.
  • Total state M can still be large (reported total state 28.8M) and may need external storage/caching.
  • Address decoding and PDMA introduce new hyperparameters (U, K, γ) that need tuning.
  • Performance benefits depend on good U (product order); U=1 has severe gradient issues.
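One way to build intuition for the role of U is to count how many scores the address decoder must produce. The sketch assumes M splits into U equal factors of size n = M**(1/U), and the value of M is chosen for illustration only. Neither is taken from the paper.

```python
def decoder_logits(M, U):
    """Number of scores needed to address M slots with U equal factors."""
    n = round(M ** (1.0 / U))
    if n ** U != M:
        raise ValueError("M must be a perfect U-th power in this sketch")
    return U * n

M = 2 ** 24                              # illustrative memory size (assumption)
for U in (1, 2, 3, 4):
    print(f"U={U}: {decoder_logits(M, U)} logits")
```

With U=1 the decoder scores every slot directly and TopK keeps only K of them per token, which is one plausible reading of the gradient problems noted above. Higher U shrinks the score count sharply.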

When Not To Use

  • For short-context tasks where full attention is affordable and simpler.
  • If target runtime cannot support efficient sparse kernels or random memory access.
  • When you need best possible perplexity on current benchmarks and cannot trade accuracy for efficiency.

Failure Modes

  • Gradient sparsity and instability if product softmax is not used (U=1).
  • Poor retrieval if K (sparsity) is set incorrectly: too small a K loses signal, too large a K raises compute.
  • System-level overheads (index decoding, beam search) may negate savings on small models.

Core Entities

Models

  • RAM-Net
  • Full Attention
  • Linear Attention

Metrics

  • Accuracy
  • perplexity (WikiText-103)
  • Active State per token
  • Parameter count

Datasets

  • FineWeb-Edu
  • Zoology MQAR

Benchmarks

  • MQAR
  • WikiText-103

Context Entities

Models

  • GLA
  • DeltaNet
  • Gated DeltaNet
  • RWKV-7
  • Mamba2
  • H3
  • Transformer++
  • HGRN2

Metrics

  • perplexity
  • Accuracy
  • active memory per token

Datasets

  • WikiText-103
  • MMLU
  • ARC
  • OpenbookQA
  • SciQ
  • COPA
  • PIQA
  • HellaSwag
  • WinoGrande

Benchmarks

  • Language Model Evaluation Harness
  • Zoology framework MQAR