Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
RAM-Net cuts runtime memory traffic and per-token compute by activating far fewer memory entries, enabling longer contexts or cheaper inference without changing model size.
Summary TLDR
RAM-Net replaces fixed-size compressed state with a differentiable address decoder that maps low-dimensional keys/queries to high-dimensional, K-hot addresses. This gives a very large logical memory (M slots) while touching only K slots per token, reducing per-token active state from tens of millions to ~0.4M in experiments. The result: much better long-range retrieval (synthetic MQAR), comparable language-model quality, and large run-time memory savings. Key building blocks: Product Softmax (decomposed softmax), TopK truncation, Cyclic Address Positional Embedding (CAPE), and Power Decay Moving Average (PDMA).
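To make the addressing idea concrete, here is a minimal pure-Python sketch of a decomposed softmax followed by TopK truncation. It assumes the address is the outer product of U independent softmaxes over n entries each (so M = n^U) and that the surviving K weights are renormalized; both are illustrative assumptions, and the function name `product_softmax_address` is hypothetical rather than taken from the paper.

```python
import math
from itertools import product

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def product_softmax_address(factor_logits, k):
    """Build a K-hot address from U groups of n logits (assumed factorization).

    The weight of slot (i_1, ..., i_U) is the product of the per-factor
    softmax probabilities, so the logical memory has M = n**U slots.
    This version enumerates every slot and is only meant for tiny examples.
    """
    factors = [softmax(f) for f in factor_logits]
    n = len(factor_logits[0])
    weights = {}
    for combo in product(range(n), repeat=len(factors)):
        slot, w = 0, 1.0
        for u, i in enumerate(combo):
            slot = slot * n + i      # flatten the index tuple into one slot id
            w *= factors[u][i]
        weights[slot] = w
    # TopK truncation: keep the K heaviest slots, then renormalize (assumption).
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]
    norm = sum(w for _, w in top)
    return {slot: w / norm for slot, w in top}

# U = 2 factors of n = 4 entries each: M = 16 logical slots, K = 2 active.
address = product_softmax_address([[2.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0]], k=2)
print(address)  # the heaviest slot is index 1 (factor indices 0 and 1)
```

Under this assumed factorization the decoder only has to produce U·n logits to address n^U slots, which is how capacity can grow without a matching growth in parameters.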
Problem Statement
Linear attention keeps a constant-size state for efficiency but compresses all history into that fixed-size state, causing interference and poor fine-grained long-range recall. Full attention keeps everything, but its memory cost grows linearly with sequence length. The paper asks: can we get high-fidelity retrieval like full attention while keeping per-step computation constant and low, without adding learnable parameters?
Main Contribution
A differentiable Address Decoder that maps dense keys/queries to high-dimensional, sparse K-hot addresses so memory capacity M scales independently of model parameters.
Product Softmax + TopK to create scalable, trainable sparse addresses with better gradient flow than global softmax.
Cyclic Address Positional Embedding (CAPE) to encode relative position into addresses, and Power Decay Moving Average (PDMA) to decouple forgetting from write intensity.
Practical algorithms and kernels (beam-style TopK decoding, segment batching) to make sparse read/write efficient in training and inference.
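The beam-style TopK decoding mentioned above can be sketched as follows. This is an illustration under the assumption that a slot's score is the sum of independent per-factor log-probabilities; the function name `beam_topk_slots` and the slot-flattening scheme are hypothetical, not the paper's kernels. Because the score is additive across factors, a beam of width K recovers the exact top K slots (up to ties) while scoring only about K·n candidates per factor instead of all n^U slots.

```python
import math

def log_softmax(xs):
    # Log-domain softmax: avoids underflow when many factors are combined.
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def beam_topk_slots(factor_logits, k):
    """Return the k best (log_prob, slot) pairs, best first.

    The beam is extended one factor at a time; after each factor only the
    k highest-scoring partial index tuples are kept.
    """
    beam = [(0.0, 0)]  # (accumulated log-probability, flattened partial index)
    for logits in factor_logits:
        n = len(logits)
        logp = log_softmax(logits)
        candidates = [
            (score + logp[i], slot * n + i)
            for score, slot in beam
            for i in range(n)
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:k]
    return beam

# U = 3 factors of n = 3 entries: 27 logical slots, top 4 decoded.
factor_logits = [[0.1, 1.3, -0.7], [2.0, 0.5, -1.1], [0.3, -0.2, 0.9]]
top_slots = [slot for _, slot in beam_topk_slots(factor_logits, k=4)]
print(top_slots)  # [11, 9, 10, 2]
```

A production kernel would additionally prune each factor to its own top K entries before expansion; that refinement is omitted here for brevity.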
Key Findings
RAM-Net lowers per-token active state from tens of millions to about 0.4M in the reported experiments, cutting memory-bandwidth demand.
RAM-Net consistently outperforms linear-attention and other efficient architectures on fine-grained long-range retrieval (MQAR).
Language-modeling quality is competitive while saving runtime state: WikiText-103 perplexity 32.33 and MMLU 23.3.
Increasing the Product Softmax decomposition order U improves both optimization and retrieval.
Results
Active State per token
WikiText-103 perplexity (lower is better)
Accuracy
Who Should Care
What To Try In 7 Days
Run a small RAM-Net head in an existing linear-attention model: implement Product Softmax + TopK and compare active state and retrieval accuracy on a synthetic MQAR task.
Measure real GPU memory bandwidth impact: replace a layer with sparse read/write and track active-state reduction and latency.
Test CAPE on your autoregressive pipeline to see if cyclic positional addressing improves recall for time-sensitive retrieval tasks.
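For the synthetic MQAR comparison suggested above, a small data generator is usually the first thing needed. The sketch below is a simplified, hypothetical stand-in (it is not the Zoology implementation): each example is a context of interleaved key/value tokens followed by queries over a subset of the keys, and `lookup_baseline` is a perfect-recall reference for sanity-checking the evaluation loop.

```python
import random

def make_mqar_example(num_pairs, num_queries, vocab_size, seed=0):
    """Build one multi-query associative recall example (simplified).

    Keys come from the lower half of the vocabulary and values from the
    upper half, so the two token types never collide.
    """
    rng = random.Random(seed)
    half = vocab_size // 2
    keys = rng.sample(range(half), num_pairs)                 # distinct keys
    values = [rng.randrange(half, vocab_size) for _ in keys]
    mapping = dict(zip(keys, values))
    context = [tok for pair in zip(keys, values) for tok in pair]  # k1 v1 k2 v2 ...
    queries = rng.sample(keys, num_queries)
    targets = [mapping[q] for q in queries]
    return context, queries, targets

def lookup_baseline(context, query):
    # Perfect-recall reference: scan the (key, value) pairs for the query.
    for key, value in zip(context[0::2], context[1::2]):
        if key == query:
            return value
    return None

def recall_accuracy(predict, examples):
    # Fraction of queries for which predict(context, query) equals the target.
    correct = total = 0
    for context, queries, targets in examples:
        for q, t in zip(queries, targets):
            correct += predict(context, q) == t
            total += 1
    return correct / total

examples = [make_mqar_example(num_pairs=8, num_queries=4, vocab_size=64, seed=s)
            for s in range(3)]
print(recall_accuracy(lookup_baseline, examples))  # 1.0
```

Swapping `lookup_baseline` for a wrapper around a trained model gives a like-for-like accuracy number as the number of key/value pairs grows.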
Agent Features
Memory
- selectively_addressable_memory
- large_M_scaling
- sparse_read_write
Frameworks
- Product Softmax
- CAPE
- PDMA
Architectures
- linear_attention
- sparse_memory
- explicit_addressing
Optimization Features
Token Efficiency
- activates only K slots per token (Top-8 used in experiments)
- active-state per token lowered to 0.4M in reported config
Infra Optimization
- specialized kernels for sparse read/write during training and inference
Model Optimization
- memory capacity scales without extra parameters
- decomposed product softmax improves gradient flow
System Optimization
- segment-based batching for training to aggregate sparse ops
- LRU-style caching of hot slots to keep full state off GPU
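The LRU-style caching of hot slots can be illustrated with a small sketch. It assumes the full state lives in a slower backing store (a plain dict stands in for host memory here) and that only a bounded number of recently touched slots stay resident; the class name `HotSlotCache` and its interface are hypothetical.

```python
from collections import OrderedDict

class HotSlotCache:
    """Keep at most `capacity` (>= 1) memory rows resident, evicting the
    least recently used row back to the backing store."""

    def __init__(self, capacity, backing_store, d_v):
        self.capacity = capacity
        self.backing = backing_store   # slot -> row; stands in for off-GPU storage
        self.d_v = d_v
        self.hot = OrderedDict()       # slot -> row, ordered oldest to newest
        self.misses = 0

    def get(self, slot):
        if slot in self.hot:
            self.hot.move_to_end(slot)         # mark as most recently used
            return self.hot[slot]
        self.misses += 1
        row = self.backing.pop(slot, None)     # fetch from the slow store
        if row is None:
            row = [0.0] * self.d_v             # never-written slot starts at zero
        self.hot[slot] = row
        if len(self.hot) > self.capacity:
            old_slot, old_row = self.hot.popitem(last=False)
            self.backing[old_slot] = old_row   # write the evicted row back
        return row

cache = HotSlotCache(capacity=2, backing_store={}, d_v=4)
for slot in [1, 2, 1, 3]:
    cache.get(slot)
print(list(cache.hot), cache.misses)  # [1, 3] 3
```

Rows are returned by reference, so in-place updates to a resident row are preserved and written back when that row is evicted.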
Training Optimization
- per-head scalar re-parameterization (dynamic softmax temperature) to speed convergence
- proxy gradients for PDMA to stabilize backprop
Inference Optimization
- sparse TopK read/write reduces per-step compute to O(K·d_v)
- log-domain beam search for TopK decoding reduces decoding complexity
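The O(K·d_v) per-step cost follows from touching only the K addressed rows. The sketch below illustrates this with a dict-backed memory and a K-hot address given as `{slot: weight}`. The plain additive write is an assumption used for brevity and stands in for the paper's PDMA update, which is not reproduced here; the function names are hypothetical.

```python
def sparse_write(memory, address, value):
    """Add `weight * value` into each addressed row: K rows, d_v work each."""
    for slot, weight in address.items():
        row = memory.setdefault(slot, [0.0] * len(value))
        for j, v in enumerate(value):
            row[j] += weight * v     # additive update (assumption, not PDMA)

def sparse_read(memory, address, d_v):
    """Return the weighted sum of the addressed rows: again O(K * d_v)."""
    out = [0.0] * d_v
    for slot, weight in address.items():
        row = memory.get(slot)
        if row is None:              # slots never written contribute nothing
            continue
        for j in range(d_v):
            out[j] += weight * row[j]
    return out

memory = {}                           # logical size M; only touched rows exist
sparse_write(memory, {3: 0.75, 7: 0.25}, [1.0, 2.0])
print(sparse_read(memory, {3: 1.0}, d_v=2))  # [0.75, 1.5]
```

Neither function iterates over the full set of M slots, so the per-token cost is independent of the logical memory size.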
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Requires specialized TopK decoding and sparse kernels; not plug-and-play on all runtimes.
- Total state M can still be large (reported total state 28.8M) and may need external storage/caching.
- Address decoding and PDMA introduce new hyperparameters (U, K, γ) that need tuning.
- Performance benefits depend on choosing a suitable product order U; U=1 (a single global softmax) has severe gradient issues.
When Not To Use
- For short-context tasks where full attention is affordable and simpler.
- If target runtime cannot support efficient sparse kernels or random memory access.
- When you need the best possible perplexity on current benchmarks and cannot trade accuracy for efficiency.
Failure Modes
- Gradient sparsity and instability if product softmax is not used (U=1).
- Poor retrieval if K (sparsity) is set incorrectly: too small a K loses signal, too large a K raises compute.
- System-level overheads (index decoding, beam search) may negate savings on small models.
Core Entities
Models
- RAM-Net
- Full Attention
- Linear Attention
Metrics
- Accuracy
- perplexity (WikiText-103)
- Active State per token
- Parameter count
Datasets
- FineWeb-Edu
- Zoology MQAR
Benchmarks
- MQAR
- WikiText-103
Context Entities
Models
- GLA
- DeltaNet
- Gated DeltaNet
- RWKV-7
- Mamba2
- H3
- Transformer++
- HGRN2
Metrics
- perplexity
- Accuracy
- active memory per token
Datasets
- WikiText-103
- MMLU
- ARC
- OpenbookQA
- SciQ
- COPA
- PIQA
- HellaSwag
- WinoGrande
Benchmarks
- Language Model Evaluation Harness
- Zoology framework MQAR

