3 papers found

A linear-attention LLM that matches or beats Transformers while running faster and using less memory

0.60
0.70
0.80
5

TransNormerLLM can lower compute and memory needs for long-context LLM training and serving while keeping or improving accuracy, letting teams run larger contexts or reduce hardware costs without sacrificing model quality.
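To show where the compute and memory savings of linear attention come from in general, here is a minimal NumPy sketch of generic kernelized linear attention (non-causal, with an `elu(x) + 1` feature map). It is not TransNormerLLM's specific attention design, which this summary does not describe. The function names, the feature map, and the toy shapes are all assumptions chosen for illustration.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map elu(x) + 1; not taken from the paper.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Reassociate (Qf Kf^T) V as Qf (Kf^T V): cost grows linearly with
    # sequence length n, and the n x n attention matrix is never formed.
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V              # (d, d_v) summary of all keys and values
    z = Kf.sum(axis=0)         # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]

def quadratic_reference(Q, K, V):
    # Same computation done the O(n^2) way, for comparison only.
    Qf, Kf = feature_map(Q), feature_map(K)
    A = Qf @ Kf.T              # (n, n) attention matrix
    return (A / A.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d, d_v = 128, 16, 16        # toy sizes, hypothetical
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d_v))
out = linear_attention(Q, K, V)
```

Both functions return the same result. The linear version only stores a `d x d_v` summary instead of an `n x n` matrix, which is the usual source of the long-context savings.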

Key finding

TransNormerLLM reaches lower perplexity than Transformer baselines at both reported scales, 385M and 1B parameters.

Numbers: 385M model: PPL 4.77 vs Transformer 5.16; 1B model: PPL 3.729 vs Transformer 4.765
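To put the two gaps on a common scale, the short sketch below computes the relative perplexity reduction from the figures quoted above. The helper name `relative_reduction` is an assumption for illustration; only the four perplexity values come from the listing.

```python
def relative_reduction(baseline: float, model: float) -> float:
    """Fraction by which `model` perplexity is below `baseline` perplexity."""
    return (baseline - model) / baseline

# (Transformer PPL, TransNormerLLM PPL) as quoted in the listing.
results = {
    "385M": (5.16, 4.77),
    "1B": (4.765, 3.729),
}

for scale, (transformer_ppl, transnormer_ppl) in results.items():
    gain = relative_reduction(transformer_ppl, transnormer_ppl)
    print(f"{scale}: {gain:.1%} lower perplexity")
```

On these numbers the gap is about 7.6% at 385M and about 21.7% at 1B.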

A3: component-aware low-rank compression for Transformers that cuts model size, KV cache and FLOPs with no runtime overhead

0.70
0.60
0.80
0

A3 reduces inference cost and memory (including KV cache) without adding runtime work, so you can lower cloud GPU spend and serve larger models at similar latency while preserving or improving accuracy on common benchmarks.

Key finding

On WikiText-2 at 10% compression, A3 on LLaMA-3.1-70B achieves perplexity 4.69 versus SVD-LLM's 7.87.

Numbers: PPL 4.69 vs 7.87 (∆ -3.18, about -40.4% relative to SVD-LLM)
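For readers unfamiliar with low-rank compression, the sketch below shows the generic idea with a plain truncated SVD of a single weight matrix. This is a textbook baseline and not A3's component-aware method, which this summary does not detail. The function name `low_rank_factors`, the `keep_ratio` parameter, and the toy matrix are assumptions for illustration.

```python
import numpy as np

def low_rank_factors(W, keep_ratio):
    """Factor W (m x n) into A (m x r) and B (r x n) so that the two
    factors together hold at most keep_ratio * m * n parameters."""
    m, n = W.shape
    rank = max(1, int(keep_ratio * m * n / (m + n)))
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))            # toy weight matrix, hypothetical
A, B = low_rank_factors(W, keep_ratio=0.9)   # roughly "10% compression"

x = rng.standard_normal(64)
y_full = W @ x            # original layer
y_low = A @ (B @ x)       # compressed layer: two smaller matmuls
```

A random Gaussian matrix compresses poorly under plain SVD, so the approximation error here is large. Trained weights are typically closer to low rank, and methods such as those compared in this entry differ in how they choose what to truncate.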

Scale memory capacity without extra parameters using sparse high‑dimensional addresses

0.60
0.70
0.80
0

RAM‑Net cuts runtime memory traffic and per-token compute by activating far fewer memory entries, enabling longer contexts or cheaper inference without changing model size.

Key finding

RAM‑Net's per-token active state is roughly two orders of magnitude smaller than Transformer++'s, which cuts memory-bandwidth demand.

Numbers: Active state per token: RAM‑Net 0.4M vs Transformer++ 50.3M (Table 1/2)
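To illustrate what "activating far fewer memory entries" can mean in practice, here is a minimal sketch of a generic top-k sparse memory read. It is not RAM‑Net's addressing scheme, which this summary does not specify. The function name `sparse_read`, the slot count, and the softmax weighting over the selected slots are all assumptions.

```python
import numpy as np

def sparse_read(memory, scores, k):
    """Read from only the k highest-scoring memory slots."""
    top = np.argpartition(scores, -k)[-k:]           # indices of the k best slots
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                         # softmax over the k slots only
    return weights @ memory[top], top

rng = np.random.default_rng(0)
num_slots, dim, k = 4096, 32, 8                      # toy sizes, hypothetical
memory = rng.standard_normal((num_slots, dim))
scores = rng.standard_normal(num_slots)              # stand-in for address scores

value, active = sparse_read(memory, scores, k)
active_fraction = k / num_slots                      # share of slots touched per read
```

Only `k` rows of `memory` are fetched per read, so memory traffic scales with `k` rather than with the total number of slots.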