A linear-attention LLM that matches or beats Transformers while running faster and using less memory
TransNormerLLM can lower the compute and memory needed to train and serve long-context LLMs while keeping or improving accuracy. Teams can run longer contexts or cut hardware costs without sacrificing model quality.
Key finding
TransNormerLLM yields lower perplexity than Transformer baselines at small and medium scales.
Numbers (perplexity, lower is better): 385M model: PPL 4.77 vs. Transformer 5.16; 1B model: PPL 3.729 vs. Transformer 4.765
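
To put the two gaps on a common footing, the perplexities above can be converted into relative reductions against the Transformer baseline. The sketch below is only arithmetic on the reported figures; the helper name `relative_ppl_reduction` and the `reported` dictionary are illustrative and not taken from the paper.

```python
def relative_ppl_reduction(baseline_ppl: float, model_ppl: float) -> float:
    """Fraction by which model_ppl is lower than baseline_ppl."""
    return (baseline_ppl - model_ppl) / baseline_ppl


# (Transformer PPL, TransNormerLLM PPL), copied from the numbers above.
reported = {
    "385M": (5.16, 4.77),
    "1B": (4.765, 3.729),
}

for size, (transformer_ppl, transnormer_ppl) in reported.items():
    reduction = relative_ppl_reduction(transformer_ppl, transnormer_ppl)
    print(f"{size}: {reduction:.1%} lower perplexity than the Transformer baseline")
```

On these figures the reduction is roughly 7.6% at 385M and roughly 21.7% at 1B, so the reported gap is wider for the larger model.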

