A3: component-aware low-rank compression for Transformers that cuts model size, KV cache and FLOPs with no runtime overhead

May 19, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, George A. Constantinides, Wayne Luk, Yiren Zhao

Links

Abstract / PDF

Why It Matters For Business

A3 reduces inference cost and memory (including KV cache) without adding runtime work, so you can lower cloud GPU spend and serve larger models at similar latency while preserving or improving accuracy on common benchmarks.

Summary TLDR

A3 is a post-training low-rank compression method that splits a Transformer layer into three functional parts, QK (query-key), OV (output-value), and MLP, and finds low-rank approximations that minimize each component's functional error. The method reduces model parameters, KV cache size, and FLOPs while keeping the same GEMM structure (no extra small-matrix GEMMs). A3 supports common variants (RoPE, GQA), combines with quantization, and matches or improves on state-of-the-art low-rank baselines: e.g., at 10% compression, A3 on LLaMA-3.1-70B reaches a WikiText-2 perplexity (PPL) of 4.69 versus SVD-LLM's 7.87. The approach is calibration-based, works without fine-tuning, and is practical for inference-

Problem Statement

Existing low-rank methods treat each linear layer in isolation and often decompose weights into extra small matrices. That gives modest savings and added runtime overhead. The problem is how to compress Transformers in a way that (1) directly optimizes attention and MLP functional errors, (2) reduces KV cache and FLOPs, and (3) avoids extra runtime GEMMs or memory ops.
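
To make the overhead argument concrete, here is a minimal accounting sketch for a single linear layer. The function names are illustrative and not taken from the paper; the sketch only counts parameters and GEMMs for a dense weight versus a generic two-factor replacement, which is the pattern described above.

```python
def dense_cost(d_in, d_out):
    """Parameters and GEMM count for one dense layer y = x @ W."""
    return {"params": d_in * d_out, "gemms": 1}


def factorized_cost(d_in, d_out, rank):
    """Cost after replacing W with factors A (d_in x rank) and B (rank x d_out)."""
    return {"params": rank * (d_in + d_out), "gemms": 2}


def break_even_rank(d_in, d_out):
    """Largest rank at which the two factors store strictly fewer
    parameters than the dense weight: rank * (d_in + d_out) < d_in * d_out."""
    return (d_in * d_out - 1) // (d_in + d_out)


if __name__ == "__main__":
    d = 4096  # illustrative hidden size
    print(dense_cost(d, d))             # one GEMM, d*d parameters
    print(factorized_cost(d, d, 1024))  # two GEMMs, half the parameters
    print(break_even_rank(d, d))        # ranks above this save nothing
```

For a square 4096 x 4096 layer the factorized form only saves parameters below rank 2048, and it always runs two GEMMs instead of one. That is the trade-off A3 is described as avoiding by shrinking existing dimensions rather than inserting new factors.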

Main Contribution

Three-part decomposition (QK, OV, MLP) and functional objectives that target attention scores, attention outputs, and MLP outputs.

Closed‑form analytical solutions for QK and OV low-rank approximations; CUR-based selection for MLP and RoPE-adapted attention.

Applies to Transformer variants (RoPE, GQA) and keeps the same number of GEMMs at smaller shapes, so there are no extra runtime kernel launches.

Demonstrates strong empirical gains across models and datasets; compatible with weight-only quantization and mixed-rank allocation.
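
The QK idea in the list above can be sketched for one attention head as follows. This is a simplified, data-free illustration and not the paper's exact solution: it assumes no RoPE, ignores the calibration statistics that A3 uses, omits the usual score scaling factor, and relies on a plain truncated SVD of the product of the query and key projections. The function name `compress_qk_head` is hypothetical.

```python
import numpy as np


def compress_qk_head(w_q, w_k, rank):
    """Shrink one head's query/key projections to a smaller head dimension.

    w_q, w_k: arrays of shape (d_model, d_head).
    Pre-softmax scores are (x @ w_q) @ (x @ w_k).T = x @ (w_q @ w_k.T) @ x.T,
    so only the product w_q @ w_k.T matters. A truncated SVD of that product
    gives its best rank-`rank` approximation in Frobenius norm.
    """
    product = w_q @ w_k.T                      # (d_model, d_model), rank <= d_head
    u, s, vt = np.linalg.svd(product, full_matrices=False)
    root = np.sqrt(s[:rank])                   # split singular values evenly
    w_q_small = u[:, :rank] * root             # (d_model, rank)
    w_k_small = vt[:rank, :].T * root          # (d_model, rank)
    return w_q_small, w_k_small


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_q = rng.standard_normal((16, 8))
    w_k = rng.standard_normal((16, 8))
    x = rng.standard_normal((5, 16))           # 5 tokens

    w_q4, w_k4 = compress_qk_head(w_q, w_k, rank=4)
    full = (x @ w_q) @ (x @ w_k).T
    small = (x @ w_q4) @ (x @ w_k4).T          # same two projections, smaller shapes
    print(np.linalg.norm(full - small) / np.linalg.norm(full))
```

The compressed head still computes one query projection and one key projection; only the head dimension changes. This is what keeps the GEMM count unchanged, and because the stored keys get narrower, the key half of the KV cache shrinks with it.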

Key Findings

On WikiText-2 at 10% compression, A3 on LLaMA-3.1-70B achieves perplexity 4.69 versus SVD-LLM's 7.87.

Numbers: PPL 4.69 vs 7.87 (Δ -3.18, -40.4% relative)

On LLaMA-2-7B at a 10% compression ratio (CR), A3 yields lower perplexity than SVD-LLM (5.96 vs 8.78).

Numbers: PPL 5.96 vs 8.78 (Δ -2.82)

A3 increases inference throughput compared to SVD-LLM without adding extra GEMM kernels.

A3 is compatible with weight-only quantization and mixed-rank allocation with small extra degradation.

Results

perplexity (WikiText-2)

Value: 4.69 (A3, LLaMA-3.1-70B, 10% CR)

Baseline: 7.87 (SVD-LLM, same model and CR)

Accuracy

Value: 0.7508 (A3, LLaMA-3.1-70B, 10% CR)

Baseline: 0.6797 (SVD-LLM, same model and CR)

inference throughput (TPS)

Value: A3 shows consistent speedup vs SVD-LLM (LLaMA-2-13B, A100)

Baseline: SVD-LLM throughput

Who Should Care

What To Try In 7 Days

Calibrate A3 on 128 sequences from your own data, apply it to a single decoder-only model at 10% compression, and measure PPL and TPS.

Measure tokens/sec before/after on representative hardware to confirm throughput gains.

Combine A3 with your existing 4-bit quantizer and check end-to-end quality; expect small extra degradation per paper results.
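
For the before/after measurements in the steps above, a minimal sketch of the two metrics might look like this. It is framework-agnostic and assumes you already have per-token natural-log probabilities from your evaluation loop and a callable that runs one inference pass; `perplexity` and `tokens_per_second` are illustrative helper names, not part of any A3 tooling.

```python
import math
import time


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of the reference text."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def tokens_per_second(run_once, n_tokens, repeats=3):
    """Best-of-`repeats` throughput for a callable that processes n_tokens per call."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        best = min(best, time.perf_counter() - start)
    return n_tokens / max(best, 1e-12)  # guard against a zero-length timing


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity 4.
    print(perplexity([math.log(0.25)] * 8))
    # Stand-in workload; replace the lambda with a real forward/generate call.
    print(tokens_per_second(lambda: time.sleep(0.01), n_tokens=100))
```

Run both helpers on the original and the compressed model with identical inputs, batch size, and hardware, and warm the GPU up before timing so that one-off setup cost does not distort the comparison.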

Optimization Features

Token Efficiency

  • improves tokens/sec in prefilling profiles vs SVD-LLM

Infra Optimization

  • supports higher throughput on GPU backends without extra kernel launches

Model Optimization

  • reduces hidden head dimensions (d_qk, d_vo) and MLP intermediate size
  • low-rank per-component approximations (analytical SVD and CUR)
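
As an illustration of shrinking the MLP intermediate size by selection rather than factorization, the sketch below keeps a subset of intermediate neurons, which corresponds to choosing columns of the up-projection and the matching rows of the down-projection. The scoring rule (mean squared calibration activation times squared down-projection row norm), the ReLU activation, and the ungated MLP are assumptions made for this example; the paper's CUR-based criterion may differ.

```python
import numpy as np


def select_mlp_neurons(w_up, w_down, calib_x, keep):
    """Keep the `keep` highest-scoring intermediate neurons of y = relu(x @ w_up) @ w_down.

    w_up: (d_model, d_int), w_down: (d_int, d_model), calib_x: (n_tokens, d_model).
    Returns the smaller weights and the kept indices (in ascending order).
    """
    hidden = np.maximum(calib_x @ w_up, 0.0)            # calibration activations
    # Hypothetical importance score: activation energy times output weight energy.
    score = (hidden ** 2).mean(axis=0) * (w_down ** 2).sum(axis=1)
    idx = np.sort(np.argsort(score)[-keep:])
    return w_up[:, idx], w_down[idx, :], idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_up = rng.standard_normal((16, 32))
    w_down = rng.standard_normal((32, 16))
    calib_x = rng.standard_normal((64, 16))

    w_up_s, w_down_s, idx = select_mlp_neurons(w_up, w_down, calib_x, keep=24)
    full = np.maximum(calib_x @ w_up, 0.0) @ w_down
    small = np.maximum(calib_x @ w_up_s, 0.0) @ w_down_s
    print(w_up_s.shape, w_down_s.shape)
    print(np.linalg.norm(full - small) / np.linalg.norm(full))
```

Because the result is still one up-projection and one down-projection, just narrower, the layer runs the same kernels as before. Selection is not guaranteed to match the error of an optimal low-rank factorization, which is consistent with the CUR limitation noted under Risks & Boundaries.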

System Optimization

  • reduces memory footprint and FLOPs for attention and MLP

Training Optimization

  • post-training only; no further fine-tuning required

Inference Optimization

  • keeps same number of GEMMs but with smaller shapes (no extra GEMMs)
  • cuts KV cache size proportionally to rank reduction
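
The proportional KV cache saving can be checked with simple arithmetic. The sketch below is a generic size estimate, not code from the paper; it assumes one cached key vector of width d_qk and one cached value vector of width d_vo per KV head, per layer, per token. The example uses the public LLaMA-2-7B shape (32 layers, 32 KV heads, head dimension 128) in 16-bit precision, and the reduced width of 96 is chosen purely for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_qk, d_vo, seq_len,
                   batch=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    Each token stores one key of width d_qk and one value of width d_vo
    for every KV head in every layer.
    """
    per_token = n_layers * n_kv_heads * (d_qk + d_vo)
    return per_token * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    base = kv_cache_bytes(32, 32, d_qk=128, d_vo=128, seq_len=4096)
    small = kv_cache_bytes(32, 32, d_qk=96, d_vo=96, seq_len=4096)
    print(base / 2**30)   # 2.0 GiB
    print(small / 2**30)  # 1.5 GiB
    print(small / base)   # 0.75
```

Cutting both head dimensions by a quarter cuts the cache by a quarter, independent of sequence length or batch size. With GQA, pass the number of KV heads rather than the number of query heads.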

Reproducibility

Data Urls

  • WikiText-2
  • C4
  • SlimPajama
  • PTB

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • CUR-based steps (MLP and RoPE) do not guarantee SVD-level optimality and degrade faster at high compression.
  • Calibration selection matters; overfitting calibration can bias results (paper shows SlimPajama vs WikiText-2 differences).
  • Independence assumption between query and key inputs weakens in deeper layers, which may affect QK approximation accuracy.
  • For compression >~20% the method can lose quality and retraining may be required.

When Not To Use

  • When you need very aggressive compression (>20%) without retraining.
  • If you lack representative calibration data for autocorrelation estimates.
  • If your deployment strictly forbids any runtime indexing or small additional kernel work required for RoPE adaptations.

Failure Modes

  • Large perplexity degradation at high compression ratios due to CUR suboptimality.
  • KV cache size may increase if the fused OV solution is used with an unsuitable rank selection.
  • Calibration overfitting leading to inconsistent downstream task accuracy.

Core Entities

Models

  • LLaMA-3.1-70B
  • LLaMA-3.1-8B
  • LLaMA-2-13B
  • LLaMA-2-7B
  • MPT-7B
  • MosaicML MPT family (reference)

Metrics

  • perplexity
  • Accuracy
  • tokens/sec (TPS)

Datasets

  • WikiText-2
  • C4
  • SlimPajama
  • PTB (used in calibration mixture)

Benchmarks

  • ARC-C
  • BoolQ
  • Winogrande
  • GSM8K
  • MMLU