Cut KV-cache costs by predicting important tokens from RoPE frequency chunks

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.8

Authors

Yifei Wang, Yueqi Wang, Zhenrui Yue, Huimin Zeng, Yong Wang, Ismini Lourentzou, Zhengzhong Tu, Xiangxiang Chu, Julian McAuley

Why It Matters For Business

FASA cuts GPU memory use and memory bandwidth during long-input inference with almost no accuracy loss (under 0.7% average drop versus full KV), which lowers hosting costs and enables long-context features on smaller hardware.

Summary TLDR

FASA is a training-free, two-stage method that predicts which past tokens matter for each query by inspecting a small set of RoPE frequency chunks (FCs). The first stage, TIP, uses those FCs to rank cached tokens cheaply; the second, FAC, runs full attention only on the top-ranked tokens. Across long-context benchmarks and long-generation tasks, FASA stays close to full-KV quality while cutting memory and bandwidth: nearly 100% of full-KV performance on LongBench-V1 while keeping 256 tokens, up to 8× KV memory reduction (FASA-M), and up to 2.56× end-to-end speedup on long chain-of-thought reasoning while using 18.9% of the cache (AIME24).
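The two stages can be sketched for a single head and a single decoding query as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `tip_select` and `fac_attend` are hypothetical, queries and keys are assumed to be already RoPE-rotated, `dominant_dims` is an illustrative flat list of head dimensions standing in for the dominant FCs, and plain Python lists are used for readability where a real kernel would use GPU tensors.

```python
import math
import random

def dot(a, b, dims=None):
    """Dot product, optionally restricted to a subset of dimensions."""
    idx = range(len(a)) if dims is None else dims
    return sum(a[i] * b[i] for i in idx)

def tip_select(query, keys, dominant_dims, top_k):
    """Stage 1 (TIP-style): rank cached tokens using only the dominant dimensions."""
    proxy = [dot(query, k, dominant_dims) for k in keys]
    ranked = sorted(range(len(keys)), key=lambda t: proxy[t], reverse=True)
    return sorted(ranked[:top_k])  # return the kept tokens in positional order

def fac_attend(query, keys, values, selected):
    """Stage 2 (FAC-style): exact softmax attention over the selected tokens only."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[t]) * scale for t in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * values[t][j] for w, t in zip(weights, selected)) / total
            for j in range(dim_v)]

# Toy data: 64 cached tokens, head dimension 16.
random.seed(0)
head_dim, n_tokens = 16, 64
keys = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(n_tokens)]
query = [random.gauss(0, 1) for _ in range(head_dim)]
dominant_dims = [0, 1, 8, 9]  # hypothetical; real indices would come from calibration

selected = tip_select(query, keys, dominant_dims, top_k=8)
output = fac_attend(query, keys, values, selected)
print(len(selected), len(output))
```

The proxy scores are used only to choose tokens; the attention weights themselves are always recomputed from the full query and key vectors in the second stage.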

Problem Statement

Long inputs make KV caches huge and decoding memory-bound. Static token-eviction heuristics lose information, while learned approaches need costly training and still miss query-dependent importance. We need a cheap, query-aware way to keep only the tokens that actually matter during decoding.
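To make the scale of the problem concrete, here is a back-of-the-envelope estimate of KV-cache size. The helper `kv_cache_bytes` is a hypothetical name, and the model shape (32 layers, 8 KV heads, head dimension 128, fp16) is an assumed Llama-3.1-8B-style configuration rather than a figure taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to hold keys and values for every layer and KV head."""
    # The leading factor of 2 covers the separate key and value tensors.
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_elem)

# Assumed shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128, fp16.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"full KV cache at 32k tokens: {full / 2**30:.1f} GiB")   # 4.0 GiB
print(f"with an 8x reduction:        {full / 8 / 2**30:.2f} GiB")  # 0.50 GiB
```

The cache grows linearly with both sequence length and batch size, so a 4 GiB per-sequence footprint quickly dominates VRAM once several long requests are served concurrently.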

Main Contribution

Discovered functional sparsity in RoPE: a tiny subset of frequency chunks ('dominant FCs') drives contextual attention.

Proposed FASA, a two-stage, training-free pipeline: TIP (cheap token scoring via dominant FCs) + FAC (full attention on selected tokens).

Offered two hardware-aware variants: FASA-M (memory-optimized, offloads non-critical KV parts to CPU) and FASA-C (computation-optimized, on-GPU sparse access).

Ran extensive evaluations showing near-oracle accuracy across long-context understanding, long-sequence modeling, and long chain-of-thought tasks.

Key Findings

Dominant FCs are extremely sparse: a tiny fraction of FCs explain contextual attention.

Numbers: Dominant FCs ≤ 0.8% vs non-dominant ≈ 89–95% (Table 9)

Dominant FC sets are stable across tasks and models.

Numbers: Cross-task dominant-FC overlap ≈ 70–87% across datasets (Table 10)

FASA matches near-full-KV accuracy while cutting KV usage and latency.

Numbers: Nearly 100% of full-KV on LongBench-V1 with K=256; <0.7% avg drop vs full-KV (Intro, Table 2)

Memory and speed savings: FASA-M achieves up to 8× KV memory reduction; FASA-C yields up to 2.56× decoding speedup.

Numbers: 8× compression (FASA-M); 2.56× speedup with 18.9% cache on AIME24 (Abstract, D.1, Fig. 7)
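One simple way to check the cross-task stability finding on your own model is to calibrate dominant FCs on two tasks and measure how much the resulting index sets overlap. The sketch below assumes overlap is the fraction of one set found in the other; the paper's exact definition may differ, `fc_overlap` is a hypothetical helper, and the index lists are made-up illustrative values.

```python
def fc_overlap(set_a, set_b):
    """Fraction of set_a's FC indices that also appear in set_b (assumed metric)."""
    set_a, set_b = set(set_a), set(set_b)
    if not set_a:
        return 0.0
    return len(set_a & set_b) / len(set_a)

# Hypothetical dominant-FC indices for one head, calibrated on two different tasks.
task_qa = [3, 7, 12, 19, 25, 31, 40, 44]
task_summ = [3, 7, 12, 19, 25, 33, 40, 51]
print(f"overlap: {fc_overlap(task_qa, task_summ):.0%}")  # overlap: 75%
```

A high overlap suggests that a single offline calibration can be reused across workloads; a low one is a signal to recalibrate per task.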

Results

LongBench-V1 average (compared to full KV)

Value: ≈100% of full-KV when keeping 256 tokens

Baseline: FKV (full KV)

Decoding speedup (end-to-end)

Value: 2.56×

Baseline: FKV

KV memory compression (FASA-M)

Value: 8× reduction

Baseline: full KV on-GPU

Accuracy

Value: <0.7% loss

Baseline: FKV

Who Should Care

CTO
ML Engineer
Engineering Lead
Founder
Data Scientist
Product Manager

What To Try In 7 Days

Run the one-time offline FC calibration on your model with a small calibration set (the paper used a single sample).

Apply the provided FASA monkey-patch to FlashAttention2 and benchmark latency and KV memory on a 16k–32k workload.

If VRAM is tight, test FASA-M to offload non-dominant KV parts to CPU and measure end-to-end generation latency with prefetching enabled.
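For the calibration step, the sketch below shows one plausible way to pick dominant FCs offline: score each chunk by how well it alone recovers the top-k tokens chosen by the full query-key score, then keep the best chunks. The selection criterion, the function names, and the half-split pairing of dimensions into chunks are all assumptions made for illustration; consult the released code for the authors' actual procedure.

```python
import random

def top_k_set(scores, k):
    """Indices of the k largest scores, as a set."""
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return set(order[:k])

def calibrate_dominant_fcs(queries, keys, chunks, top_k, num_dominant):
    """Rank chunks by how well each one alone recovers the full-score top-k tokens."""
    agreement = [0.0] * len(chunks)
    for q in queries:
        full = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        full_top = top_k_set(full, top_k)
        for c, dims in enumerate(chunks):
            partial = [sum(q[d] * k[d] for d in dims) for k in keys]
            agreement[c] += len(top_k_set(partial, top_k) & full_top) / top_k
    agreement = [a / len(queries) for a in agreement]
    ranked = sorted(range(len(chunks)), key=lambda c: agreement[c], reverse=True)
    return ranked[:num_dominant], agreement

# Synthetic calibration data in which chunk 2 carries almost all of the key energy.
random.seed(0)
head_dim, n_tokens = 8, 128
chunks = [(i, i + head_dim // 2) for i in range(head_dim // 2)]  # assumed pairing
strong = set(chunks[2])
keys = [[random.gauss(0, 1.0 if d in strong else 0.01) for d in range(head_dim)]
        for _ in range(n_tokens)]
queries = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(16)]

dominant, agreement = calibrate_dominant_fcs(
    queries, keys, chunks, top_k=16, num_dominant=1)
print(dominant, [round(a, 2) for a in agreement])
```

On real models the queries and keys would be captured per layer and head from a forward pass over the calibration sample, and the chosen indices stored once for reuse at inference time.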

Optimization Features

Token Efficiency

query-aware top-k token selection
TIP: low-dim scoring using dominant FCs

Infra Optimization

just-in-time CPU→GPU transfers (FASA-M)
sparse on-GPU key access (FASA-C)

Model Optimization

low-dimensional FC selection

System Optimization

reduced memory bandwidth
integration with FlashAttention2
compatibility with PyramidKV

Inference Optimization

token selection (sparse attention)
reduced KV reads
GPU-CPU offload (FASA-M)

Reproducibility

Code Urls

https://github.com/AMAP-ML/FASA-ICLR2026

Data Urls

LongBench (public)
MATH500 (public)
AIME24 (public)
PG-19 / WikiText / C4 (public)

Code Available

Data Available

Open Source Status

partial

Risks & Boundaries

Limitations

Relies on positional encodings that expose FC structure (RoPE-like). Non-RoPE models need validation, though ALiBi and Partial-RoPE showed compatibility.
FASA-M adds CPU↔GPU transfers, which need careful prefetching to avoid latency regressions.
TIP scores are a selector, not a substitute for attention weights; replacing attention directly causes failure.

When Not To Use

Short-context workloads where KV cache is not a bottleneck.
Setups that cannot tolerate any risk of dropping rare tokens (for example, safety-critical outputs).
Models with positional encodings that do not show FC-like functional sparsity and where calibration fails.

Failure Modes

Misidentifying important tokens under rare, atypical queries can cause significant accuracy loss.
Replacing full attention with FC-proxy scores (instead of selecting tokens) yields catastrophic degradation.
Incorrect offline calibration (too few samples or wrong model checkpoint) can pick suboptimal FCs.

Core Entities

Models

Llama-3.2-3B
Llama-3.1-8B
Mistral-7B-v0.3
Qwen2.5-7B
Qwen2.5-14B
Qwen2.5-32B
R1-Distill-Llama-8B
DeepSeek-R1 variants

Metrics

F1
ROUGE
perplexity (PPL)
pass@1
speedup
compression ratio

Datasets

LongBench-V1
Qasper
GovReport
NarrativeQA
2WikiMQA
PG-19
WikiText
C4
MATH500
AIME24

Benchmarks

LongBench
MATH
AIME