Cut KV-cache costs by predicting important tokens from RoPE frequency chunks

February 3, 2026 · 7 min read

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Yifei Wang, Yueqi Wang, Zhenrui Yue, Huimin Zeng, Yong Wang, Ismini Lourentzou, Zhengzhong Tu, Xiangxiang Chu, Julian McAuley

Links

Abstract / PDF

Why It Matters For Business

FASA cuts GPU memory needs and memory bandwidth during long-input inference with almost no accuracy loss, lowering hosting costs and enabling long-context features on smaller hardware.

Summary TLDR

FASA is a training-free, two-stage method that predicts which past tokens matter for each query by inspecting a small set of RoPE frequency chunks (FCs). The first stage, TIP, uses those FCs to rank tokens cheaply; the second, FAC, runs full attention only on the top-ranked tokens. Across long-context benchmarks and long-generation tasks, FASA stays close to full-KV performance while cutting memory and bandwidth: nearly 100% of full-KV accuracy on LongBench-V1 when keeping 256 tokens, up to 8× KV memory reduction (FASA-M), and up to 2.56× end-to-end speedup on long chain-of-thought reasoning while using 18.9% of the cache (AIME24).
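The two stages can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the paper's implementation: it assumes a frequency chunk is the pair of head dimensions that RoPE rotates at the same frequency (dimensions 2i and 2i+1 here), that queries and keys already have RoPE applied, and that a single head is processed. The function names `tip_scores`, `fac_attention`, and `two_stage_attention` are hypothetical.

```python
import numpy as np

def tip_scores(q, K, dominant_fcs):
    """Stage 1 (TIP-like): score cached tokens using only the dominant FCs.

    q: (d,) query and K: (n, d) cached keys, both after RoPE.
    Assumption: FC i covers head dimensions 2*i and 2*i + 1.
    """
    dims = np.concatenate([[2 * i, 2 * i + 1] for i in dominant_fcs])
    return K[:, dims] @ q[dims]  # low-dimensional dot product per token

def fac_attention(q, K, V, keep):
    """Stage 2 (FAC-like): exact softmax attention restricted to kept tokens."""
    logits = K[keep] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ V[keep]

def two_stage_attention(q, K, V, dominant_fcs, budget):
    """Rank tokens cheaply, then attend fully over the top `budget` tokens."""
    scores = tip_scores(q, K, dominant_fcs)
    budget = min(budget, K.shape[0])
    keep = np.argpartition(scores, -budget)[-budget:]  # unordered top-budget
    return fac_attention(q, K, V, keep), np.sort(keep)
```

Note that the stage-1 scores only choose which tokens to keep; the attention weights in stage 2 are recomputed from the full-dimensional keys.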

Problem Statement

Long inputs make KV caches huge and decoding memory-bound. Existing token-eviction heuristics either lose information (static) or need costly training and still miss query-dependent importance. We need a cheap, query-aware way to keep only the tokens that actually matter during decoding.

Main Contribution

Discovered functional sparsity in RoPE: a tiny subset of frequency chunks ('dominant FCs') drives contextual attention.

Proposed FASA, a two-stage, training-free pipeline: TIP (cheap token scoring via dominant FCs) + FAC (full attention on selected tokens).

Offered two hardware-aware variants: FASA-M (memory-optimized, offloads non-critical KV parts to CPU) and FASA-C (computation-optimized, on-GPU sparse access).

Ran extensive evaluations showing near-oracle accuracy across long-context understanding, long-sequence modeling, and long chain-of-thought tasks.

Key Findings

Dominant FCs are extremely sparse: a tiny fraction of FCs explains contextual attention.

Numbers: Dominant FCs ≤ 0.8% vs non-dominant ≈ 89–95% (Table 9)

Dominant FC sets are stable across tasks and models.

Numbers: Cross-task dominant-FC overlap ≈ 70–87% across datasets (Table 10)

FASA nearly matches full-KV accuracy while cutting KV usage and latency.

Numbers: Nearly 100% of full-KV on LongBench-V1 with K=256; <0.7% avg drop vs full-KV (Intro, Table 2)

Memory and speed savings: FASA-M achieves up to 8× KV memory reduction; FASA-C yields up to 2.56× decoding speedup.

Numbers: 8× compression (FASA-M); 2.56× speedup with 18.9% cache on AIME24 (Abstract, D.1, Fig.7)
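To put the memory figure in perspective, here is a back-of-the-envelope calculation of KV-cache size. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the 32k sequence length are assumptions chosen for illustration, and the division by 8 simply applies the reported "up to 8×" figure rather than modeling how FASA-M achieves it.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to hold keys and values for one sequence."""
    # Leading factor 2 accounts for storing both K and V.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed shape: 32 layers, 8 KV heads, head_dim 128, fp16, 32k tokens.
full = kv_cache_bytes(32, 8, 128, seq_len=32_768)
reduced = full / 8  # applying the reported "up to 8x" reduction as-is

print(full / 2**30, reduced / 2**30)  # GiB: 4.0 0.5
```

Under these assumptions a single 32k-token sequence needs 4 GiB of KV cache, and an 8× reduction would leave 0.5 GiB on the GPU.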

Results

LongBench-V1 average (compared to full KV)

Value: ≈100% of full-KV when keeping 256 tokens

Baseline: FKV (full KV)

Decoding speedup (end-to-end)

Value: 2.56×

Baseline: FKV

KV memory compression (FASA-M)

Value: 8× reduction

Baseline: full KV on-GPU

Accuracy

Value: <0.7% loss

Baseline: FKV

Who Should Care

What To Try In 7 Days

Run the one-time offline FC calibration on your model with a small calibration set (paper used a single sample).

Apply the provided FASA monkey-patch to FlashAttention2 and benchmark latency and KV memory on a 16k–32k workload.

If VRAM is tight, test FASA-M to offload non-dominant KV parts to CPU and measure end-to-end generation latency with prefetching enabled.
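One way to picture the offline calibration step is sketched below. The summary does not spell out the paper's selection criterion, so this sketch makes one up: it ranks each FC by how well its partial scores alone recover the top-k tokens of the full query-key scores. The function name `rank_fcs_by_topk_agreement`, the overlap criterion, and the (2i, 2i+1) dimension layout are all assumptions.

```python
import numpy as np

def rank_fcs_by_topk_agreement(Q, K, top_k):
    """Rank FCs by how well each one alone recovers the full top-k tokens.

    Q: (m, d) calibration queries, K: (n, d) cached keys, both after RoPE.
    Assumption: FC i covers head dimensions 2*i and 2*i + 1.
    Returns (FC indices sorted best first, mean overlap per FC in [0, 1]).
    """
    m, d = Q.shape
    # Reference: top-k tokens per query under the full-dimensional scores.
    full_top = np.argsort(Q @ K.T, axis=1)[:, -top_k:].tolist()
    overlap = np.zeros(d // 2)
    for fc in range(d // 2):
        dims = [2 * fc, 2 * fc + 1]
        fc_top = np.argsort(Q[:, dims] @ K[:, dims].T, axis=1)[:, -top_k:].tolist()
        hits = [len(set(a) & set(b)) for a, b in zip(full_top, fc_top)]
        overlap[fc] = np.mean(hits) / top_k
    return np.argsort(-overlap), overlap
```

Taking the first few indices of the returned ranking gives a candidate dominant-FC set; comparing rankings across a couple of calibration samples is a cheap check on whether the set is stable for your checkpoint.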

Optimization Features

Token Efficiency

  • query-aware top-k token selection
  • TIP: low-dim scoring using dominant FCs

Infra Optimization

  • just-in-time CPU→GPU transfers (FASA-M)
  • sparse on-GPU key access (FASA-C)

Model Optimization

  • low-dimensional FC selection

System Optimization

  • reduced memory bandwidth
  • integration with FlashAttention2
  • compatibility with PyramidKV

Inference Optimization

  • token selection (sparse attention)
  • reduced KV reads
  • GPU-CPU offload (FASA-M)

Reproducibility

Data Urls

  • LongBench (public)
  • MATH500 (public)
  • AIME24 (public)
  • PG-19 / WikiText / C4 (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on positional encodings that expose FC structure (RoPE-like). Non-RoPE models need validation, though ALiBi and Partial-RoPE showed compatibility.
  • FASA-M adds CPU↔GPU transfers, which need careful prefetching to avoid latency regressions.
  • TIP scores are a selector, not a substitute for attention weights; replacing attention directly causes failure.

When Not To Use

  • Short-context workloads where KV cache is not a bottleneck.
  • Setups that cannot tolerate any risk of dropped rare tokens (extremely safety-critical outputs).
  • Models with positional encodings that do not show FC-like functional sparsity and where calibration fails.

Failure Modes

  • Misidentifying important tokens under rare, atypical queries causes significant accuracy loss.
  • Replacing full attention with FC-proxy scores (instead of selecting tokens) yields catastrophic degradation.
  • Incorrect offline calibration (too few samples or wrong model checkpoint) can pick suboptimal FCs.

Core Entities

Models

  • Llama-3.2-3B
  • Llama-3.1-8B
  • Mistral-7B-v0.3
  • Qwen2.5-7B
  • Qwen2.5-14B
  • Qwen2.5-32B
  • R1-Distill-Llama-8B
  • DeepSeek-R1 variants

Metrics

  • F1
  • ROUGE
  • perplexity (PPL)
  • pass@1
  • speedup
  • compression ratio

Datasets

  • LongBench-V1
  • Qasper
  • GovReport
  • NarrativeQA
  • 2Wikimqa
  • PG-19
  • WikiText
  • C4
  • MATH500
  • AIME24

Benchmarks

  • LongBench
  • MATH
  • AIME