Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
FASA cuts GPU memory needs and memory bandwidth during long-input inference with almost no accuracy loss, lowering hosting costs and enabling long-context features on smaller hardware.
Summary TLDR
FASA is a training-free, two-stage method that predicts which past tokens matter for each query by inspecting a small set of RoPE frequency chunks (FCs). The first stage, TIP, uses those FCs to rank tokens cheaply; the second stage, FAC, runs full attention only on the top-ranked tokens. Across long-context benchmarks and long-generation tasks, FASA stays close to full-KV performance while cutting memory and bandwidth: nearly 100% of full-KV performance on LongBench-V1 with 256 tokens, up to 8× KV memory reduction (FASA-M), and up to 2.56× end-to-end speedup in long chain-of-thought reasoning with a small cache.
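The two stages described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function names, the `budget` parameter, and the assumption that FC i occupies the interleaved dimensions (2i, 2i+1) of the post-RoPE vectors are all choices made for this sketch.

```python
import numpy as np

def tip_scores(q, K, dominant_fcs):
    """TIP stage: partial attention logits from a few frequency chunks.

    q: (d,) query after RoPE; K: (n, d) cached keys after RoPE.
    Assumption for this sketch: FC i maps to dimensions (2i, 2i+1).
    Real models may pair dimensions differently.
    """
    dims = np.concatenate([[2 * i, 2 * i + 1] for i in dominant_fcs])
    return K[:, dims] @ q[dims]

def fac_attention(q, K, V, keep):
    """FAC stage: exact softmax attention restricted to the kept tokens."""
    logits = K[keep] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[keep]

def fasa_step(q, K, V, dominant_fcs, budget):
    """One decoding step: rank tokens cheaply, then attend to the top ones."""
    scores = tip_scores(q, K, dominant_fcs)
    budget = min(budget, K.shape[0])
    # Indices of the `budget` highest-scoring tokens, restored to position order.
    keep = np.sort(np.argpartition(scores, -budget)[-budget:])
    return fac_attention(q, K, V, keep), keep
```

Note that the cheap scores only decide which tokens are kept; the output is still computed from full-dimensional logits over those tokens. When `budget` equals the cache length, the step reduces to ordinary full attention.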
Problem Statement
Long inputs make KV caches huge and decoding memory-bound. Static token-eviction heuristics lose information, while learned alternatives need costly training and still miss query-dependent importance. We need a cheap, query-aware way to keep only the tokens that actually matter during decoding.
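To make the scale of the problem concrete, the arithmetic below estimates KV cache size for a single sequence. The shape values (32 layers, 8 KV heads, head dimension 128) are assumed here to match a Llama-3.1-8B-style configuration with fp16 storage; the helper name `kv_cache_bytes` is hypothetical, and the 8× divisor is simply the upper bound quoted for FASA-M, not a measured result.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor 2 covers the key tensor and the value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Assumed shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 elements.
full = kv_cache_bytes(32_768, n_layers=32, n_kv_heads=8, head_dim=128)
full_gib = full / 2**30          # 4.0 GiB at 32k tokens
reduced_gib = full_gib / 8       # 0.5 GiB if the "up to 8x" bound is reached
print(full_gib, reduced_gib)
```

The figure scales linearly with both sequence length and batch size, which is why long-context serving becomes memory-bound quickly.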
Main Contribution
Discovered functional sparsity in RoPE: a tiny subset of frequency chunks ('dominant FCs') drives contextual attention.
Proposed FASA, a two-stage, training-free pipeline: TIP (cheap token scoring via dominant FCs) + FAC (full attention on selected tokens).
Offered two hardware-aware variants: FASA-M (memory-optimized, offloads non-critical KV parts to CPU) and FASA-C (computation-optimized, on-GPU sparse access).
Evaluated FASA extensively, showing near-oracle accuracy across long-context understanding, long-sequence modeling, and long chain-of-thought tasks.
Key Findings
Dominant FCs are extremely sparse: a tiny fraction of FCs explain contextual attention.
Dominant FC sets are stable across tasks and models.
FASA stays close to full-KV accuracy while cutting KV usage and latency.
Memory and speed savings: FASA-M achieves up to 8× KV memory reduction; FASA-C yields up to 2.56× decoding speedup.
Results
LongBench-V1 average (compared to full KV): nearly 100% with 256 tokens
Decoding speedup (end-to-end): up to 2.56×
KV memory compression (FASA-M): up to 8×
Accuracy
Who Should Care
What To Try In 7 Days
Run the one-time offline FC calibration on your model with a small calibration set (paper used a single sample).
Apply the provided FASA monkey-patch to FlashAttention2 and benchmark latency and KV memory on a 16k–32k workload.
If VRAM is tight, test FASA-M to offload non-dominant KV parts to CPU and measure end-to-end generation latency with prefetching enabled.
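The summary does not spell out how the offline FC calibration works, so the following is only one plausible sketch: score each FC by how well its partial logits reproduce the top-k tokens chosen by the full logits on a calibration sample, then keep the best-ranked FCs. The function name, the overlap criterion, the interleaved (2i, 2i+1) dimension layout, and the omission of causal masking are all assumptions of this example.

```python
import numpy as np

def rank_fcs_by_topk_overlap(Q, K, k=16):
    """Rank frequency chunks by agreement with full-attention top-k.

    Q: (m, d) calibration queries, K: (n, d) keys, both after RoPE.
    Returns (order, overlap): FC indices from most to least predictive,
    and the mean top-k overlap in [0, 1] for each FC.
    Causal masking is ignored to keep the sketch short.
    """
    n_fc = Q.shape[1] // 2
    full_top = np.argsort(Q @ K.T, axis=1)[:, -k:]           # (m, k)
    overlap = np.zeros(n_fc)
    for fc in range(n_fc):
        dims = [2 * fc, 2 * fc + 1]                          # assumed layout
        part_top = np.argsort(Q[:, dims] @ K[:, dims].T, axis=1)[:, -k:]
        hits = [len(set(a) & set(b)) for a, b in zip(full_top, part_top)]
        overlap[fc] = np.mean(hits) / k
    return np.argsort(-overlap), overlap
```

In practice the ranking would be computed per layer and per head on real activations; checking that the selected set stays the same across a few different calibration samples is a cheap sanity test before benchmarking.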
Optimization Features
Token Efficiency
- query-aware top-k token selection
- TIP: low-dim scoring using dominant FCs
Infra Optimization
- just-in-time CPU→GPU transfers (FASA-M)
- sparse on-GPU key access (FASA-C)
Model Optimization
- low-dimensional FC selection
System Optimization
- reduced memory bandwidth
- integration with FlashAttention2
- compatibility with PyramidKV
Inference Optimization
- token selection (sparse attention)
- reduced KV reads
- GPU-CPU offload (FASA-M)
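The memory-optimized layout listed above can be illustrated with a toy model that tracks where data lives and how many bytes move per step. This sketch uses plain NumPy arrays to stand in for GPU and CPU memory and does not model asynchronous prefetching; the class name, its methods, and the interleaved FC layout are hypothetical and not taken from the FASA code.

```python
import numpy as np

class OffloadedKVCache:
    """Toy FASA-M-style layout: only the dominant-FC key slice stays
    resident; full K/V rows are fetched on demand for selected tokens."""

    def __init__(self, K, V, dominant_fcs):
        # Assumed layout: FC i maps to dimensions (2i, 2i+1).
        self.dims = np.concatenate([[2 * i, 2 * i + 1] for i in dominant_fcs])
        self.resident_keys = K[:, self.dims].copy()  # stands in for GPU memory
        self.host_K, self.host_V = K, V              # stands in for CPU memory
        self.bytes_fetched = 0

    def select(self, q, budget):
        """Rank tokens using only the resident slice; return kept indices."""
        scores = self.resident_keys @ q[self.dims]
        budget = min(budget, scores.shape[0])
        return np.sort(np.argpartition(scores, -budget)[-budget:])

    def fetch(self, keep):
        """Copy full K/V rows for the kept tokens and count bytes moved."""
        K_sel, V_sel = self.host_K[keep], self.host_V[keep]
        self.bytes_fetched += K_sel.nbytes + V_sel.nbytes
        return K_sel, V_sel
```

Comparing `resident_keys.nbytes` against the full K/V footprint, and `bytes_fetched` per step against available transfer bandwidth, gives a rough first estimate of whether offloading is likely to pay off on a given machine.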
Reproducibility
Data Urls
- LongBench (public)
- MATH500 (public)
- AIME24 (public)
- PG-19 / WikiText / C4 (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on positional encodings that expose FC structure (RoPE-like). Non-RoPE models need validation, though ALiBi and Partial-RoPE showed compatibility.
- FASA-M adds CPU↔GPU transfers which need careful prefetching to avoid latency regressions.
- TIP scores are a selector, not a substitute for attention weights; replacing attention directly causes failure.
When Not To Use
- Short-context workloads where KV cache is not a bottleneck.
- Setups that cannot tolerate any risk of dropping rare tokens (for example, extremely safety-critical outputs).
- Models with positional encodings that do not show FC-like functional sparsity and where calibration fails.
Failure Modes
- Misidentifying important tokens under rare, atypical queries causes significant accuracy loss.
- Replacing full attention with FC-proxy scores (instead of selecting tokens) yields catastrophic degradation.
- Incorrect offline calibration (too few samples or wrong model checkpoint) can pick suboptimal FCs.
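One way to watch for the first and third failure modes during evaluation is to measure how much of the full softmax attention mass the selected tokens capture. This diagnostic is not described in the source; the function name and the idea of flagging low values are assumptions of this sketch, and it requires the full key cache, so it suits offline checks rather than production decoding.

```python
import numpy as np

def attention_mass_recall(q, K, keep):
    """Fraction of full softmax attention mass that falls on kept tokens.

    q: (d,) query; K: (n, d) keys; keep: integer indices of selected tokens.
    Values near 1.0 mean the selection lost little; low values flag queries
    where the selector missed tokens that full attention relies on.
    """
    logits = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return float(weights[keep].sum())
```

Logging this value per layer on a held-out set, and inspecting the queries with the lowest recall, can help separate a poor FC choice from a token budget that is simply too small.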
Core Entities
Models
- Llama-3.2-3B
- Llama-3.1-8B
- Mistral-7B-v0.3
- Qwen2.5-7B
- Qwen2.5-14B
- Qwen2.5-32B
- R1-Distill-Llama-8B
- DeepSeek-R1 variants
Metrics
- F1
- ROUGE
- perplexity (PPL)
- pass@1
- speedup
- compression ratio
Datasets
- LongBench-V1
- Qasper
- GovReport
- NarrativeQA
- 2WikiMQA
- PG-19
- WikiText
- C4
- MATH500
- AIME24
Benchmarks
- LongBench
- MATH
- AIME

