A time-series view explains why transformer attention heads show stable or random patterns and uses that signal to compress KV caches and to

January 29, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Qingyue Yang, Jie Wang, Xing Li, Yinqi Bai, Xialiang Tong, Huiling Zhen, Jianye Hao, Mingxuan Yuan, Bin Li

Why It Matters For Business

TAPPA gives a cheap, model-side signal (q-similarity) to decide which parts of a model and which cached tokens are compressible. That can cut memory and latency for long-context inference and allow more aggressive structured pruning with less accuracy loss.

Summary TLDR

The paper gives a simple unifying idea: whether an attention head shows a stable pattern or unpredictable jumps depends on how similar its query vectors are over time. The authors formalize this with TAPPA and a q-similarity metric, prove when common patterns (re-access, sequential, seasonal, periodic diagonals) arise, and show that q-similarity can guide practical tasks (KV cache compression and structured layer pruning) to get better compression with small or no accuracy loss.

Problem Statement

Prior work cataloged many attention-head patterns but lacked a single explanation and a practical, low-cost signal to pick which heads or layers are compressible.

Main Contribution

TAPPA: a time-series theory that links attention shapes to query temporal continuity (q-similarity) and RoPE channel behavior.

Mathematical conditions for three predictable attention patterns: re-access (sinks), sequential (diagonals), and seasonal/periodic patterns.

A simple per-layer/head metric (q-similarity) derived from TAPPA and shown to improve KV cache compression and structured layer pruning.

Empirical validation on multiple LLMs and benchmarks, with code released.

Key Findings

High q-similarity (smooth queries) predicts predictable attention heads; low q-similarity predicts retrieval-like, unpredictable heads.

Numbers: avg q-similarity ≈ 0.80 (Llama-3.1) and ≈ 0.86 (Qwen2.5) on evaluated datasets
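
To make the finding above concrete, here is a minimal sketch of one way to measure q-similarity. It assumes the metric is the mean cosine similarity between consecutive query vectors of a head; the paper's exact definition (window size, averaging scheme) may differ. The function name `q_similarity` and the synthetic "smooth" and "noisy" heads are illustrative and do not come from the released code.

```python
import numpy as np

def q_similarity(queries: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Mean cosine similarity between consecutive query vectors, per head.

    ASSUMPTION: this consecutive-step form is an illustrative reading of
    "cosine of recent queries", not the paper's verified definition.

    queries: array of shape (num_heads, seq_len, head_dim).
    Returns an array of shape (num_heads,).
    """
    # Normalize every query vector to unit length.
    q = queries / (np.linalg.norm(queries, axis=-1, keepdims=True) + eps)
    # Cosine between step t and step t+1 for every head.
    cos = np.sum(q[:, :-1, :] * q[:, 1:, :], axis=-1)
    return cos.mean(axis=-1)

rng = np.random.default_rng(0)
# Synthetic "smooth" head: a base query that drifts slowly over time.
base = rng.normal(size=(1, 1, 64))
smooth = base + 0.05 * np.cumsum(rng.normal(size=(1, 128, 64)), axis=1)
# Synthetic "noisy" head: an independent random query at every step.
noisy = rng.normal(size=(1, 128, 64))

scores = q_similarity(np.concatenate([smooth, noisy], axis=0))
print(scores)  # first entry close to 1, second close to 0
```

On real models the queries would be captured from each attention layer during a forward pass over a calibration set; the synthetic data here only shows that smooth queries score high and erratic ones score low.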

Layer pruning guided by TAPPA (q-similarity + Block Influence) improves average task accuracy under the same pruning ratio.

Numbers: Llama-3.1-8B at 28% pruning: avg 53.51 → 59.11 (Δ +5.60) on evaluated benchmarks

Integrating q-similarity into KV budget allocation improves compression results and can substantially boost some baselines.

Numbers: Qwen-2.5-7B, 512 KV: avg 24.25 → 35.59 (≈+46.8%) when adding TAPPA allocation to Expected Attention
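
The budget allocation idea can be sketched as follows. This is a hypothetical rule, not the paper's formula: each layer receives a floor of `min_budget` tokens, and the remaining budget is split in proportion to `1 - q_similarity`, so less predictable layers keep more cache. The function name, the floor, and the proportional weighting are all assumptions made for illustration.

```python
import numpy as np

def allocate_layer_budgets(q_sim, avg_budget: int, min_budget: int = 64) -> np.ndarray:
    """Split a total KV budget across layers, favouring low q-similarity layers.

    ASSUMPTION: weights proportional to (1 - q_sim) with a per-layer floor;
    the paper's actual allocation rule may differ.
    """
    q_sim = np.asarray(q_sim, dtype=float)
    n = q_sim.size
    total = avg_budget * n
    if min_budget * n > total:
        raise ValueError("min_budget is too large for the total budget")
    # Less predictable layers (low q-similarity) get larger weights.
    weights = np.clip(1.0 - q_sim, 1e-6, None)
    weights = weights / weights.sum()
    spare = total - min_budget * n
    budgets = min_budget + np.floor(weights * spare).astype(int)
    # Hand out tokens lost to rounding, least predictable layers first.
    leftover = int(total - budgets.sum())
    order = np.argsort(q_sim)
    budgets[order[:leftover]] += 1
    return budgets

# Four layers, average budget of 512 tokens per layer.
budgets = allocate_layer_budgets([0.95, 0.80, 0.60, 0.90], avg_budget=512)
print(budgets, budgets.sum())
```

The total stays equal to the uniform baseline (`avg_budget` times the number of layers), so any accuracy difference comes from where the tokens are kept rather than from how many.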

Results

q-similarity (per-head average)

Value: Llama-3.1 ≈ 0.80; Qwen2.5 ≈ 0.86 (on GSM8K/AIGC)

Accuracy

Value: Llama-3.1-8B avg 59.11 after TAPPA-guided pruning

Baseline: ShortGPT avg 53.51

KV cache compression (integration with Expected Attention)

Value: Qwen2.5-7B avg 35.59 with TAPPA allocation at 512 KV

Baseline: Expected Attention avg 24.25

Per-layer q-similarity overhead

Value: Latency <0.2 ms per layer; extra memory ~8.7 MB

Baseline: CAKE latency up to 0.874 ms and memory 520 MB at 32K

Who Should Care

What To Try In 7 Days

Compute per-layer q-similarity (cosine similarity between recent query vectors) on your model with a small calibration set.

Replace uniform KV budget allocation with q-similarity-adjusted layer budgets and measure end-to-end latency and accuracy on a few LongBench-like queries.

Use q-similarity combined with an existing layer importance metric (e.g., Block Influence) as a lightweight pruning proxy and test a small pruning ratio (10–30%) on downstream tasks.
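
The pruning step above could be prototyped roughly as below. Block Influence follows the ShortGPT definition (one minus the mean cosine similarity between a layer's input and output hidden states). The way it is combined with q-similarity here, `BI + weight * (1 - q_sim)`, is a hypothetical choice for illustration; the paper's actual combination and its hyperparameters are not reproduced, and `layers_to_prune` and `weight` are invented names.

```python
import numpy as np

def block_influence(h_in: np.ndarray, h_out: np.ndarray, eps: float = 1e-8) -> float:
    """Block Influence: 1 - mean cosine(h_in, h_out) over tokens.

    h_in, h_out: (num_tokens, hidden_dim) hidden states entering/leaving a layer.
    """
    num = np.sum(h_in * h_out, axis=-1)
    den = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1) + eps
    return float(1.0 - np.mean(num / den))

def layers_to_prune(bi, q_sim, ratio: float, weight: float = 0.5) -> list:
    """Return indices of the layers with the lowest combined importance.

    ASSUMPTION: importance = BI + weight * (1 - q_sim). Layers that barely
    change the hidden state AND have smooth queries rank as least important.
    """
    bi = np.asarray(bi, dtype=float)
    q_sim = np.asarray(q_sim, dtype=float)
    importance = bi + weight * (1.0 - q_sim)
    k = int(round(ratio * bi.size))
    return sorted(np.argsort(importance)[:k].tolist())

# Toy per-layer statistics for a 4-layer model (made-up numbers).
bi = [0.30, 0.05, 0.04, 0.20]
q_sim = [0.70, 0.90, 0.60, 0.95]
print(layers_to_prune(bi, q_sim, ratio=0.25))  # layer chosen with q-similarity
print(int(np.argmin(bi)))                      # layer chosen by BI alone
```

In this toy example, BI alone would drop layer 2, while the combined score drops layer 1 because layer 2 has low q-similarity. Any real use should validate the chosen layers on downstream tasks before committing to a pruning ratio.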

Optimization Features

Model Optimization

  • structured layer pruning guided by q-similarity

System Optimization

  • lower per-layer runtime and memory overhead for eviction signals

Inference Optimization

  • KV cache compression using q-similarity-based layer budgets
  • layer-wise budget allocation to prioritize unpredictable heads

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • q-similarity is model- and layer-dependent; per-model calibration is recommended.
  • TAPPA focuses on predictable heads; unpredictable retrieval heads remain critical and are not compressible using this signal.
  • The dominant-RoPE-channel assumption underlies some proofs and may not hold for every head.
  • Experiments are reported on 7–8B models; behavior may differ at extreme scales or for non-RoPE positional schemes.

When Not To Use

  • When most heads have low q-similarity (retrieval-heavy models), because q-similarity will not identify compressible parts.
  • On models that do not use RoPE, or that use very different positional encodings, unless you first verify that the theory still holds.
  • When you cannot afford any risk of quality loss from pruning, since the method requires careful calibration and validation.

Failure Modes

  • Misclassifying retrieval heads as compressible leads to loss of critical context and degraded factuality.
  • Over-reliance on layer-average q-similarity may hide important per-head variability.
  • Hyperparameter mis-tuning (α, β) can reduce gains or increase errors; some validation is required.

Core Entities

Models

  • Llama-3.1-8B
  • Qwen2.5-7B
  • Llama-2-7B

Metrics

  • q-similarity (cosine)
  • Accuracy
  • KV budget sizes (512, 1024, 2048 tokens)
  • Pruning ratio (%)

Datasets

  • LongBench
  • GSM8K
  • AIGC
  • PG19
  • PIQA
  • HellaSwag
  • HotpotQA
  • TriviaQA

Benchmarks

  • LongBench
  • GSM8K
  • PIQA
  • HellaSwag
  • HotpotQA