Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
TAPPA gives a cheap, model-side signal (q-similarity) to decide which parts of a model and which cached tokens are compressible. That can cut memory and latency for long-context inference and allow more aggressive structured pruning with less accuracy loss.
Summary TLDR
The paper offers a simple unifying idea: whether an attention head shows a stable pattern or unpredictable jumps depends on how similar its query vectors are over time. The authors formalize this with TAPPA and a q-similarity metric, prove when common patterns (re-access, sequential, seasonal, periodic diagonals) arise, and show that q-similarity can guide practical tasks (KV cache compression and structured layer pruning) toward better compression with little or no accuracy loss.
Problem Statement
Prior work cataloged many attention-head patterns but lacked a single explanation and a practical, low-cost signal to pick which heads or layers are compressible.
Main Contribution
TAPPA: a time-series theory that links attention shapes to query temporal continuity (q-similarity) and RoPE channel behavior.
Mathematical conditions for three predictable attention patterns: re-access (sinks), sequential (diagonals), and seasonal/periodic patterns.
A simple per-layer/head metric (q-similarity) derived from TAPPA and shown to improve KV cache compression and structured layer pruning.
Empirical validation on multiple LLMs and benchmarks, with code released.
Key Findings
High q-similarity (smooth queries) predicts predictable attention heads; low q-similarity predicts retrieval-like, unpredictable heads.
Layer pruning guided by TAPPA (q-similarity + Block Influence) improves average task accuracy under the same pruning ratio.
Integrating q-similarity into KV budget allocation improves compression results and can substantially boost some baselines.
Results
q-similarity (per-head average)
Accuracy
KV cache compression (integration with Expected Attention)
Per-layer q-similarity overhead
Who Should Care
What To Try In 7 Days
Compute per-layer q-similarity (cosine of recent queries) on your model with a small calibration set.
Replace uniform KV budget allocation with q-similarity-adjusted layer budgets and measure end-to-end latency and accuracy on a few LongBench-like queries.
Use q-similarity combined with an existing layer importance metric (e.g., Block Influence) as a lightweight pruning proxy and test a small pruning ratio (10–30%) on downstream tasks.
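The first step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes q-similarity is the mean cosine similarity between consecutive query vectors of one head, and that a layer score is the plain average over its heads. The function names and toy inputs are hypothetical, and how you capture query vectors (for example with forward hooks) depends on your framework.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def head_q_similarity(queries):
    """Mean cosine similarity between consecutive query vectors of one head.

    queries: list of T vectors, one per token position (T >= 2).
    ASSUMPTION: this consecutive-pair definition is illustrative only.
    """
    if len(queries) < 2:
        raise ValueError("need at least two query vectors")
    sims = [cosine(queries[t - 1], queries[t]) for t in range(1, len(queries))]
    return sum(sims) / len(sims)

def layer_q_similarity(heads):
    """Average of per-head scores. heads: list of per-head query sequences."""
    scores = [head_q_similarity(q) for q in heads]
    return sum(scores) / len(scores)

# Toy data: one head whose queries never change, one whose queries keep jumping.
smooth_head = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
jumpy_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
layer_score = layer_q_similarity([smooth_head, jumpy_head])
```

On real models the same calculation would run over a small calibration set, and keeping the per-head scores alongside the layer average helps avoid hiding per-head variability.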
Optimization Features
Model Optimization
- structured layer pruning guided by q-similarity
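One way to picture q-similarity-guided layer pruning is a combined importance score. The sketch below is an assumption, not the paper's formula: it takes precomputed Block Influence values (higher means the layer changes its hidden states more), treats high q-similarity as a sign of redundancy, and mixes the two with a hypothetical `weight` parameter. The function name `layers_to_prune` is also hypothetical.

```python
def layers_to_prune(block_influence, q_similarity, ratio, weight=0.5):
    """Return indices of the layers with the lowest combined importance.

    block_influence: per-layer Block Influence values (higher = more important).
    q_similarity:    per-layer q-similarity values (higher = more predictable).
    ratio:           fraction of layers to prune, e.g. 0.25.
    weight:          HYPOTHETICAL mixing weight between the two signals.
    """
    if len(block_influence) != len(q_similarity):
        raise ValueError("need one value of each metric per layer")
    n_prune = int(len(block_influence) * ratio)
    # ASSUMED rule: a layer is less important when it barely changes hidden
    # states (low Block Influence) and its queries are smooth (high q-similarity).
    importance = [
        (1.0 - weight) * bi + weight * (1.0 - qs)
        for bi, qs in zip(block_influence, q_similarity)
    ]
    order = sorted(range(len(importance)), key=lambda i: importance[i])
    return sorted(order[:n_prune])

# Toy example with four layers and a 50% pruning ratio.
pruned = layers_to_prune(
    block_influence=[0.9, 0.1, 0.5, 0.2],
    q_similarity=[0.2, 0.9, 0.5, 0.8],
    ratio=0.5,
)
```

Any such rule should be validated on downstream tasks before use, since the mixing weight and the direction of each signal are choices made here for illustration.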
System Optimization
- lower per-layer runtime and memory overhead for eviction signals
Inference Optimization
- KV cache compression using q-similarity-based layer budgets
- layer-wise budget allocation to prioritize unpredictable heads
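A layer-wise budget rule of this kind might look like the sketch below. It is an illustrative assumption rather than the paper's allocation scheme: the total KV budget is held fixed, and each layer's share grows as its q-similarity falls, so less predictable layers keep more cached tokens. The name `allocate_kv_budgets` and the `strength` parameter are hypothetical.

```python
def allocate_kv_budgets(q_similarity, avg_budget, strength=1.0):
    """Split a fixed total KV budget across layers, favoring unpredictable ones.

    q_similarity: per-layer q-similarity values (cosine, so at most 1.0).
    avg_budget:   tokens per layer under uniform allocation (e.g. 1024).
    strength:     HYPOTHETICAL knob; 0.0 reproduces uniform allocation.
    """
    n = len(q_similarity)
    total = avg_budget * n
    # ASSUMED rule: weight rises linearly as q-similarity drops.
    weights = [1.0 + strength * (1.0 - qs) for qs in q_similarity]
    weight_sum = sum(weights)
    budgets = [int(total * w / weight_sum) for w in weights]
    # Rounding down leaves a few tokens unassigned; hand them to the
    # least predictable layers first so the total stays exact.
    leftover = total - sum(budgets)
    by_unpredictability = sorted(range(n), key=lambda i: q_similarity[i])
    for k in range(leftover):
        budgets[by_unpredictability[k % n]] += 1
    return budgets

# Toy example: three layers, average budget of 1024 tokens per layer.
budgets = allocate_kv_budgets([0.9, 0.5, 0.1], avg_budget=1024)
```

Comparing this against uniform allocation at the same total budget isolates the effect of the q-similarity signal from the effect of simply caching more tokens.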
Reproducibility
Data Urls
- https://github.com/MIRALab-USTC/LLM-TAPPA (configs)
- LongBench (Bai et al., 2024)
- GSM8K (Cobbe et al., 2021)
- PG19 (Rae et al., 2019)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- q-similarity is model- and layer-dependent; per-model calibration is recommended.
- TAPPA focuses on predictable heads; unpredictable retrieval heads remain critical and are not compressible using this signal.
- The dominant-RoPE-channel assumption underlies some proofs and may not hold for every head.
- Experiments are reported on 7–8B models; behavior may differ at extreme scales or for non-RoPE positional schemes.
When Not To Use
- When most heads have low q-similarity (retrieval-heavy models), since q-similarity will not identify compressible parts.
- On models that do not use RoPE or use very different positional encodings without verifying the theory.
- When you cannot afford any risk of quality loss, since pruning requires careful calibration and validation.
Failure Modes
- Misclassifying retrieval heads as compressible leads to loss of critical context and degraded factuality.
- Over-reliance on layer-average q-similarity may hide important per-head variability.
- Hyperparameter mis-tuning (α, β) can reduce gains or increase errors; some validation is required.
Core Entities
Models
- Llama-3.1-8B
- Qwen2.5-7B
- Llama-2-7B
Metrics
- q-similarity (cosine)
- Accuracy
- KV budget sizes (512, 1024, 2048 tokens)
- Pruning ratio (%)
Datasets
- LongBench
- GSM8K
- AIGC
- PG19
- PIQA
- HellaSwag
- HotpotQA
- TriviaQA
Benchmarks
- LongBench
- GSM8K
- PIQA
- HellaSwag
- HotpotQA

