Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
SALE cuts attention compute for very long inputs with no model retraining, lowering inference cost and enabling cheaper long-document apps while fitting into existing inference stacks.
Summary TLDR
SALE speeds up the prefilling (context) stage of long-context LLM inference in three steps: it estimates attention weights element-wise from 4-bit quantized queries and keys, builds block-sparse masks with a Relative Attention Score, and then computes attention only on the selected blocks. It needs no model training, its selection overhead is about 11% of full-attention time at large context lengths, and it delivers at least a 3.36× attention-time speedup on Llama-3.1-8B for inputs of 64K tokens or more while keeping model outputs nearly unchanged on standard long-context benchmarks.
Problem Statement
Full self-attention during the LLM prefilling stage has a compute cost that grows quadratically with context length, so it becomes the bottleneck for very long inputs. Existing sparse-attention shortcuts inspect attention only coarsely and lose accuracy. The paper asks: can we cheaply estimate element-wise importance to build block-sparse masks that cut compute significantly while preserving model quality?
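To make the quadratic growth concrete, here is a rough sketch of the attention FLOP count during prefill. It assumes dense attention costing about 4·n²·d FLOPs per head (2·n²·d for Q·Kᵀ and 2·n²·d for the weighted sum over V), and it uses Llama-3.1-8B's 32 layers, 32 heads, and head dimension 128 as defaults. `attention_flops` is an illustrative helper introduced here, not part of SALE.

```python
def attention_flops(n_tokens, head_dim=128, n_heads=32, n_layers=32):
    """Approximate FLOPs for dense self-attention over n_tokens during prefill.

    Assumes 2*n^2*d FLOPs for Q @ K^T plus 2*n^2*d for weights @ V, per head
    per layer. Softmax cost and causal masking are ignored for simplicity.
    """
    per_head = 4 * n_tokens * n_tokens * head_dim
    return per_head * n_heads * n_layers


for n in (8_192, 65_536, 131_072):
    print(f"{n:>7} tokens: {attention_flops(n):.3e} FLOPs")

# Doubling the context length quadruples the attention cost.
ratio = attention_flops(131_072) / attention_flops(65_536)
```

The linear layers of the model scale only linearly with length, which is why attention dominates prefill time once inputs reach tens of thousands of tokens.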
Main Contribution
SALE: a training-free pipeline that inspects attention element-wise using 4-bit quantized Q/K, then forms block-sparse masks using a Relative Attention Score metric.
Per-head offline threshold calibration to control error vs sparsity and keep outputs close to full attention.
CUDA kernel optimizations (reduced dequantization, transformed comparisons) and integration with low-bit QKV execution to reduce Selection-Pass overhead to ~11% at large contexts.
Open-source implementation and empirical validation on Llama-3.1-8B and Qwen-2.5-32B across multiple long-context benchmarks.
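The selection stage described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's kernel: the quantizer is a simple symmetric per-row Int4 scheme, and because this summary does not define the Relative Attention Score, the sketch substitutes a stand-in (each estimated weight divided by its row maximum). The function names, the block size, and the value of `tau` are all hypothetical.

```python
import numpy as np


def quantize_int4(x):
    """Symmetric per-row 4-bit quantization: integers in [-7, 7] plus a float scale.

    Assumption: a simple illustrative scheme, not necessarily SALE's quantizer.
    """
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale


def estimate_block_mask(Q, K, block=4, tau=0.1):
    """Estimate element-wise importance from 4-bit Q/K and reduce it to a block mask."""
    n, d = Q.shape
    assert n % block == 0, "sequence length must be a multiple of the block size"
    q4, q_scale = quantize_int4(Q)
    k4, k_scale = quantize_int4(K)

    # Integer matmul on the quantized values, then rescale to approximate Q @ K^T.
    scores = (q4.astype(np.int32) @ k4.astype(np.int32).T) * (q_scale * k_scale.T)
    scores = scores / np.sqrt(d)

    # Causal masking: a query may only attend to itself and earlier keys.
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)

    # Stand-in relative score (assumption): weight relative to the row maximum.
    rel = np.exp(scores - scores.max(axis=-1, keepdims=True))
    important = rel >= tau

    # A block is kept if any element inside it is important.
    nb = n // block
    return important.reshape(nb, block, nb, block).any(axis=(1, 3))


rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 8))
K = rng.standard_normal((16, 8))
mask = estimate_block_mask(Q, K, block=4, tau=0.1)
print(mask.astype(int))
```

Raising `tau` keeps fewer blocks (more sparsity, more approximation error); lowering it keeps more. Blocks above the diagonal are never selected because every element in them is causally masked.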
Key Findings
SALE cuts attention prefilling time by at least 3.36× on Llama-3.1-8B for inputs ≥64K tokens.
The combined Selection+Quantization overhead shrinks as context length grows and is roughly 11% of full-attention time for very long contexts.
Using 4-bit Q/K to estimate attention preserves quality while lowering inspection cost compared to full-precision inspection.
Per-head threshold calibration is important for best trade-offs and requires only minutes offline.
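One way to picture per-head calibration is a small grid search: for each head, pick the largest threshold τ whose sparse output stays within an L1 error budget of the full-attention output. The sketch below is an assumption about how such a search could look, not the paper's procedure. For brevity it scores importance with full-precision Q/K and uses the same stand-in relative score (weight divided by row maximum); `calibrate_tau`, the candidate grid, and the error budget are hypothetical.

```python
import numpy as np


def attention(Q, K, V, keep=None):
    """Causal softmax attention; `keep` optionally restricts which (i, j) pairs are used."""
    n, d = Q.shape
    allowed = np.tril(np.ones((n, n), dtype=bool))
    if keep is not None:
        allowed &= keep
    s = np.where(allowed, Q @ K.T / np.sqrt(d), -np.inf)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V


def block_keep(Q, K, tau, block):
    """Element-level keep mask obtained by expanding the selected blocks."""
    n, d = Q.shape
    s = np.where(np.tril(np.ones((n, n), dtype=bool)), Q @ K.T / np.sqrt(d), -np.inf)
    rel = np.exp(s - s.max(axis=-1, keepdims=True))  # stand-in relative score
    nb = n // block
    blocks = (rel >= tau).reshape(nb, block, nb, block).any(axis=(1, 3))
    blocks |= np.eye(nb, dtype=bool)  # always keep diagonal blocks
    return np.repeat(np.repeat(blocks, block, axis=0), block, axis=1)


def calibrate_tau(Q, K, V, candidates, max_err, block=4):
    """Return the largest candidate tau whose relative L1 output error is <= max_err.

    Falls back to the smallest candidate if none meets the budget.
    """
    full = attention(Q, K, V)
    best = min(candidates)
    for tau in sorted(candidates):
        out = attention(Q, K, V, block_keep(Q, K, tau, block))
        err = np.abs(out - full).sum() / np.abs(full).sum()
        if err <= max_err:
            best = tau
    return best


# Calibrate each head independently on (hypothetical) sample activations.
rng = np.random.default_rng(0)
n_heads, n, d = 2, 16, 8
Q, K, V = (rng.standard_normal((n_heads, n, d)) for _ in range(3))
candidates = [0.01, 0.05, 0.2, 0.5]
taus = [calibrate_tau(Q[h], K[h], V[h], candidates, max_err=0.05) for h in range(n_heads)]
print(taus)
```

Because heads differ in how concentrated their attention is, a single global threshold would be too loose for some heads and too strict for others; calibrating per head lets each one reach the sparsity its error budget allows.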
Results
Attention-time speedup (prefill)
Average LongBench score (Llama-3.1)
Average LongBench score (Qwen-2.5)
Selection+Quantization overhead ratio
Computation-Pass speedup vs full attention
Who Should Care
What To Try In 7 Days
Run SALE's C++/CUDA module on a copy of your long-context inference pipeline to measure attention-time savings.
Perform the per-head calibration (a few minutes on an RTX 4090) to set the thresholds τ and validate output quality on representative prompts.
If successful, deploy SALE for prefilling and monitor latency, cost, and any task-specific accuracy drift.
Optimization Features
Token Efficiency
- Block-level skipping reduces per-token attention compute
Infra Optimization
- Relies on GPUs with high-throughput Int4 Tensor Core support (e.g., RTX 4090)
Model Optimization
- No model weights changed; training-free method
System Optimization
- CUDA kernel tricks: single-element dequantization, transformed comparisons, grouped reductions
Training Optimization
- Not applicable
Inference Optimization
- 4-bit Q/K estimation for element-wise importance
- Block-sparse attention compute only on selected blocks
- Integration with low-bit QKV execution (SageAttention)
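The computation pass, attention evaluated only on selected blocks, can be illustrated with a blockwise online-softmax loop that simply skips unselected blocks. This is a full-precision NumPy sketch for clarity, not the low-bit CUDA kernel the summary refers to; the function names and the random block mask are assumptions, and the sketch requires every diagonal block to be selected so that each query row has at least one key.

```python
import numpy as np


def block_sparse_attention(Q, K, V, block_mask, block):
    """Causal attention that visits only the key blocks marked True in block_mask."""
    n, d = Q.shape
    nb = n // block
    tri = np.tril(np.ones((block, block), dtype=bool))
    out = np.zeros((n, V.shape[1]))
    for i in range(nb):
        rows = slice(i * block, (i + 1) * block)
        m = np.full((block, 1), -np.inf)   # running row maximum
        l = np.zeros((block, 1))           # running softmax denominator
        acc = np.zeros((block, V.shape[1]))
        for j in range(i + 1):             # causal: only blocks at or before i
            if not block_mask[i, j]:
                continue                   # skipped block costs no compute
            cols = slice(j * block, (j + 1) * block)
            s = Q[rows] @ K[cols].T / np.sqrt(d)
            if j == i:
                s = np.where(tri, s, -np.inf)  # mask future keys in the diagonal block
            m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
            p = np.exp(s - m_new)
            alpha = np.exp(m - m_new)      # rescale earlier partial results
            l = alpha * l + p.sum(axis=-1, keepdims=True)
            acc = alpha * acc + p @ V[cols]
            m = m_new
        out[rows] = acc / l
    return out


def dense_masked_attention(Q, K, V, block_mask, block):
    """Reference: dense attention with the block mask expanded to element level."""
    n, d = Q.shape
    keep = np.repeat(np.repeat(block_mask, block, axis=0), block, axis=1)
    keep &= np.tril(np.ones((n, n), dtype=bool))
    s = np.where(keep, Q @ K.T / np.sqrt(d), -np.inf)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V


rng = np.random.default_rng(0)
n, d, block = 32, 8, 4
nb = n // block
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# Hypothetical mask: random lower-triangular selection, diagonal always kept.
block_mask = np.tril(rng.random((nb, nb)) < 0.5) | np.eye(nb, dtype=bool)

sparse_out = block_sparse_attention(Q, K, V, block_mask, block)
dense_out = dense_masked_attention(Q, K, V, block_mask, block)
print(np.abs(sparse_out - dense_out).max())
```

The two functions produce the same result for a given mask; the saving comes from the inner loop never touching unselected blocks, so compute scales with the number of kept blocks rather than with n².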
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires GPUs with high-throughput Int4 tensor cores; the benefit drops on hardware without fast 4-bit GEMM
- Current implementation uses Int4 quantization; other low-bit formats (FP4/LUT GEMM) need adaptation
- Per-head offline calibration is required, and the resulting thresholds are somewhat model- and hardware-specific
When Not To Use
- On hardware lacking efficient 4-bit matrix multiplication
- For short-context workloads where Selection overhead outweighs savings
- Where exact full-attention outputs are required without any approximation
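A back-of-the-envelope model helps judge when selection overhead outweighs the savings. Normalize full-attention time to 1 and assume, as a simplification, that the computation pass costs time proportional to the fraction of blocks kept (the density). The sketch also assumes the reported 3.36× speedup already includes the roughly 11% overhead and that both figures refer to the same setting; the summary does not confirm either point, so treat the result as indicative only.

```python
def modeled_speedup(density, overhead_frac):
    """Speedup over full attention, with full-attention time normalized to 1.

    density: fraction of attention blocks actually computed (0..1).
    overhead_frac: selection + quantization time as a fraction of full attention.
    """
    return 1.0 / (overhead_frac + density)


def implied_density(speedup, overhead_frac):
    """Invert the model: the block density consistent with an observed speedup."""
    return 1.0 / speedup - overhead_frac


# Figures quoted in this summary for very long contexts.
density = implied_density(3.36, 0.11)
print(f"implied density: {density:.3f}")
print(f"computation-pass speedup alone: {1.0 / density:.2f}x")

# Short-context illustration (hypothetical numbers): high overhead, low sparsity.
print(f"short-context speedup: {modeled_speedup(0.5, 0.6):.2f}x")
```

Under this model, SALE pays off only when overhead plus density stays below 1. At short context lengths the overhead fraction is larger and there is less sparsity to exploit, so the modeled speedup can fall below 1×.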
Failure Modes
- Quantization error may overestimate some attention contributions, selecting extra blocks and reducing speedup
- Improper τ calibration can produce noticeable output deviation
- Implementation complexity and GPU-specific kernels may hinder deployment on some infra
Core Entities
Models
- Llama-3.1-8B-Instruct
- Qwen-2.5-32B-Instruct
Metrics
- attention latency speedup
- Accuracy
- attention sparsity
- L1 output error
Datasets
- LongBench
- InfiniteBench
- Needle-In-A-Haystack
Benchmarks
- LongBench
- InfiniteBench
- Needle-In-A-Haystack

