Overview
The method plugs into existing inference stacks without training and shows consistent speedups on standard long-context benchmarks, but depends on Int4 Tensor Core support and per-head calibration.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
SALE cuts attention compute for very long inputs with no model retraining, lowering inference cost and enabling cheaper long-document apps while fitting into existing inference stacks.
Summary TLDR
SALE speeds up the prefilling (context-processing) stage of long-context LLM inference in three steps: it estimates attention weights element-wise from 4-bit quantized queries and keys, builds block-sparse masks using a Relative Attention Score, and computes attention only on the selected blocks. It needs no model training, adds overhead of roughly 11% of full-attention cost at long context lengths, and delivers a ≥3.36× attention-time speedup on Llama-3.1-8B for inputs of ≥64K tokens while keeping model outputs nearly unchanged on standard long-context benchmarks.
Problem Statement
During the LLM prefilling stage, full self-attention has compute cost that grows quadratically with context length, so it becomes the bottleneck for very long inputs. Existing sparse-attention shortcuts inspect attention only at a coarse granularity and lose accuracy as a result. The paper asks: can element-wise importance be estimated cheaply enough to build block-sparse masks that cut compute significantly while preserving model quality?
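As a rough illustration of the quadratic cost, the sketch below counts multiply-add FLOPs for the two matrix products in full attention (QK^T and PV). The function name and the defaults (32 heads, head dimension 128, one layer, no causal masking) are assumptions chosen for illustration, not figures from the paper.

```python
def attention_flops(seq_len: int, head_dim: int = 128, num_heads: int = 32) -> int:
    """Approximate FLOPs for QK^T plus PV in one layer of full attention."""
    # QK^T: seq_len x seq_len dot products of length head_dim, 2 FLOPs per multiply-add.
    qk = 2 * seq_len * seq_len * head_dim
    # PV: a seq_len x seq_len weight matrix times a seq_len x head_dim value matrix.
    pv = 2 * seq_len * seq_len * head_dim
    return num_heads * (qk + pv)


short, long_ = 8_192, 65_536
ratio = attention_flops(long_) / attention_flops(short)
print(f"{long_ // short}x longer context -> {ratio:.0f}x attention FLOPs")
```

Under this model an 8× longer context costs 64× more attention compute, which is why skipping blocks pays off mainly at long lengths.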
Main Contribution
SALE: a training-free pipeline that inspects attention element-wise using 4-bit quantized Q/K, then forms block-sparse masks using a Relative Attention Score metric.
Per-head offline threshold calibration to control error vs sparsity and keep outputs close to full attention.
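The two contributions above can be sketched in plain Python. This is a toy illustration, not the authors' C++/CUDA implementation: `quantize_int4`, `estimate_scores`, and `block_mask` are hypothetical names, the per-row symmetric quantization scheme is an assumption, causal masking is omitted, and the "relative" metric used here (an element's softmax weight divided by the largest weight in its row) is a stand-in because the summary does not give the exact definition of the Relative Attention Score.

```python
import math
import random


def quantize_int4(rows):
    """Symmetric per-row quantization to integers in [-7, 7]; returns (ints, scales)."""
    q_rows, scales = [], []
    for row in rows:
        scale = max(abs(x) for x in row) / 7 or 1.0  # avoid a zero scale for all-zero rows
        q_rows.append([max(-7, min(7, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales


def estimate_scores(q, k):
    """Approximate the scaled QK^T logits from 4-bit copies of Q and K."""
    qi, qs = quantize_int4(q)
    ki, ks = quantize_int4(k)
    norm = math.sqrt(len(q[0]))
    return [[sum(a * b for a, b in zip(qi[i], ki[j])) * qs[i] * ks[j] / norm
             for j in range(len(k))] for i in range(len(q))]


def block_mask(scores, block, tau):
    """Keep a (query-block, key-block) tile if any element in it passes the threshold.

    Stand-in metric (assumption): exp(s_ij - max_j s_ij), the element's softmax
    weight relative to the largest weight in its row.
    """
    nb_q = math.ceil(len(scores) / block)
    nb_k = math.ceil(len(scores[0]) / block)
    mask = [[False] * nb_k for _ in range(nb_q)]
    for i, row in enumerate(scores):
        row_max = max(row)
        for j, s in enumerate(row):
            if math.exp(s - row_max) >= tau:
                mask[i // block][j // block] = True
    return mask


random.seed(0)
q = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]
k = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]
mask = block_mask(estimate_scores(q, k), block=8, tau=0.5)
kept = sum(map(sum, mask))
print(f"kept {kept} of {len(mask) * len(mask[0])} blocks")
```

Raising `tau` keeps fewer blocks (more sparsity, more risk of error); lowering it keeps more. Exact attention would then be computed only on the tiles where `mask` is true.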
Key Findings
SALE cuts attention prefilling time by at least 3.36× relative to FlashAttention2 on Llama-3.1-8B for inputs ≥64K tokens.
The combined Selection and Quantization overhead shrinks relative to full attention as context length grows, reaching roughly 11% of full-attention cost for very long contexts.
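A back-of-envelope model shows how these two findings relate. It rests on two assumptions that the summary does not confirm: sparse attention time is proportional to the fraction of blocks kept, and the reported speedup already includes the overhead. The function names are hypothetical.

```python
def speedup(kept_fraction: float, overhead: float = 0.11) -> float:
    """Speedup over full attention when time = overhead + kept_fraction (both relative)."""
    return 1.0 / (overhead + kept_fraction)


def kept_fraction_for(target_speedup: float, overhead: float = 0.11) -> float:
    """Fraction of blocks that may be computed while still reaching the target speedup."""
    return 1.0 / target_speedup - overhead


print(f"{kept_fraction_for(3.36):.3f}")  # 0.188
print(f"{speedup(0.0):.2f}")             # 9.09, the ceiling set by overhead alone
```

Under these assumptions a 3.36× speedup corresponds to computing roughly 19% of the blocks, and an 11% overhead caps the achievable speedup at about 9× no matter how sparse the mask is.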
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Attention-time speedup (prefill) | 3.36× (Llama-3.1, 64K) | FlashAttention2 (1.00×) | ×3.36 | LongBench / Needle-In-A-Haystack (64K) | Table 1; Needle-In-A-Haystack results | Table 1; Figure 2 |
| Average LongBench score (Llama-3.1) | 48.39 (SALE) | 48.77 (FA2) | -0.38 | LongBench average | Table 1 (Average row) | Table 1 |
What To Try In 7 Days
Run SALE's C++/CUDA module on a copy of your long-context inference pipeline to measure attention-time savings.
Perform the per-head calibration (a few minutes on an RTX 4090) to set the threshold τ, then validate output quality on representative prompts.
If successful, deploy SALE for prefilling and monitor latency, cost, and any task-specific accuracy drift.
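The calibration step can be pictured as a threshold sweep for a single head: try increasingly aggressive values of τ and keep the last one whose output stays within a tolerance of full attention. The sketch below is a toy illustration under stated assumptions, not the paper's procedure: `attention`, `max_abs_error`, and `calibrate_tau` are hypothetical names, pruning is element-level rather than block-level for brevity, the error measure is maximum absolute deviation, and the candidate values and tolerance are arbitrary.

```python
import math
import random


def attention(q, k, v, tau=0.0):
    """Single-head attention that drops keys whose weight relative to the row max is < tau.

    tau=0.0 keeps everything (full attention). Use tau <= 1 so the top key survives.
    """
    out = []
    for qi in q:
        logits = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(len(qi)) for kj in k]
        top = max(logits)
        w = [math.exp(s - top) for s in logits]
        w = [x if x >= tau else 0.0 for x in w]
        z = sum(w)
        out.append([sum(x * vj[c] for x, vj in zip(w, v)) / z for c in range(len(v[0]))])
    return out


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two output matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def calibrate_tau(q, k, v, candidates, tolerance):
    """Sweep candidates in ascending order; stop at the first one that exceeds tolerance."""
    reference = attention(q, k, v)
    best = 0.0
    for tau in sorted(candidates):
        if max_abs_error(attention(q, k, v, tau), reference) > tolerance:
            break
        best = tau
    return best


random.seed(0)
q = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
k = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
v = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
tau = calibrate_tau(q, k, v, candidates=[0.01, 0.05, 0.1, 0.3], tolerance=0.05)
print("calibrated tau:", tau)
```

In a real deployment this sweep would be repeated for every attention head on representative calibration prompts, since the source describes τ as a per-head setting.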
Risks & Boundaries
Limitations
Requires GPUs with high-throughput Int4 tensor cores; benefit drops on hardware without fast 4-bit GEMM
Current implementation uses Int4 quantization; other low-bit formats (FP4/LUT GEMM) need adaptation
When Not To Use
On hardware lacking efficient 4-bit matrix multiplication
For short-context workloads where Selection overhead outweighs savings
Failure Modes
Quantization error may overestimate some attention contributions, selecting extra blocks and reducing speedup
Improper τ calibration can produce noticeable output deviation

