Use 4-bit QK estimates plus block-sparse masks to speed up long-context LLM prefilling with minimal quality loss

May 30, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Xiaodong Ji, Hailin Zhang, Fangcheng Fu, Bin Cui

Why It Matters For Business

SALE cuts attention compute for very long inputs with no model retraining, lowering inference cost and enabling cheaper long-document apps while fitting into existing inference stacks.

Summary TLDR

SALE speeds up the prefilling (context-processing) stage of long-context LLM inference in three steps: it estimates attention weights element-wise from 4-bit quantized queries and keys, builds block-sparse masks using a Relative Attention Score, and then computes attention only on the selected blocks. It requires no model training, its selection overhead shrinks to about 11% of full-attention time at 128K tokens, and it delivers at least a 3.36× attention-time speedup on Llama-3.1-8B for inputs of 64K tokens or more while keeping model outputs nearly unchanged on standard long-context benchmarks.
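As a rough illustration of the estimate-then-mask idea summarized here, the NumPy sketch below quantizes Q and K to 4-bit integer codes, estimates the score matrix from those codes, and marks which tiles to keep. It is a minimal sketch and not the paper's CUDA kernels: the function names, the symmetric per-row quantizer, and the stand-in relative score `exp(s - rowmax)` compared against a threshold `tau` are all assumptions made for illustration, and the paper's exact Relative Attention Score may be defined differently.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-row quantization to integer codes in [-7, 7] plus one scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def estimate_scores(Q, K):
    """Approximate Q @ K.T / sqrt(d) using only the 4-bit codes and their scales."""
    d = Q.shape[-1]
    q_codes, q_scale = quantize_int4(Q)
    k_codes, k_scale = quantize_int4(K)
    acc = q_codes.astype(np.int32) @ k_codes.astype(np.int32).T  # integer matmul
    return acc * q_scale * k_scale.T / np.sqrt(d)

def block_mask(scores, block, tau):
    """Keep a (query-block, key-block) tile if any element in it has an assumed
    relative score exp(s - rowmax) >= tau. Future keys are ignored (causal).
    Assumes the sequence length is a multiple of `block`."""
    n = scores.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))
    s = np.where(causal, scores, -np.inf)
    rel = np.exp(s - s.max(axis=-1, keepdims=True))  # 1 at the row max, 0 for future keys
    nb = n // block
    tiles = (rel >= tau).reshape(nb, block, nb, block)
    return tiles.any(axis=(1, 3))
```

A lower `tau` keeps more tiles (closer to full attention); a higher `tau` keeps fewer tiles and raises the approximation error.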

Problem Statement

Full self-attention in the LLM prefilling stage scales quadratically with context length and becomes the bottleneck for very long inputs. Existing sparse-attention methods inspect attention only coarsely and lose accuracy as a result. The paper asks: can we cheaply estimate element-wise importance to build block-sparse masks that cut compute significantly while preserving model quality?
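To make the quadratic growth concrete, here is a standard FLOP count for one attention head. The symbols are introduced here for illustration and are not from the summary: n is the number of context tokens and d_h is the head dimension.

```latex
\begin{aligned}
\text{FLOPs}_{QK^\top} &\approx 2 n^2 d_h, \qquad \text{FLOPs}_{PV} \approx 2 n^2 d_h,\\
\text{FLOPs}_{\text{attn}} &\approx 4 n^2 d_h \quad \text{per head (about half with a causal mask)},\\
\frac{\text{FLOPs}_{\text{attn}}(128\text{K})}{\text{FLOPs}_{\text{attn}}(8\text{K})} &= 16^2 = 256 .
\end{aligned}
```

Growing the context 16× therefore multiplies the attention work by 256, which is why skipping unimportant blocks pays off most at long lengths.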

Main Contribution

SALE: a training-free pipeline that inspects attention element-wise using 4-bit quantized Q/K, then forms block-sparse masks using a Relative Attention Score metric.

Per-head offline threshold calibration to control error vs sparsity and keep outputs close to full attention.

CUDA kernel optimizations (reduced dequantization, transformed comparisons) and integration with low-bit QKV execution to reduce Selection-Pass overhead to ~11% at large contexts.

Open-source implementation and empirical validation on Llama-3.1-8B and Qwen-2.5-32B across multiple long-context benchmarks.
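The per-head calibration step can be pictured with a small sketch like the one below. It assumes a simple procedure that is not confirmed by the source: for each head, try a grid of candidate thresholds on calibration data and keep the largest one whose relative L1 output error stays within a budget. The helper names, the grid, the budget, the non-causal element-level masking, and the stand-in relative score `exp(s - rowmax)` are all hypothetical simplifications.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sparse_output(Q, K, V, tau):
    """Attention that drops entries whose assumed relative score exp(s - rowmax) < tau.
    tau = 0.0 keeps every entry, i.e. full attention."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    rel = np.exp(s - s.max(axis=-1, keepdims=True))
    return softmax(np.where(rel >= tau, s, -np.inf)) @ V

def relative_l1(out, ref):
    return np.abs(out - ref).sum() / np.abs(ref).sum()

def calibrate_tau(Q, K, V, budget, grid):
    """Return the largest tau in `grid` whose relative L1 error is within `budget`.
    Falls back to 0.0 (full attention) if no candidate qualifies."""
    ref = sparse_output(Q, K, V, 0.0)
    best = 0.0
    for tau in sorted(grid):
        if relative_l1(sparse_output(Q, K, V, tau), ref) <= budget:
            best = tau
    return best

def calibrate_heads(Q, K, V, budget, grid):
    """Q, K, V have shape (heads, tokens, dim); one threshold is chosen per head."""
    return [calibrate_tau(Q[h], K[h], V[h], budget, grid) for h in range(Q.shape[0])]
```

Calibrating each head separately lets heads with concentrated attention use a more aggressive threshold than heads whose attention is spread out.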

Key Findings

SALE cuts attention prefilling time by at least 3.36× on Llama-3.1-8B for inputs ≥64K tokens.

Numbers: ≥3.36× speedup (64K, Table 1)

Overall Selection+Quantization overhead falls with length and is roughly 11% of full attention for very long contexts.

Numbers: Overhead ratio 23.9%→11.1% (8K→128K, Table 3)

Using 4-bit Q/K to estimate attention preserves quality while lowering inspection cost compared to full-precision inspection.

Numbers: Quantization times: 11–208 ms across 8K–128K; ablation shows full-precision inspection raises overhead (Fig. 5)

Per-head threshold calibration is important for best trade-offs and requires only minutes offline.

Numbers: Calibration time ≈ 5 minutes on an RTX4090 server (Appendix C)

Results

Attention-time speedup (prefill)

Value: 3.36× (Llama-3.1, 64K)

Baseline: FlashAttention2 (1.00×)

Average LongBench score (Llama-3.1)

Value: 48.39 (SALE)

Baseline: 48.77 (FA2)

Average LongBench score (Qwen-2.5)

Value: 51.30 (SALE)

Baseline: 50.85 (FA2)

Selection+Quantization overhead ratio

Value: 11.1% (128K); 23.9% (8K)

Baseline: Full attention (FA2)

Computation-Pass speedup vs full attention

Value: 5.57× (64K); 6.87× (128K)

Baseline: Full attention
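The overhead ratio and the Computation-Pass speedup can be combined into a rough end-to-end figure. The sketch below assumes a simple additive model that the source does not state: total SALE time equals the selection overhead (as a fraction of full-attention time) plus the sparse computation pass. The function name is hypothetical, and the result is a back-of-the-envelope estimate, not a number reported in the paper.

```python
def end_to_end_speedup(overhead_ratio, compute_speedup):
    """Speedup over full attention when the selection pass costs `overhead_ratio`
    of full-attention time and the sparse computation pass runs `compute_speedup`
    times faster than full attention (assumed additive cost model)."""
    return 1.0 / (overhead_ratio + 1.0 / compute_speedup)

# 128K figures quoted in the Results: 11.1% overhead, 6.87x computation-pass speedup.
estimate_128k = end_to_end_speedup(0.111, 6.87)
print(f"{estimate_128k:.2f}x")  # roughly 3.90x under this simple model
```

Under this model the estimate is consistent with the reported ≥3.36× attention-time speedup for inputs of 64K tokens and longer, and it shows why the selection overhead, not the sparse pass alone, bounds the achievable gain.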

Who Should Care

What To Try In 7 Days

Run SALE's C++/CUDA module on a copy of your long-context inference pipeline to measure attention-time savings.

Run the per-head calibration (about 5 minutes on an RTX4090 server) to set the threshold τ, then validate output quality on representative prompts.

If successful, deploy SALE for prefilling and monitor latency, cost, and any task-specific accuracy drift.

Optimization Features

Token Efficiency

  • Block-level skipping reduces per-token attention compute

Infra Optimization

  • Relies on GPUs with high-throughput Int4 Tensor Core support (e.g., RTX4090)

Model Optimization

  • No model weights changed; training-free method

System Optimization

  • CUDA kernel tricks: single-element dequantization, transformed comparisons, grouped reductions
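One plausible reading of "transformed comparisons" is sketched below; the source does not spell out the kernel, so treat this as an assumption rather than the paper's implementation. Instead of dequantizing every integer accumulator and evaluating an exponential before comparing to a threshold, the threshold itself is moved into the integer domain once per row. The names `acc`, `scale`, and `tau`, and the stand-in test `exp(s - rowmax) >= tau`, are hypothetical.

```python
import numpy as np

def mask_direct(acc, scale, tau):
    """Reference path: dequantize every element, then compare exp(s - rowmax) to tau."""
    s = acc * scale                      # one multiply per element
    m = s.max(axis=-1, keepdims=True)
    return np.exp(s - m) >= tau

def mask_transformed(acc, scale, tau):
    """Same decision without per-element dequantization or exp.
    Because scale > 0, exp(scale * (acc - rowmax)) >= tau is equivalent to
    acc >= rowmax + log(tau) / scale, a single threshold per row."""
    row_max = acc.max(axis=-1, keepdims=True)
    threshold = row_max + np.log(tau) / scale
    return acc >= threshold
```

Here `acc` holds integer dot products of the 4-bit codes and `scale` holds one positive dequantization factor per query row, so the inner loop reduces to integer-versus-threshold comparisons.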

Training Optimization

  • Not applicable

Inference Optimization

  • 4-bit Q/K estimation for element-wise importance
  • Block-sparse attention compute only on selected blocks
  • Integration with low-bit QKV execution (SageAttention)
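The "compute only on selected blocks" step can be sketched in NumPy as follows. This is an illustrative reference loop under stated assumptions, not the SALE kernel: the function name is hypothetical, the sequence length is assumed to be a multiple of the block size, and the diagonal tile is always kept so that every query sees at least itself.

```python
import numpy as np

def block_sparse_attention(Q, K, V, mask, block):
    """Causal attention that only visits (query-block, key-block) tiles where
    `mask` is True. Tiles above the diagonal are ignored; the diagonal tile is
    always included (an assumption of this sketch)."""
    n, d = Q.shape
    nb = n // block
    mask = mask | np.eye(nb, dtype=bool)
    out = np.zeros(V.shape, dtype=np.float64)
    for i in range(nb):
        rows = np.arange(i * block, (i + 1) * block)
        kept = np.flatnonzero(mask[i, : i + 1])            # selected key blocks
        cols = np.concatenate([np.arange(j * block, (j + 1) * block) for j in kept])
        s = Q[rows] @ K[cols].T / np.sqrt(d)
        s = np.where(cols[None, :] <= rows[:, None], s, -np.inf)  # causal inside tiles
        p = np.exp(s - s.max(axis=-1, keepdims=True))
        out[rows] = (p / p.sum(axis=-1, keepdims=True)) @ V[cols]
    return out
```

With a fully lower-triangular mask this reproduces dense causal attention; the fewer tiles the mask keeps, the less work each query block performs.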

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires GPUs with high-throughput Int4 tensor cores; benefit drops on hardware without fast 4-bit GEMM
  • Current implementation uses Int4 quantization; other low-bit formats (FP4/LUT GEMM) need adaptation
  • Per-head offline calibration is needed, and the resulting thresholds are somewhat model- and hardware-specific

When Not To Use

  • On hardware lacking efficient 4-bit matrix multiplication
  • For short-context workloads where Selection overhead outweighs savings
  • Where exact full-attention outputs are required without any approximation

Failure Modes

  • Quantization error can cause some attention contributions to be overestimated, so extra blocks are selected and the speedup shrinks
  • Improper τ calibration can produce noticeable output deviation
  • Implementation complexity and GPU-specific kernels may hinder deployment on some infra

Core Entities

Models

  • Llama-3.1-8B-Instruct
  • Qwen-2.5-32B-Instruct

Metrics

  • attention latency speedup
  • accuracy
  • attention sparsity
  • L1 output error

Datasets

  • LongBench
  • InfiniteBench
  • Needle-In-A-Haystack

Benchmarks

  • LongBench
  • InfiniteBench
  • Needle-In-A-Haystack