FlashInfer: a JIT‑compiled, block‑sparse attention engine that cuts LLM inference latency and supports custom attention variants

January 2, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Clear kernel and end‑to‑end speedups on A100/H100 with multiple serving systems. Works today on NVIDIA stacks but needs extra integration work and careful workspace sizing.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 75%

Production readiness: 80%

Novelty: 60%

Authors

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FlashInfer can cut inference latency and increase throughput in production LLM services, lowering GPU costs per query and improving user responsiveness.

Who Should Care

Summary TLDR

FlashInfer is an attention engine for LLM inference that (1) uses a flexible block‑sparse KV‑cache representation and composable formats to save memory and exploit shared prefixes, (2) provides a JIT compiler so users can plug in custom attention variants without hand‑writing kernels, and (3) runs a runtime load‑balanced scheduler that adapts to variable sequence lengths while remaining CUDA‑Graph compatible. On NVIDIA A100/H100 hardware and standard serving setups, FlashInfer cuts inter‑token latency by 29–69% vs. a Triton backend, reduces long‑context latency by ~28–30%, and speeds parallel generation by 13–17%. Code is open source.

Problem Statement

Existing attention kernels either assume dense, fixed layouts or target a narrow set of workloads. Real LLM serving mixes variable sequence lengths, shared prefixes, and many attention variants, causing memory inefficiency, poor kernel utilization, and load imbalance. The paper presents a single engine that unifies KV storage, supports many attention variants with JIT compilation, and schedules work dynamically for inference serving.

Main Contribution

Unified block‑sparse KV‑cache format and composable formats to represent diverse KV storage layouts and shared prefixes.

A customizable attention template plus JIT compiler that emits optimized CUDA/CUTLASS kernels for many attention variants.
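The template idea in the second contribution can be pictured as a fixed attention skeleton with a few user-supplied hooks. The sketch below is a pure-Python illustration only: the real engine emits CUDA/CUTLASS kernels, and the hook names (`logits_transform`, `logits_mask`), their signatures, and the helper functions are assumptions made for this example, not the library's API.

```python
import math
from typing import Callable, List, Optional

# Hypothetical hook signatures, chosen for this sketch only.
LogitsTransform = Callable[[float, int, int], float]  # (logit, q_pos, kv_pos) -> logit
LogitsMask = Callable[[int, int], bool]               # (q_pos, kv_pos) -> keep this key?


def attention_row(
    q: List[float],
    keys: List[List[float]],
    values: List[List[float]],
    q_pos: int,
    logits_transform: Optional[LogitsTransform] = None,
    logits_mask: Optional[LogitsMask] = None,
) -> List[float]:
    """Single-query softmax attention with pluggable variant hooks."""
    scale = 1.0 / math.sqrt(len(q))
    logits, kept = [], []
    for kv_pos, k in enumerate(keys):
        # Hook 1: drop masked-out keys before they reach the softmax.
        if logits_mask is not None and not logits_mask(q_pos, kv_pos):
            continue
        s = sum(a * b for a, b in zip(q, k)) * scale
        # Hook 2: rewrite the scaled logit (soft-capping, biases, ...).
        if logits_transform is not None:
            s = logits_transform(s, q_pos, kv_pos)
        logits.append(s)
        kept.append(kv_pos)
    out = [0.0] * len(values[0])
    if not logits:
        return out
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in logits]
    denom = sum(weights)
    for w, kv_pos in zip(weights, kept):
        for d, v in enumerate(values[kv_pos]):
            out[d] += (w / denom) * v
    return out


def causal_mask(q_pos: int, kv_pos: int) -> bool:
    """Keep only keys at or before the query position."""
    return kv_pos <= q_pos


def soft_cap(cap: float) -> LogitsTransform:
    """Variant hook: squash logits into (-cap, cap) with tanh."""
    def transform(logit: float, q_pos: int, kv_pos: int) -> float:
        return cap * math.tanh(logit / cap)
    return transform


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
first_token = attention_row(q, keys, values, q_pos=0, logits_mask=causal_mask)
capped = attention_row(q, keys, values, q_pos=2,
                       logits_transform=soft_cap(1.0), logits_mask=causal_mask)
```

Swapping `soft_cap` or `causal_mask` for another function changes the variant without touching the skeleton, which is the property a JIT-compiled template would preserve while generating a specialized kernel.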

Key Findings

FlashInfer reduces inter‑token latency versus a Triton backend in LLM serving

Numbers: 29–69% ITL reduction (Sec. 4, Abstract)

Practical Use: Drop per‑token latency by up to ~2/3; try FlashInfer as a backend to speed online text generation.

Evidence Ref: Abstract; Sec. 4 (Fig. 7)

FlashInfer lowers latency for long‑context inference when kernels are fused

Numbers: 28–30% ITL reduction for Streaming‑LLM with fused RoPE (Sec. 4.3, Fig. 9)

Practical Use: For million‑token or long‑context use cases, implement fused query/key transforms in FlashInfer to cut latency by ~30%.

Evidence Ref: Sec. 4.3 (Fig. 9)
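To illustrate what a fused query/key transform means here, the sketch below applies rotary position embeddings (RoPE) at the moment a logit is computed, instead of in a separate pass that rewrites the cached keys. It assumes the standard RoPE formulation with base 10000 and adjacent-pair rotation; the function names are hypothetical and this is plain Python, not the kernel the paper benchmarks.

```python
import math
from typing import List


def rope(vec: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate each adjacent pair (vec[i], vec[i+1]) by an angle proportional to pos."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency shrinks as the pair index grows
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def fused_rope_logit(q: List[float], k: List[float], q_pos: int, k_pos: int) -> float:
    """Rotate q and k on the fly and return the scaled dot product.

    The cached key stays unrotated, so its position can be reassigned
    later without rewriting the cache.
    """
    q_rot = rope(q, q_pos)
    k_rot = rope(k, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot)) / math.sqrt(len(q))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
near = fused_rope_logit(q, k, q_pos=5, k_pos=3)
shifted = fused_rope_logit(q, k, q_pos=12, k_pos=10)
```

Both calls use the same relative offset of 2, so `near` and `shifted` agree up to floating-point error. That relative-position property is why a streaming cache can keep raw keys and assign positions at attention time, and doing the rotation inside the attention kernel avoids an extra pass over the cache.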

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Inter‑Token Latency (ITL) vs Triton backend | 29–69% reduction | SGLang + Triton | 29–69% lower ITL | ShareGPT, Llama 3.1 8B/70B (Sec. 4.1) | Sec. 4.1, Fig. 7 | Fig. 7 |
| Long‑context Streaming‑LLM latency | 28–30% reduction | Unfused kernels / FlashAttention | 28–30% lower ITL | Streaming‑LLM on Vicuna‑13B (Sec. 4.3) | Sec. 4.3, Fig. 9 | Fig. 9 |
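A latency reduction is easy to misread as a speedup factor, so a short conversion may help when comparing these rows with throughput-style numbers. The helper below is a hypothetical convenience function, not something from the paper; it only restates the reported percentages.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert 'latency is X% lower' into a 'times faster' factor.

    New latency is (1 - X/100) of the old one, so the speedup is the reciprocal.
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


reported = {
    "ITL vs Triton backend": (29, 69),
    "Streaming-LLM with fused RoPE": (28, 30),
}
for label, (low, high) in reported.items():
    print(f"{label}: {speedup_from_reduction(low):.2f}x to "
          f"{speedup_from_reduction(high):.2f}x per token")
```

A 29–69% reduction corresponds to roughly 1.41x to 3.23x faster per-token generation, and 28–30% to roughly 1.39x to 1.43x. These are per-token time ratios; they do not by themselves say how much more load a fixed GPU fleet can absorb.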

What To Try In 7 Days

Replace attention backend in a staging vLLM/SGLang deployment with FlashInfer to measure ITL/TTFT changes.

JIT‑compile a fused attention+RoPE kernel for a long‑context pipeline and test end‑to‑end latency.

Enable composable formats for a parallel‑generation workload with shared prefixes and measure TTFT/ITL gains.
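For the measurements in these experiments, a small client-side helper is usually enough. The sketch below assumes you record one timestamp when the request is sent and one per streamed token (for example with `time.perf_counter()`), and it uses the common definitions of TTFT and ITL; the paper's exact measurement harness may differ, and the function names are made up for this example.

```python
from statistics import mean, median
from typing import Dict, List


def latency_metrics(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT and ITL statistics for one streamed response.

    request_start: time the request was sent (seconds).
    token_times:   arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft": token_times[0] - request_start,
        "itl_mean": mean(gaps) if gaps else 0.0,
        "itl_median": median(gaps) if gaps else 0.0,
    }


def relative_change_pct(baseline: float, candidate: float) -> float:
    """Negative result means the candidate backend is faster than the baseline."""
    return (candidate - baseline) / baseline * 100.0


# Made-up timestamps for illustration, not measured data.
example = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.9])
```

Run the same prompts against the baseline and the candidate backend, aggregate the per-request metrics, and compare them with `relative_change_pct`. Reporting the median alongside the mean helps when a few slow tokens skew the average.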

Optimization Features

Token Efficiency
Supports fine‑grained KV pruning and vector sparsity

Infra Optimization
TMA use on Hopper for contiguous loads (where applicable)
Asynchronous host->device plan uploads and pinned host buffers

System Optimization
CUTLASS/CUDA templates and tile selection heuristics
CUDA Graphs compatibility for fixed grid replay

Inference Optimization
Block‑sparse KV‑cache (arbitrary (Br,Bc) support)
Composable formats for shared prefixes
JIT‑compiled attention templates for custom variants
Dynamic load‑balanced scheduling for variable sequences
FP8 KV‑cache with mixed‑precision attention
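A block-sparse view of a paged KV-cache can be sketched with the usual compressed-row arrays: one row per request, one column block per page in a shared pool. The example below is an assumption-laden illustration (page size 2, string placeholders instead of tensors, made-up function and variable names); it shows only how two requests can point at the same prefix pages, not the paper's full composable-format decomposition.

```python
from typing import List, Tuple


def build_bsr(page_tables: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Flatten per-request page lists into CSR-style (indptr, indices) arrays.

    Request r owns indices[indptr[r]:indptr[r + 1]].
    """
    indptr, indices = [0], []
    for pages in page_tables:
        indices.extend(pages)
        indptr.append(len(indices))
    return indptr, indices


def gather_tokens(pool, indptr, indices, last_page_len, request):
    """Collect one request's KV entries by walking its pages in order."""
    start, end = indptr[request], indptr[request + 1]
    tokens = []
    for n, page_id in enumerate(indices[start:end]):
        page = pool[page_id]
        is_last = n == end - start - 1
        # Only the final page of a request may be partially filled.
        tokens.extend(page[: last_page_len[request]] if is_last else page)
    return tokens


# Global page pool; each page holds up to 2 KV entries (placeholders here).
pool = [["a0", "a1"], ["a2", "a3"], ["b0", None], ["c0", "c1"]]
# Both requests reuse pages 0 and 1 (the shared prefix), then diverge.
indptr, indices = build_bsr([[0, 1, 2], [0, 1, 3]])
last_page_len = [1, 2]
request0 = gather_tokens(pool, indptr, indices, last_page_len, 0)
request1 = gather_tokens(pool, indptr, indices, last_page_len, 1)
```

The shared prefix is stored once in the pool and referenced twice in `indices`, which is where the memory saving comes from. A real kernel would read the pages in place rather than copying them into a list.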


Risks & Boundaries

Limitations

Forward‑only attention kernels; no training/backward support yet.

Designed and evaluated on NVIDIA GPUs (CUDA/CUTLASS); portability to other hardware is future work.

When Not To Use

When you need attention backward kernels for training today.

On non‑NVIDIA hardware or backends where CUTLASS/TMA are unavailable.

Failure Modes

Workspace underallocation if planner upper bounds are too low, causing runtime failures.

Performance regressions when sparse gathering falls back to less efficient copies (e.g., certain FlashAttention‑3 (FA3) paths).
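The workspace failure mode is easier to reason about with a concrete planner in hand. The sketch below splits variable-length sequences into fixed-size chunks, balances them across workers with a generic longest-first greedy heuristic, and computes a worst-case chunk count for sizing a buffer ahead of time. All names are hypothetical, and the heuristic is a stand-in rather than the paper's scheduling algorithm.

```python
import heapq
from typing import List, Tuple


def plan_chunks(seq_lens: List[int], chunk_size: int,
                num_workers: int) -> List[List[Tuple[int, int, int]]]:
    """Split sequences into chunks and assign each to the least-loaded worker.

    Returns, per worker, a list of (seq_id, start, length) work items.
    """
    chunks = []
    for seq_id, n in enumerate(seq_lens):
        for start in range(0, n, chunk_size):
            chunks.append((min(chunk_size, n - start), seq_id, start))
    chunks.sort(reverse=True)  # place the longest chunks first
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    assignment = [[] for _ in range(num_workers)]
    for length, seq_id, start in chunks:
        load, w = heapq.heappop(heap)
        assignment[w].append((seq_id, start, length))
        heapq.heappush(heap, (load + length, w))
    return assignment


def max_chunks_upper_bound(max_batch: int, max_total_tokens: int,
                           chunk_size: int) -> int:
    """Worst-case chunk count for any batch within the given limits.

    Uses ceil(l / c) <= (l + c - 1) / c, summed over at most max_batch
    sequences whose lengths total at most max_total_tokens.
    """
    return (max_total_tokens + max_batch * (chunk_size - 1)) // chunk_size


seq_lens = [5, 1, 9]
plan = plan_chunks(seq_lens, chunk_size=4, num_workers=2)
num_chunks = sum(len(items) for items in plan)
bound = max_chunks_upper_bound(max_batch=3, max_total_tokens=15, chunk_size=4)
```

If a buffer is sized from `bound` once and then replayed (as a fixed-shape CUDA Graph capture would require), any later batch that exceeds `max_batch` or `max_total_tokens` can produce more chunks than the buffer holds. Validating each batch against the limits used at allocation time is a cheap guard.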

Core Entities

Models

Llama 3.1 8B, Llama 3.1 70B, Vicuna-13B

Metrics

Inter‑Token Latency (ITL), Time‑To‑First‑Token (TTFT), Throughput (tokens/s), TFLOPs / bandwidth utilization

Datasets

ShareGPT, MT-Bench, Synthetic variable workload (512–2048)

Benchmarks

LLM serving benchmark (ITL/TTFT), AttentionGym (variant tests), Streaming‑LLM long‑context benchmark