FlashInfer: a JIT‑compiled, block‑sparse attention engine that cuts LLM inference latency and supports custom attention variants

Overview

Decision Snapshot: Ready For Pilot

Clear kernel and end‑to‑end speedups on A100/H100 with multiple serving systems. Works today on NVIDIA stacks but needs extra integration work and careful workspace sizing.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 75%

Production readiness: 80%

Novelty: 60%

Authors

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze


Why It Matters For Business

FlashInfer can cut inference latency and increase throughput in production LLM services, lowering GPU costs per query and improving user responsiveness.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead

Summary TLDR

FlashInfer is an attention engine for LLM inference that (1) uses a flexible block‑sparse KV‑cache representation and composable formats to save memory and exploit shared prefixes, (2) provides a JIT compiler so users can plug in custom attention variants without hand‑writing kernels, and (3) runs a runtime load‑balanced scheduler that adapts to variable sequence lengths while remaining CUDA‑Graph compatible. On NVIDIA A100/H100 hardware and standard serving setups, FlashInfer cuts inter‑token latency by 29–69% vs. a Triton backend, reduces long‑context latency by ~28–30%, and speeds parallel generation by 13–17%. Code is open source.
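The load-balancing idea in point (3) can be illustrated with a toy planner. This is a minimal sketch under stated assumptions, not FlashInfer's scheduler: the function name `plan_chunks`, the parameters `num_workers` and `max_chunk`, and the greedy longest-first policy are all hypothetical choices made for illustration. It splits each request's KV range into bounded chunks and assigns every chunk to the currently least-loaded worker.

```python
import heapq

def plan_chunks(kv_lens, num_workers, max_chunk):
    """Toy planner: split each request's KV range into chunks and balance them.

    Returns one list per worker of (request_id, start, end) tuples.
    All names are hypothetical and do not mirror FlashInfer's API.
    """
    # Split every request into chunks of at most max_chunk tokens.
    chunks = []
    for req_id, n in enumerate(kv_lens):
        for start in range(0, n, max_chunk):
            chunks.append((req_id, start, min(start + max_chunk, n)))

    # Greedy longest-first assignment to the currently least-loaded worker.
    chunks.sort(key=lambda c: c[2] - c[1], reverse=True)
    heap = [(0, w) for w in range(num_workers)]  # (assigned tokens, worker id)
    plan = [[] for _ in range(num_workers)]
    for chunk in chunks:
        load, w = heapq.heappop(heap)
        plan[w].append(chunk)
        heapq.heappush(heap, (load + chunk[2] - chunk[1], w))
    return plan

kv_lens = [4096, 300, 512, 2048]  # variable sequence lengths in one batch
plan = plan_chunks(kv_lens, num_workers=4, max_chunk=1024)
loads = [sum(end - start for _, start, end in work) for work in plan]
print(loads)  # tokens assigned to each worker
```

Without splitting, the worker that received the 4096-token request would hold more than half of the batch's work. Keeping the number of workers fixed while only the per-worker plan changes is one plausible way to stay compatible with replaying a fixed launch grid, which is the property CUDA Graphs need.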

Problem Statement

Existing attention kernels either assume dense, fixed layouts or target a narrow set of workloads. Real LLM serving mixes variable sequence lengths, shared prefixes, and many attention variants, causing memory inefficiency, poor kernel utilization, and load imbalance. The paper presents a single engine that unifies KV storage, supports many attention variants with JIT compilation, and schedules work dynamically for inference serving.

Main Contribution

Unified block‑sparse KV‑cache format and composable formats to represent diverse KV storage layouts and shared prefixes.

A customizable attention template plus JIT compiler that emits optimized CUDA/CUTLASS kernels for many attention variants.
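The idea of a customizable attention template can be sketched in plain Python. The sketch below is an assumption-laden illustration, not FlashInfer's template or JIT interface: the `logits_transform` hook and the helper names `soft_cap` and `sliding_window` are hypothetical. It shows how one fixed softmax-attention skeleton can serve several variants when the variant-specific logic is confined to a small per-logit function, which is the kind of fragment a JIT compiler could inline into a generated kernel.

```python
import math

def attention(q, keys, values, logits_transform=None):
    """Single-query softmax attention with an optional per-logit hook."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    if logits_transform is not None:
        # The hook sees each logit and its key position.
        logits = [logits_transform(x, kv_idx) for kv_idx, x in enumerate(logits)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def soft_cap(cap):
    """Variant: squash logits into (-cap, cap) with tanh."""
    return lambda x, kv_idx: cap * math.tanh(x / cap)

def sliding_window(window, kv_len):
    """Variant: mask out keys older than the last `window` positions."""
    return lambda x, kv_idx: x if kv_idx >= kv_len - window else float("-inf")

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out_plain = attention(q, keys, values)
out_cap = attention(q, keys, values, soft_cap(30.0))
out_win = attention(q, keys, values, sliding_window(1, len(keys)))
print(out_plain, out_cap, out_win)
```

The skeleton never changes between the three calls; only the hook does. In a real engine the hook would be compiled into the kernel rather than called per element from Python.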

Key Findings

FlashInfer reduces inter‑token latency versus a Triton backend in LLM serving

Numbers: 29–69% ITL reduction (Sec. 4, Abstract)

Practical Use: Cut per‑token latency by up to ~2/3; try FlashInfer as a backend to speed up online text generation.

Evidence Ref: Abstract; Sec.4 (Fig.7)

FlashInfer lowers latency for long‑context inference when kernels are fused

Numbers: 28–30% ITL reduction for Streaming‑LLM with fused RoPE (Sec.4.3, Fig.9)

Practical Use: For long‑context use cases, fuse query/key transforms such as RoPE into the attention kernel in FlashInfer; the reported Streaming‑LLM test saw ~28–30% lower latency from this fusion.

Evidence Ref: Sec.4.3 (Fig.9)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Inter‑Token Latency (ITL) vs Triton backend	29–69% reduction	SGLang + Triton	29–69% lower ITL	ShareGPT, Llama 3.1 8B/70B (Sec.4.1)	Sec.4.1, Fig.7	Fig.7
Long‑context Streaming‑LLM latency	28–30% reduction	Unfused kernels / FlashAttention	28–30% lower ITL	Streaming‑LLM on Vicuna‑13B (Sec.4.3)	Sec.4.3, Fig.9	Fig.9
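A latency reduction is not the same number as a speedup, so it can help to convert the deltas in the table. The following is a small arithmetic sketch: the helper name `speedup_from_reduction` is hypothetical, and it assumes that the reduction applies uniformly to per-token latency.

```python
def speedup_from_reduction(reduction_pct):
    """Convert 'X% lower latency' into a speedup factor: 1 / (1 - X/100)."""
    return 1.0 / (1.0 - reduction_pct / 100.0)

for pct in (29, 69, 28, 30):
    print(f"{pct}% lower latency -> {speedup_from_reduction(pct):.2f}x faster")
```

Under that assumption, the 29–69% ITL reduction corresponds to roughly 1.41x to 3.23x faster token generation per stream, and the 28–30% long-context reduction to roughly 1.39x to 1.43x.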

What To Try In 7 Days

Replace attention backend in a staging vLLM/SGLang deployment with FlashInfer to measure ITL/TTFT changes.

JIT‑compile a fused attention+RoPE kernel for a long‑context pipeline and test end‑to‑end latency.

Enable composable formats for a parallel‑generation workload with shared prefixes and measure TTFT/ITL gains.
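To compare backends in the experiments above, TTFT and ITL need to be computed the same way on both sides. Here is a minimal sketch, assuming you can record a timestamp when a request is sent and one for each streamed output token. The function name `latency_metrics` and the sample timestamps are hypothetical; in a real test the timestamps would come from something like `time.monotonic()` around your serving client.

```python
import statistics

def latency_metrics(request_start, token_times):
    """Compute TTFT and mean ITL from arrival timestamps (in seconds)."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_start
    # ITL is the gap between consecutive output tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = statistics.mean(gaps) if gaps else 0.0
    return {"ttft_s": ttft, "mean_itl_s": mean_itl}

# Made-up timestamps for one request on each backend (illustration only).
baseline = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
candidate = latency_metrics(0.0, [0.5, 0.625, 0.75, 0.875])

itl_reduction_pct = 100.0 * (1.0 - candidate["mean_itl_s"] / baseline["mean_itl_s"])
print(baseline, candidate, itl_reduction_pct)
```

Averaging over many requests, and reporting percentiles alongside the mean, gives a fairer picture than a single request like the one shown here.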

Optimization Features

Token Efficiency

Supports fine‑grained KV pruning and vector sparsity

Infra Optimization

TMA use on Hopper for contiguous loads (where applicable)

Asynchronous host->device plan uploads and pinned host buffers

System Optimization

CUTLASS/CUDA templates and tile selection heuristics

CUDA Graphs compatibility for fixed grid replay

Inference Optimization

Block‑sparse KV‑cache (arbitrary (Br,Bc) support)

Composable formats for shared prefixes

JIT‑compiled attention templates for custom variants

Dynamic load‑balanced scheduling for variable sequences

FP8 KV‑cache with mixed‑precision attention
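A block-sparse view of the KV-cache can be pictured as a CSR-style index over a pool of fixed-size pages. The sketch below is illustrative only: the names `pages`, `indptr`, `indices`, `last_page_len`, and `gather_kv` are assumptions for this example, strings stand in for key/value tensors, and it models plain page sharing rather than the composable formats listed above.

```python
PAGE_SIZE = 4  # tokens per KV page (illustrative value)

# Pool of physical pages. Strings stand in for real key/value tensors.
pages = {
    0: ["sys0", "sys1", "sys2", "sys3"],  # prefix shared by both requests
    1: ["a0", "a1", "a2"],                # request A suffix (partly filled)
    2: ["b0", "b1", "b2", "b3"],          # request B suffix
    3: ["b4"],                            # request B suffix (partly filled)
}

# CSR-style index: request i owns page ids indices[indptr[i]:indptr[i + 1]].
indptr = [0, 2, 5]
indices = [0, 1, 0, 2, 3]
last_page_len = [3, 1]  # valid tokens in each request's final page

def gather_kv(req):
    """Return the logical KV sequence of one request from the page pool."""
    ids = indices[indptr[req]:indptr[req + 1]]
    tokens = []
    for n, page_id in enumerate(ids):
        is_last = n == len(ids) - 1
        valid = last_page_len[req] if is_last else PAGE_SIZE
        tokens.extend(pages[page_id][:valid])
    return tokens

print(gather_kv(0))
print(gather_kv(1))
print(len(indices), "page references,", len(set(indices)), "physical pages")
```

Both requests reference page 0, so the shared prefix is stored once. The non-zero blocks of the sparse matrix are exactly the (request, page) pairs recorded in `indices`.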

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/flashinfer-ai/flashinfer

http://flashinfer.ai

Data URLs

https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

https://github.com/attention‑benchmarks/MT‑Bench

Risks & Boundaries

Limitations

Forward‑only attention kernels; no training/backward support yet.

Designed and evaluated on NVIDIA GPUs (CUDA/CUTLASS); portability to other hardware is future work.

When Not To Use

When you need attention backward kernels for training today.

On non‑NVIDIA hardware or backends where CUTLASS/TMA are unavailable.

Failure Modes

Workspace underallocation if planner upper bounds are too low, causing runtime failures.

Performance regressions when sparse gathering falls back to less efficient copies (e.g., certain FA3 paths).

Core Entities

Models

Llama 3.1 8B

Llama 3.1 70B

Vicuna-13B

Metrics

Inter‑Token Latency (ITL)

Time‑To‑First‑Token (TTFT)

Throughput (tokens/s)

TFLOPs / bandwidth utilization

Datasets

ShareGPT

MT-Bench

Synthetic variable workload (512–2048)

Benchmarks

LLM serving benchmark (ITL/TTFT)

AttentionGym (variant tests)

Streaming‑LLM long‑context benchmark

