FlashInfer: a JIT‑compiled, block‑sparse attention engine that cuts LLM inference latency and supports custom attention variants

January 2, 2025 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.75

Citation Count

2

Authors

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze

Why It Matters For Business

FlashInfer can cut inference latency and increase throughput in production LLM services, lowering GPU costs per query and improving user responsiveness.

Summary TLDR

FlashInfer is an attention engine for LLM inference that (1) uses a flexible block‑sparse KV‑cache representation and composable formats to save memory and exploit shared prefixes, (2) provides a JIT compiler so users can plug in custom attention variants without hand‑writing kernels, and (3) runs a runtime load‑balanced scheduler that adapts to variable sequence lengths while remaining CUDA‑Graph compatible. On NVIDIA A100/H100 hardware and standard serving setups, FlashInfer cuts inter‑token latency by 29–69% vs. a Triton backend, reduces long‑context latency by ~28–30%, and speeds parallel generation by 13–17%. Code is open source.

Problem Statement

Existing attention kernels either assume dense, fixed layouts or target a narrow set of workloads. Real LLM serving mixes variable sequence lengths, shared prefixes, and many attention variants, causing memory inefficiency, poor kernel utilization, and load imbalance. The paper presents a single engine that unifies KV storage, supports many attention variants with JIT compilation, and schedules work dynamically for inference serving.

Main Contribution

Unified block‑sparse KV‑cache format and composable formats to represent diverse KV storage layouts and shared prefixes.

A customizable attention template plus JIT compiler that emits optimized CUDA/CUTLASS kernels for many attention variants.

A dynamism‑aware, load‑balanced scheduler that maps variable‑length sequence workloads onto fixed, CUDA Graph‑compatible kernels.

End‑to‑end integration with popular serving frameworks and comprehensive kernel and serving evaluations on A100/H100 GPUs.
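As a rough illustration of the "customizable attention template" idea, the sketch below shows a reference single-query attention routine in plain Python with two pluggable hooks. The hook names (`logits_transform`, `logits_mask`) and the helper functions are hypothetical and chosen for this example; they are not FlashInfer's API, and the real engine compiles such hooks into CUDA/CUTLASS kernels rather than calling Python functions.

```python
import math
from typing import Callable, List, Optional

Vector = List[float]


def attention_row(
    q: Vector,
    keys: List[Vector],
    values: List[Vector],
    q_pos: int,
    logits_transform: Optional[Callable[[float], float]] = None,
    logits_mask: Optional[Callable[[int, int], bool]] = None,
) -> Vector:
    """Single-query softmax attention with two optional variant hooks.

    Assumes at least one KV position stays visible after masking.
    """
    scale = 1.0 / math.sqrt(len(q))
    logits = []
    for kv_pos, k in enumerate(keys):
        s = scale * sum(a * b for a, b in zip(q, k))
        if logits_transform is not None:
            s = logits_transform(s)  # e.g. soft-capping
        if logits_mask is not None and not logits_mask(q_pos, kv_pos):
            s = -math.inf  # masked positions get zero weight
        logits.append(s)
    m = max(logits)
    weights = [math.exp(s - m) for s in logits]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]


# Three example variants expressed only through the hooks (illustrative).
def soft_cap(cap: float) -> Callable[[float], float]:
    return lambda s: cap * math.tanh(s / cap)


def causal(q_pos: int, kv_pos: int) -> bool:
    return kv_pos <= q_pos


def sliding_window(window: int) -> Callable[[int, int], bool]:
    return lambda q_pos, kv_pos: q_pos - window < kv_pos <= q_pos


K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
causal_out = attention_row([1.0, 0.0], K, V, q_pos=1, logits_mask=causal)
capped_out = attention_row([1.0, 0.0], K, V, q_pos=2, logits_transform=soft_cap(30.0))
```

The point of the design is that the softmax loop stays fixed while each variant only supplies small functions, which is what makes generating a specialized kernel per variant plausible.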

Key Findings

FlashInfer reduces inter‑token latency versus a Triton backend in LLM serving

Numbers: 29–69% ITL reduction (Sec. 4, Abstract)

FlashInfer lowers latency for long‑context inference when kernels are fused

Numbers: 28–30% ITL reduction for Streaming‑LLM with fused RoPE (Sec. 4.3, Fig. 9)

Composable block‑sparse formats speed parallel generation with shared prefixes

Numbers: 13–17% ITL speedup at moderate parallel tokens (n=4 peak) (Sec. 4.4, Fig. 10)

Fusing RoPE into attention raises kernel bandwidth usage

Numbers: 1.6–3.7× higher bandwidth for fused RoPE kernel vs. unfused (Sec. 4.3, Fig. 9)
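The shared-prefix finding relies on being able to compute attention over the shared prefix and over each request's unique suffix separately, then combine the two results exactly. This summary does not spell out how that combination is done; the sketch below shows one standard way, merging partial attention outputs with their log-sum-exp values. The function names are hypothetical and the code is plain Python for clarity, not a description of FlashInfer's kernels.

```python
import math
from typing import List, Tuple

Vector = List[float]
State = Tuple[Vector, float]  # (normalized output, log-sum-exp of the logits)


def partial_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> State:
    """Softmax attention over one KV segment, also returning its log-sum-exp."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(s - m) for s in logits]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)


def merge_states(a: State, b: State) -> State:
    """Combine two segment results into the result over their union."""
    (out_a, lse_a), (out_b, lse_b) = a, b
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    total = wa + wb
    out = [(wa * x + wb * y) / total for x, y in zip(out_a, out_b)]
    return out, m + math.log(total)


# Shared prefix (computed once per group) plus a per-request suffix.
q = [0.5, -1.0]
prefix_k, prefix_v = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
suffix_k, suffix_v = [[1.0, 1.0]], [[5.0, 6.0]]

merged, _ = merge_states(
    partial_attention(q, prefix_k, prefix_v),
    partial_attention(q, suffix_k, suffix_v),
)
full, _ = partial_attention(q, prefix_k + suffix_k, prefix_v + suffix_v)
```

Here `merged` and `full` agree up to floating-point rounding, which is why the prefix segment can be stored and processed in a different block format than the suffixes without changing the result.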

Results

Inter‑Token Latency (ITL) vs Triton backend

Value: 29–69% reduction

Baseline: SGLang + Triton

Long‑context Streaming‑LLM latency

Value: 28–30% reduction

Baseline: Unfused kernels / FlashAttention

Parallel generation speedup (ITL/TTFT)

Value: 13–17% speedup (peak at n=4)

Baseline: Single block format

Kernel bandwidth for fused RoPE

Value: 1.6–3.7× higher bandwidth

Baseline: Unfused RoPE + attention
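To make "fused RoPE" concrete, here is a minimal sketch, assuming the common interleaved-pair formulation of rotary embeddings with base 10000. The unfused path first materializes a rotated copy of every key and then computes logits; the fused path rotates each key as it is read. Function names are illustrative only. In a real kernel the saving comes from avoiding the extra pass over memory, which plain Python cannot show; the code only demonstrates that both paths produce the same logits.

```python
import math
from typing import List

Vector = List[float]


def rope(x: Vector, pos: int, base: float = 10000.0) -> Vector:
    """Rotate consecutive pairs (x[i], x[i+1]) by an angle proportional to pos."""
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def logits_unfused(q: Vector, q_pos: int, keys: List[Vector]) -> List[float]:
    rotated_q = rope(q, q_pos)
    rotated_keys = [rope(k, p) for p, k in enumerate(keys)]  # extra materialized copy
    return [dot(rotated_q, rk) for rk in rotated_keys]


def logits_fused(q: Vector, q_pos: int, keys: List[Vector]) -> List[float]:
    rotated_q = rope(q, q_pos)
    # Rotate each key on the fly while computing its logit.
    return [dot(rotated_q, rope(k, p)) for p, k in enumerate(keys)]


q = [0.3, -0.7, 1.1, 0.2]
keys = [[1.0, 0.5, -0.2, 0.4], [0.1, -0.9, 0.8, 0.3], [0.6, 0.6, -0.5, 1.2]]
same = all(
    abs(a - b) < 1e-12
    for a, b in zip(logits_unfused(q, 2, keys), logits_fused(q, 2, keys))
)
```

Because the rotation depends only on a position index, keys can stay unrotated in the cache and receive whatever position the serving policy assigns at attention time.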

Who Should Care

What To Try In 7 Days

Replace attention backend in a staging vLLM/SGLang deployment with FlashInfer to measure ITL/TTFT changes.

JIT‑compile a fused attention+RoPE kernel for a long‑context pipeline and test end‑to‑end latency.

Enable composable formats for a parallel‑generation workload with shared prefixes and measure TTFT/ITL gains.
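For the measurements suggested above, a small helper like the following can turn per-token arrival timestamps into TTFT and ITL figures. It is a sketch under simple assumptions: timestamps are in seconds, they are recorded client-side for one streamed request, and the numbers in the example are made up for illustration rather than taken from the paper.

```python
from statistics import mean
from typing import List


def ttft(request_start: float, token_times: List[float]) -> float:
    """Time-to-first-token: delay between sending the request and the first token."""
    return token_times[0] - request_start


def inter_token_latencies(token_times: List[float]) -> List[float]:
    """Gaps between consecutive tokens after the first one."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional latency reduction of candidate vs. baseline (0.29 means 29%)."""
    return (baseline - candidate) / baseline


# Illustrative timestamps only (seconds since the request was sent).
baseline_tokens = [0.40, 0.45, 0.50, 0.55]
candidate_tokens = [0.30, 0.33, 0.36, 0.39]

baseline_itl = mean(inter_token_latencies(baseline_tokens))
candidate_itl = mean(inter_token_latencies(candidate_tokens))
itl_gain = relative_reduction(baseline_itl, candidate_itl)
ttft_gain = relative_reduction(ttft(0.0, baseline_tokens), ttft(0.0, candidate_tokens))
```

When comparing backends, it helps to keep the model, batch composition, and request arrival pattern identical between runs so that the difference can be attributed to the attention backend.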

Optimization Features

Token Efficiency

  • Supports fine‑grained KV pruning and vector sparsity

Infra Optimization

  • TMA use on Hopper for contiguous loads (where applicable)
  • Asynchronous host->device plan uploads and pinned host buffers

System Optimization

  • CUTLASS/CUDA templates and tile selection heuristics
  • CUDA Graphs compatibility for fixed grid replay
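One way to picture how variable-length requests can be mapped onto a fixed grid is the sketch below. It is a simplified, hypothetical planner and not the paper's algorithm as written: it splits each request's KV range into chunks no longer than a computed cap, then greedily assigns chunks, longest first, to whichever of a fixed number of workers currently has the least load. Keeping the worker count constant is the property that allows a captured launch configuration to be replayed while only the plan data changes.

```python
import heapq
from typing import List, Tuple

WorkItem = Tuple[int, int, int]  # (request index, kv_start, kv_end)


def plan(kv_lens: List[int], num_workers: int, min_chunk: int = 1) -> List[List[WorkItem]]:
    """Split KV ranges into chunks and balance them over a fixed number of workers."""
    total = sum(kv_lens)
    chunk = max(min_chunk, -(-total // num_workers))  # ceiling division
    items: List[WorkItem] = []
    for req, n in enumerate(kv_lens):
        for start in range(0, n, chunk):
            items.append((req, start, min(start + chunk, n)))
    # Longest chunks first, each to the currently least-loaded worker.
    items.sort(key=lambda it: it[2] - it[1], reverse=True)
    heap = [(0, w) for w in range(num_workers)]
    queues: List[List[WorkItem]] = [[] for _ in range(num_workers)]
    for item in items:
        load, w = heapq.heappop(heap)
        queues[w].append(item)
        heapq.heappush(heap, (load + item[2] - item[1], w))
    return queues


def loads(queues: List[List[WorkItem]]) -> List[int]:
    return [sum(end - start for _, start, end in q) for q in queues]


# Four requests with uneven KV lengths mapped onto four workers.
queues = plan([2048, 512, 512, 1024], num_workers=4)
per_worker = loads(queues)
```

Without splitting, the worker that receives the 2048-token request would carry twice the average load; with splitting, every worker in this example processes 1024 tokens. Chunks that belong to the same request yield partial outputs that still have to be combined in a later step.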

Inference Optimization

  • Block‑sparse KV‑cache (arbitrary (Br,Bc) support)
  • Composable formats for shared prefixes
  • JIT‑compiled attention templates for custom variants
  • Dynamic load‑balanced scheduling for variable sequences
  • FP8 KV‑cache with mixed‑precision attention
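The block-sparse KV-cache idea can be sketched with a compressed-sparse-row style page table: one array of page indices for all requests, an offsets array marking each request's slice, and the number of valid tokens in each request's last page. The array names and the helper below are assumptions made for illustration; they follow the usual CSR conventions rather than any confirmed FlashInfer signature, and the page size here stands in for the column block size.

```python
from typing import List


def kv_slots(
    req: int,
    indptr: List[int],
    indices: List[int],
    last_page_len: List[int],
    page_size: int,
) -> List[int]:
    """Return the flat cache slot of every KV token belonging to one request."""
    pages = indices[indptr[req]:indptr[req + 1]]
    slots: List[int] = []
    for i, page in enumerate(pages):
        is_last = i == len(pages) - 1
        used = last_page_len[req] if is_last else page_size
        slots.extend(page * page_size + offset for offset in range(used))
    return slots


# Two requests over a cache with 4-token pages.
# Both start with page 0, which models a shared prefix stored only once.
page_size = 4
indptr = [0, 2, 5]           # request r owns indices[indptr[r]:indptr[r + 1]]
indices = [0, 3, 0, 2, 5]    # physical page ids, not necessarily contiguous
last_page_len = [2, 1]       # valid tokens in each request's final page

slots_req0 = kv_slots(0, indptr, indices, last_page_len, page_size)
slots_req1 = kv_slots(1, indptr, indices, last_page_len, page_size)
shared = [s for s in slots_req0 if s in slots_req1]
```

In this layout a request's KV tokens do not need to be contiguous in memory, and sharing a prefix amounts to listing the same page id in more than one request's slice.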

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Forward‑only attention kernels; no training/backward support yet.
  • Designed and evaluated on NVIDIA GPUs (CUDA/CUTLASS); portability to other hardware is future work.
  • Sparse gathering cannot use Hopper TMA, leaving a ~10% prefill performance gap for some sparse layouts.
  • Requires precomputed upper bounds for workspace sections; incorrect sizing can cause errors or wasted memory.

When Not To Use

  • When you need attention backward kernels for training today.
  • On non‑NVIDIA hardware or backends where CUTLASS/TMA are unavailable.
  • For tiny models or extremely short sequences where kernel overhead outweighs benefits.

Failure Modes

  • Workspace underallocation if planner upper bounds are too low, causing runtime failures.
  • Performance regressions when sparse gathering falls back to less efficient copies (e.g., on certain FlashAttention‑3 (FA3) paths).
  • Integration overhead on the Python side (e.g., in vLLM) can mask kernel gains until that logic moves to C++ or onto the device.

Core Entities

Models

  • Llama 3.1 8B
  • Llama 3.1 70B
  • Vicuna-13B

Metrics

  • Inter‑Token Latency (ITL)
  • Time‑To‑First‑Token (TTFT)
  • Throughput (tokens/s)
  • TFLOPs / bandwidth utilization

Datasets

  • ShareGPT
  • MT-Bench
  • Synthetic variable workload (512–2048)

Benchmarks

  • LLM serving benchmark (ITL/TTFT)
  • AttentionGym (variant tests)
  • Streaming‑LLM long‑context benchmark