LazyLLM: compute KV only for important tokens to speed up long-context LLMs

Overview

Decision Snapshot: Ready For Pilot

The approach is simple, works with off-the-shelf checkpoints, and shows consistent wall-time improvements across tasks. No code release is stated, which lowers immediate reproducibility.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi

Why It Matters For Business

LazyLLM cuts the time-to-first-token on long prompts and lowers total token computation without model retraining, reducing latency and inference costs for long-context applications.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, CTO, Founder

Summary TLDR

LazyLLM speeds up inference on long prompts by computing the key-value (KV) cache only for tokens judged important for predicting the next token. It prunes tokens progressively across transformer layers, lets pruned tokens be revived later through an auxiliary cache (Aux Cache), and needs no fine-tuning. On LongBench with Llama 2 7B and XGen 7B, it speeds up time-to-first-token (TTFT) on multi-document QA by up to ~2.3–2.7× with nearly unchanged accuracy, and it reduces the share of prompt tokens computed to ~64% on multi-document QA (Llama 2 7B).
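
To make the selection step concrete, here is a minimal Python sketch of picking the "important" tokens. It assumes, and this summary does not state it, that a token's importance is the attention probability the last prompt token assigns to it, averaged over heads, and that a fixed fraction of tokens is kept. The names `select_tokens` and `keep_fraction` are hypothetical and are not taken from any released code.

```python
import numpy as np

def select_tokens(attn_last, keep_fraction):
    """Return sorted indices of prompt tokens to keep (illustrative only).

    attn_last: array of shape (num_heads, seq_len) holding the attention
               probabilities from the last prompt token to each prompt token.
               Using this as the importance score is an assumption.
    keep_fraction: fraction of tokens to keep, in (0, 1].
    """
    importance = attn_last.mean(axis=0).copy()   # average over heads
    seq_len = importance.shape[0]
    k = max(1, int(np.ceil(keep_fraction * seq_len)))
    # Always keep the last token: it is the one that predicts the next token.
    importance[-1] = np.inf
    keep = np.argpartition(-importance, k - 1)[:k]
    return np.sort(keep)

# Toy input: 2 heads, 6 prompt tokens (made-up attention values).
attn = np.array([
    [0.05, 0.40, 0.05, 0.10, 0.10, 0.30],
    [0.15, 0.30, 0.05, 0.20, 0.10, 0.20],
])
kept = select_tokens(attn, keep_fraction=0.5)
```

Only the tokens in `kept` would have their KV entries computed at this layer; the rest are deferred rather than discarded.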

Problem Statement

Prefilling a long prompt requires computing KV cache for every token, which makes the time-to-first-token (TTFT) slow and can become a bottleneck. The paper asks whether many prompt tokens can be skipped during prefilling without harming the first-token prediction, and how to skip them dynamically without repeated recomputation.

Main Contribution

A dynamic token-pruning method (LazyLLM) that computes KV only for tokens important to the next-token prediction and defers others.

A progressive, layer-wise pruning schedule that keeps more tokens in earlier layers and prunes more in later layers.
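
The progressive schedule can be illustrated with a small sketch that counts how many tokens remain at each layer. The schedule format (a mapping from pruning layer to keep fraction), the function names, and the example values are all assumptions for illustration. The sketch also ignores token revival, so its output is not the same metric as the "% prompt tokens computed" figure reported in the paper.

```python
def tokens_per_layer(seq_len, num_layers, schedule):
    """Count surviving tokens at each layer.

    schedule: dict {layer_index: keep_fraction}; the fraction is applied to
              the tokens still alive when that layer is reached (hypothetical
              format, chosen for this sketch).
    """
    counts = []
    alive = seq_len
    for layer in range(num_layers):
        if layer in schedule:
            alive = max(1, int(alive * schedule[layer]))
        counts.append(alive)
    return counts

def layer_token_work_fraction(counts, seq_len):
    """Share of per-layer token computations relative to no pruning."""
    return sum(counts) / (seq_len * len(counts))

# Example: 4 layers, pruning at layers 1 and 3 (illustrative values).
counts = tokens_per_layer(seq_len=1000, num_layers=4, schedule={1: 0.5, 3: 0.5})
work = layer_token_work_fraction(counts, seq_len=1000)
```

Because early layers keep every token and only later layers see the reduced set, the saved work grows with how deep in the stack the pruning layers sit and how aggressive each keep fraction is.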

Key Findings

LazyLLM achieves 2.34× TTFT speedup on multi-document QA with negligible accuracy loss on Llama 2 7B.

Numbers: 2.34× TTFT; score 22.31 vs baseline 22.43 (Δ -0.12)

Practical Use: If you serve Llama 2 on long multi-doc prompts, apply LazyLLM to cut time-to-first-token by ~2.3× while keeping accuracy effectively unchanged.

Evidence Ref: Table 1

LazyLLM reduces the cumulative prompt tokens computed to ~63.94% in multi-document QA, yielding additional end-to-end speedup.

Numbers: % prompt tokens computed = 63.94%; overall generation speedup = 1.56×

Practical Use: You will process fewer tokens overall (less compute and memory traffic), which can reduce total generation cost and improve throughput.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
TTFT Speedup (multi-document QA, Llama 2 7B)	2.34×	1.00×	×2.34	LongBench multi-document QA (macro avg)	Table 1 shows LazyLLM TTFT speedup 2.34× and score 22.31 vs baseline 22.43	Table 1
Overall generation speedup (multi-document QA, Llama 2 7B)	1.56×	1.00×	×1.56	LongBench multi-document QA	Table 2 reports overall generation speedup 1.56 and % prompt token computed 63.94	Table 2
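
For planning purposes it can help to convert the speedups above into latency reductions. The short calculation below uses only the numbers in the table; the helper name `latency_reduction` is ours.

```python
def latency_reduction(speedup):
    """Fraction of latency removed by a given speedup factor."""
    return 1.0 - 1.0 / speedup

ttft_cut = latency_reduction(2.34)     # about 0.573, i.e. ~57% lower TTFT
gen_cut = latency_reduction(1.56)      # about 0.359, i.e. ~36% lower total time
score_delta = 22.31 - 22.43            # about -0.12 points
relative_drop = score_delta / 22.43    # about -0.005, i.e. ~0.5% relative
```

In other words, a 2.34× TTFT speedup means the first token arrives in a bit under half the baseline time, at a relative score cost of roughly half a percent on this task.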

What To Try In 7 Days

Run LazyLLM as a wrapper in your inference stack on a few long-prompt workloads and measure TTFT.

Tune pruning layers and top-k percentile to find an accuracy-speed sweet spot on your tasks.

Enable Aux Cache and verify worst-case latency matches baseline to avoid regressions.
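
A minimal timing harness for the TTFT measurements suggested above might look like the following sketch. It assumes your inference stack exposes a streaming call that yields tokens as they are produced. `stream_fn` and `fake_stream` are hypothetical stand-ins, not part of LazyLLM or any specific library.

```python
import time
from statistics import median

def measure_ttft(stream_fn, prompt, repeats=5):
    """Time from request start to first generated token.

    stream_fn(prompt) must return an iterator that yields tokens.
    Returns the median and the worst case over `repeats` runs, in seconds.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        stream = iter(stream_fn(prompt))
        next(stream)                      # blocks until the first token arrives
        samples.append(time.perf_counter() - start)
    return {"median": median(samples), "max": max(samples)}

def fake_stream(prompt):
    # Stand-in for a real model: "prefill" cost grows with prompt length.
    time.sleep(0.0001 * len(prompt.split()))
    yield "first"
    yield "second"

stats = measure_ttft(fake_stream, "a long prompt with several words", repeats=3)
```

Run the same harness against the baseline and the pruned configuration on identical prompts, and compare both the median and the maximum so that a worst-case regression does not hide behind a good average.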

Optimization Features

Token Efficiency

reduces % prompt tokens computed (e.g., to 63.94% in multi-doc QA)

saves attention and FFN work by skipping tokens

System Optimization

plug-and-play with existing checkpoints; no finetuning required

Inference Optimization

dynamic token pruning during prefilling and decoding

progressive layer-wise pruning (keep more tokens early)

Aux Cache to avoid re-computing pruned tokens
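
One way to picture the Aux Cache is as a side store holding the hidden state of each pruned token at the layer where it was dropped, so that a later revival can resume from that state instead of recomputing earlier layers. The sketch below is an assumption about how such a store could be organized. The class name, its methods, and the `(layer, token_index)` keying are hypothetical.

```python
class AuxCache:
    """Illustrative store for hidden states of pruned tokens."""

    def __init__(self):
        self._store = {}   # (layer, token_index) -> hidden state

    def save(self, layer, token_index, hidden):
        """Remember the hidden state of a token pruned at `layer`."""
        self._store[(layer, token_index)] = hidden

    def revive(self, layer, token_index):
        """Return and remove the stored state, or None if nothing was saved."""
        return self._store.pop((layer, token_index), None)

    def __len__(self):
        return len(self._store)

cache = AuxCache()
cache.save(layer=3, token_index=17, hidden=[0.1, 0.2])
state = cache.revive(layer=3, token_index=17)   # the saved hidden state
missing = cache.revive(layer=3, token_index=17) # None: already revived
```

With a store like this, each token passes through each layer at most once, which is what keeps the worst case from exceeding the unpruned baseline.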

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/THUDM/LongBench

https://github.com/huggingface/transformers/

Risks & Boundaries

Limitations

Requires careful tuning of pruning layers and top-k thresholds to balance speed and accuracy.

Effectiveness varies by task; summarization shows near-full token usage and little overall speedup.
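
Since tuning is needed per task, a simple selection rule can keep the search disciplined: among the configurations you have measured, take the fastest one whose score stays within a tolerated drop from the baseline. The sketch below shows that rule. The function name, the record fields, and the example numbers are placeholders for illustration, not measurements.

```python
def pick_config(results, baseline_score, max_drop):
    """Pick the fastest configuration within an accuracy budget.

    results: list of dicts with keys 'config', 'score', 'ttft_speedup'.
    Returns the chosen dict, or None if no configuration qualifies.
    """
    ok = [r for r in results if baseline_score - r["score"] <= max_drop]
    if not ok:
        return None
    return max(ok, key=lambda r: r["ttft_speedup"])

# Placeholder sweep results (illustrative values only).
results = [
    {"config": "mild",       "score": 22.4, "ttft_speedup": 1.5},
    {"config": "moderate",   "score": 22.3, "ttft_speedup": 2.3},
    {"config": "aggressive", "score": 20.0, "ttft_speedup": 3.0},
]
best = pick_config(results, baseline_score=22.43, max_drop=0.2)
```

For tasks such as summarization, where nearly all tokens end up being used, expect this rule to land on a mild setting or to return no qualifying configuration at all.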

When Not To Use

When prompts are short or tasks need nearly all tokens (e.g., some summarization cases).

When you cannot afford experimentation to tune pruning parameters.

Failure Modes

Over-aggressive pruning can cause notable accuracy drops (authors report up to ~10% loss at high speed settings).

If Aux Cache is mismanaged, revival of tokens may introduce overhead and negate speed gains.

Core Entities

Models

Llama 2 7B, XGen 7B

Metrics

TTFT Speedup, Overall Generation Speedup, Percent Prompt Token Computed, ROUGE-L, F1, Accuracy, Edit Sim

Datasets

LongBench (multi-task benchmark, 16 datasets)

Benchmarks

LongBench
