LazyLLM: compute KV only for important tokens to speed up long-context LLMs

July 19, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The approach is simple, works with off-the-shelf checkpoints, and shows consistent wall-time improvements across tasks. A code release is not stated, which lowers immediate reproducibility.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi

Links

Abstract / PDF / Data

Why It Matters For Business

LazyLLM cuts the time-to-first-token on long prompts and lowers total token computation without model retraining, reducing latency and inference costs for long-context applications.

Summary TLDR

LazyLLM speeds up inference on long prompts by computing the key-value (KV) cache only for tokens judged important for the next token. It prunes tokens progressively across transformer layers, allows pruned tokens to be revived later using an auxiliary cache, and needs no fine-tuning. On LongBench with Llama 2 7B and XGen 7B, it speeds up time-to-first-token (TTFT) on multi-document QA by up to ~2.3–2.7× while keeping accuracy nearly unchanged, and it often reduces the share of prompt tokens computed, to ~64% in multi-doc QA.

Problem Statement

Prefilling a long prompt requires computing KV cache for every token, which makes the time-to-first-token (TTFT) slow and can become a bottleneck. The paper asks whether many prompt tokens can be skipped during prefilling without harming the first-token prediction, and how to skip them dynamically without repeated recomputation.

Main Contribution

A dynamic token-pruning method (LazyLLM) that computes KV only for tokens important to the next-token prediction and defers others.

A progressive, layer-wise pruning schedule that keeps more tokens in earlier layers and prunes more in later layers.
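
The two contributions above can be illustrated with a minimal, dependency-free sketch. It assumes (this summary does not spell it out) that a token's importance is the attention weight the last prompt token assigns to it, averaged over heads, and that each pruning layer keeps a top fraction of the tokens still active. The function names, the `keep_ratio` parameter, and the example scores are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def token_importance(last_token_attn: Sequence[Sequence[float]]) -> List[float]:
    """Average the last token's attention weights over heads.

    last_token_attn[h][i] is the attention probability that the last prompt
    token assigns to token i in head h (an assumed scoring rule).
    """
    num_heads = len(last_token_attn)
    num_tokens = len(last_token_attn[0])
    return [
        sum(head[i] for head in last_token_attn) / num_heads
        for i in range(num_tokens)
    ]


def select_tokens(active: Sequence[int], scores: Sequence[float],
                  keep_ratio: float) -> List[int]:
    """Keep the top `keep_ratio` fraction of the active tokens by score.

    scores[j] belongs to active[j]. The last token is always kept because it
    is the one that predicts the next token, so the result may hold one more
    token than the ratio alone would give.
    """
    n_keep = max(1, math.ceil(keep_ratio * len(active)))
    ranked = sorted(range(len(active)), key=lambda j: scores[j], reverse=True)
    kept = {active[j] for j in ranked[:n_keep]}
    kept.add(active[-1])
    return sorted(kept)


def progressive_prune(scores_per_layer: Sequence[Sequence[float]],
                      keep_ratios: Sequence[float]) -> List[List[int]]:
    """Apply one pruning decision per layer and return the active set after each."""
    active = list(range(len(scores_per_layer[0])))
    history = []
    for layer_scores, ratio in zip(scores_per_layer, keep_ratios):
        active_scores = [layer_scores[t] for t in active]
        active = select_tokens(active, active_scores, ratio)
        history.append(active)
    return history


# Made-up importance scores for 8 prompt tokens at 3 pruning layers.
EXAMPLE_SCORES = [
    [0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.30],
    [0.30, 0.02, 0.10, 0.01, 0.22, 0.05, 0.12, 0.18],
    [0.40, 0.00, 0.05, 0.00, 0.25, 0.03, 0.07, 0.20],
]
# Keep everything early, prune harder in later layers.
EXAMPLE_RATIOS = [1.0, 0.75, 0.5]

if __name__ == "__main__":
    for layer, kept in enumerate(progressive_prune(EXAMPLE_SCORES, EXAMPLE_RATIOS)):
        print(f"layer {layer}: {len(kept)} tokens kept -> {kept}")
```

In this sketch each ratio is relative to the tokens still active at that layer, so the active set shrinks from 8 to 6 to 3 tokens. The sketch covers a single decoding step; in the method described here, a token pruned at one step can be revived at a later one.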

Key Findings

LazyLLM achieves 2.34× TTFT speedup on multi-document QA with negligible accuracy loss on Llama 2 7B.

Numbers: 2.34× TTFT; score 22.31 vs baseline 22.43 (-0.12)

Practical Use: If you serve Llama 2 on long multi-doc prompts, apply LazyLLM to cut time-to-first-token by ~2.3× while keeping accuracy effectively unchanged.

Evidence Ref: Table 1

LazyLLM reduces the cumulative prompt tokens computed to ~63.94% in multi-document QA, yielding additional end-to-end speedup.

Numbers: % prompt tokens computed = 63.94%; overall generation speedup = 1.56×

Practical Use: You will process fewer tokens overall (less compute and memory traffic), which can reduce total generation cost and improve throughput.

Evidence Ref: Table 2
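
To put the reported figures in proportion, the short calculation below derives the absolute and relative score change and the share of prompt tokens never computed. The input numbers are the ones quoted in the findings; the variable names are illustrative only.

```python
# Figures quoted in the findings (Llama 2 7B, multi-document QA).
baseline_score = 22.43
lazyllm_score = 22.31
prompt_tokens_computed_pct = 63.94

# Absolute score change relative to the baseline.
score_delta = lazyllm_score - baseline_score

# Score drop as a percentage of the baseline score.
relative_drop_pct = (baseline_score - lazyllm_score) / baseline_score * 100

# Share of prompt tokens that were never computed.
prompt_tokens_skipped_pct = 100 - prompt_tokens_computed_pct

if __name__ == "__main__":
    print(f"score delta: {score_delta:.2f}")
    print(f"relative drop: {relative_drop_pct:.2f}%")
    print(f"prompt tokens skipped: {prompt_tokens_skipped_pct:.2f}%")
```

The score drop is roughly half a percent of the baseline, while a little over a third of the prompt tokens are skipped.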

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
TTFT Speedup (multi-document QA, Llama 2 7B) | 2.34× | 1.00× | ×2.34 | LongBench multi-document QA (macro avg) | Table 1 shows LazyLLM TTFT speedup 2.34× and score 22.31 vs baseline 22.43 | Table 1
Overall generation speedup (multi-document QA, Llama 2 7B) | 1.56× | 1.00× | ×1.56 | LongBench multi-document QA | Table 2 reports overall generation speedup 1.56 and % prompt token computed 63.94 | Table 2

What To Try In 7 Days

Run LazyLLM as a wrapper in your inference stack on a few long-prompt workloads and measure TTFT.

Tune pruning layers and top-k percentile to find an accuracy-speed sweet spot on your tasks.

Enable Aux Cache and verify worst-case latency matches baseline to avoid regressions.
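
For the measurement steps above, a small timing harness is enough to compare a baseline against a pruned configuration. The sketch below assumes your inference stack exposes some streaming call that yields tokens as they are produced; `stream_fn`, `fake_stream`, and the delay values are hypothetical stand-ins, not part of LazyLLM or any specific library.

```python
import time
from statistics import median
from typing import Callable, Iterable, Iterator


def measure_ttft(stream_fn: Callable[[str], Iterable[str]], prompt: str,
                 repeats: int = 5) -> float:
    """Return the median time-to-first-token in seconds over `repeats` runs.

    `stream_fn(prompt)` must return an iterable that yields generated tokens.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        stream = iter(stream_fn(prompt))
        next(stream)  # blocks until the first token is available
        samples.append(time.perf_counter() - start)
    return median(samples)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """TTFT speedup of the candidate over the baseline (higher is better)."""
    return baseline_seconds / candidate_seconds


def fake_stream(prefill_delay: float) -> Callable[[str], Iterator[str]]:
    """Stand-in for a real model: sleeps to imitate prefill, then yields a token."""
    def stream(prompt: str) -> Iterator[str]:
        time.sleep(prefill_delay)
        yield "token"
    return stream


if __name__ == "__main__":
    long_prompt = "lorem ipsum " * 1000
    base = measure_ttft(fake_stream(0.05), long_prompt)
    pruned = measure_ttft(fake_stream(0.02), long_prompt)
    print(f"baseline {base:.3f}s, pruned {pruned:.3f}s, speedup {speedup(base, pruned):.2f}x")
```

Using the median rather than the mean keeps a single slow run (for example, a cold start) from dominating the comparison. Run the same prompts through both configurations and track task accuracy alongside the timing.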

Optimization Features

Token Efficiency
reduces % prompt tokens computed (e.g., to 63.94% in multi-doc QA)
saves attention and FFN work by skipping tokens
System Optimization
plug-and-play with existing checkpoints; no finetuning required
Inference Optimization
dynamic token pruning during prefilling and decoding
progressive layer-wise pruning (keep more tokens early)
Aux Cache to avoid re-computing pruned tokens
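
One way to picture the Aux Cache is as a store that remembers where a pruned token stopped, so a later revival resumes from that layer instead of starting over. The toy sketch below assumes the cache holds the token's hidden state at the point of pruning; the class, its methods, and the stand-in layer function are hypothetical and only illustrate the bookkeeping.

```python
from typing import Dict, List, Optional, Tuple


class AuxCache:
    """Maps a pruned token's index to (next_layer, hidden_state)."""

    def __init__(self) -> None:
        self._entries: Dict[int, Tuple[int, float]] = {}

    def save(self, token: int, next_layer: int, hidden: float) -> None:
        """Remember the hidden state of a token pruned before `next_layer`."""
        self._entries[token] = (next_layer, hidden)

    def restore(self, token: int) -> Optional[Tuple[int, float]]:
        """Remove and return the saved entry, or None if the token was never pruned."""
        return self._entries.pop(token, None)


def toy_layer(token: int, layer: int, hidden: float,
              log: List[Tuple[int, int]]) -> float:
    """Stand-in for a transformer layer; records every (token, layer) computation."""
    log.append((token, layer))
    return hidden + 1.0


def forward_token(token: int, hidden: float, start_layer: int, stop_layer: int,
                  log: List[Tuple[int, int]]) -> float:
    """Run the token through layers start_layer .. stop_layer - 1."""
    for layer in range(start_layer, stop_layer):
        hidden = toy_layer(token, layer, hidden, log)
    return hidden


def demo(num_layers: int = 4) -> Tuple[float, List[Tuple[int, int]], AuxCache]:
    cache = AuxCache()
    log: List[Tuple[int, int]] = []
    token = 3

    # Step 1: the token runs through layers 0 and 1, then is pruned before layer 2.
    hidden = forward_token(token, 0.0, 0, 2, log)
    cache.save(token, next_layer=2, hidden=hidden)

    # Step 2: a later decoding step selects the token again; resume where it stopped.
    next_layer, hidden = cache.restore(token)
    final = forward_token(token, hidden, next_layer, num_layers, log)
    return final, log, cache


if __name__ == "__main__":
    final, log, _ = demo()
    print(final, log)
```

Because the revived token resumes at the layer where it was pruned, no (token, layer) pair is computed twice in this sketch, which is the property that keeps worst-case work from exceeding the unpruned baseline.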

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires careful tuning of pruning layers and top-k thresholds to balance speed and accuracy.

Effectiveness varies by task; summarization shows near-full token usage and little overall speedup.

When Not To Use

When prompts are short or tasks need nearly all tokens (e.g., some summarization cases).

When you cannot afford experimentation to tune pruning parameters.

Failure Modes

Over-aggressive pruning can cause notable accuracy drops (authors report up to ~10% loss at high speed settings).

If Aux Cache is mismanaged, revival of tokens may introduce overhead and negate speed gains.

Core Entities

Models

Llama 2 7B, XGen 7B

Metrics

TTFT Speedup, Overall Generation Speedup, Percent Prompt Token Computed, ROUGE-L, F1, Accuracy, Edit Sim

Datasets

LongBench (multi-task benchmark with 16 datasets)

Benchmarks

LongBench