Prune far-away masks and stop confident tokens early to make diffusion LLMs much faster at inference

January 25, 2026 · 7 min

Overview

Production Readiness

0.75

Novelty Score

0.65

Cost Impact Score

0.85

Citation Count

0

Authors

Zhongyu Xiao, Zhiwei Hao, Jianyuan Guo, Yong Luo, Jia Liu, Jie Xu, Han Hu

Why It Matters For Business

Streaming-dLLM cuts inference compute and latency for diffusion LLMs without retraining, with reported per-sample latency up to 85.5% lower than Fast-dLLM. That reduces cloud GPU cost and improves responsiveness for production services that use dLLMs for long or batch generation.

Summary TLDR

Streaming-dLLM is a training-free inference method for diffusion LLMs that (1) prunes most of the uninformative suffix tokens, keeping a small sliding window plus a trailing positional cue, and (2) applies adaptive confidence-based parallel decoding with early exit. Across multiple dLLM backbones and benchmarks, the method yields large throughput and latency gains (for example, 68.2× on MBPP with LLaDA-1.5) while keeping accuracy close to, or slightly better than, the baselines on the evaluated tasks.

Problem Statement

Diffusion LLMs decode many masked tokens in parallel and repeatedly attend to a long suffix of mostly uninformative masks. This wastes compute spatially (attending to redundant suffix tokens) and temporally (fixed thresholds force repeated denoising for already-converged tokens).

Main Contribution

Attenuation-guided suffix modeling: keep only a small sliding window of nearby suffix blocks plus a trailing position token to approximate global structure and cut attention cost.

Dynamic confidence-aware parallel decoding: adapt the acceptance threshold during block denoising so high-confidence tokens finalize earlier.

Early-exit for block diffusion: stop decoding remaining blocks when an EOS prediction is reached with high confidence.

A training-free end-to-end framework (Streaming-dLLM) that plugs into existing dLLMs and significantly raises throughput and lowers latency on real benchmarks.
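The suffix-pruning idea in the first item can be pictured as an index selection over the sequence. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `retained_positions`, the prompt-then-blocks layout, and the choice to use the very last position as the trailing cue are all hypothetical.

```python
def retained_positions(prompt_len, gen_len, block_size, current_block, window_blocks):
    """Return the positions attended to while denoising `current_block`.

    Assumed layout (hypothetical): `prompt_len` prompt tokens followed by
    `gen_len` generation slots split into blocks of `block_size`.
    """
    total = prompt_len + gen_len
    block_start = prompt_len + current_block * block_size
    if current_block < 0 or block_start >= total:
        raise ValueError("current_block is out of range")
    block_end = min(block_start + block_size, total)

    # Prefix: the prompt plus every block that has already been decoded.
    prefix = list(range(0, block_start))
    # The block that is currently being denoised.
    current = list(range(block_start, block_end))
    # Sliding window: only the next `window_blocks` blocks of masked suffix.
    window_end = min(block_end + window_blocks * block_size, total)
    window = list(range(block_end, window_end))
    # Trailing positional cue: the last position, if the window did not reach it.
    trailing = [total - 1] if window_end < total else []
    return prefix + current + window + trailing


kept = retained_positions(prompt_len=4, gen_len=16, block_size=4,
                          current_block=0, window_blocks=1)
print(len(kept), "of", 4 + 16, "positions kept")
```

In this toy setting the first block attends to 13 of 20 positions; the saving grows with generation length because the pruned suffix grows while the window stays fixed.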

Key Findings

Large throughput gains while preserving task accuracy.

Numbers: 68.2× speedup on MBPP with LLaDA-1.5 (generation length 512); accuracy 38.4%

Extreme speedups for very long outputs.

Numbers: Up to 225.3× speedup at generation length 2048 (LLaDA-1.5 on GSM8K, as reported)

Per-sample latency drops substantially vs prior acceleration.

Numbers: Up to 85.5% reduction in per-sample inference latency vs Fast-dLLM (reported across settings)

All components contribute; combined gives the largest gain.

Numbers: LLaDA-1.5 throughput on GSM8K (512): 25.8 → 69.8 tok/s when enabling suffix + dynamic + early-exit (~2.7×)

Trailing positional info matters for quality.

Numbers: Accuracy drops when the trailing position is removed (e.g., LLaDA-1.5 81.2% → 79.6%)
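The findings above mix two kinds of ratio: throughput multipliers and percentage latency reductions. A short calculation puts them on the same scale. It assumes both runs generate the same number of tokens per sample, which the source does not state, and the function names are illustrative only.

```python
def speedup_from_latency_reduction(reduction):
    """Convert a fractional latency reduction (0.855 for 85.5%) to a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def speedup_from_throughput(baseline_tps, new_tps):
    """Ratio of new to baseline throughput, both in tokens per second."""
    return new_tps / baseline_tps


# 85.5% lower latency vs Fast-dLLM corresponds to roughly a 6.9x speedup.
print(round(speedup_from_latency_reduction(0.855), 1))  # 6.9
# The 25.8 -> 69.8 tok/s ablation figure is roughly 2.7x.
print(round(speedup_from_throughput(25.8, 69.8), 2))    # 2.71
```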

Results

Metric: throughput (tokens/s)
Value: 68.2×
Baseline: vanilla LLaDA-1.5

Metric: throughput (tokens/s)
Value: 225.3×
Baseline: vanilla Dream (or LLaDA variant) at long context

Metric: inference latency (s per sample)
Value: up to 85.5% reduction
Baseline: Fast-dLLM

Metric: Accuracy
Value: comparable
Baseline: vanilla dLLM backbones

Who Should Care

What To Try In 7 Days

Run the provided GitHub code on one dLLM backbone (e.g., LLaDA-1.5) and one benchmark to reproduce throughput/latency gains.

Enable suffix pruning (sliding window + trailing position) with recommended window sizes from Table 11 and compare tokens/s and accuracy.

Tune the adaptive threshold alpha (start 0.3–0.6) to trade throughput vs stability for your task.
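For the comparison and tuning steps above, a small timing harness is enough to get started. This is a sketch under assumptions: `generate` and `make_generate` are hypothetical callables that you would wrap around your own model (they are not part of the released code), and `generate(prompt)` is assumed to return the list of generated tokens.

```python
import time


def measure_throughput(generate, prompts):
    """Return tokens per second for `generate` over a list of prompts."""
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def sweep_alpha(make_generate, prompts, alphas=(0.3, 0.4, 0.5, 0.6)):
    """Measure throughput for each alpha; `make_generate(alpha)` builds a generator."""
    return {alpha: measure_throughput(make_generate(alpha), prompts)
            for alpha in alphas}


if __name__ == "__main__":
    # Stand-in generator so the harness runs without a model.
    def make_fake_generate(alpha):
        def generate(prompt):
            time.sleep(0.001)
            return prompt.split()
        return generate

    print(sweep_alpha(make_fake_generate, ["a b c", "d e f g"]))
```

On a GPU, wall-clock timing is only meaningful if pending kernels have finished before the timer is read (in PyTorch, by calling `torch.cuda.synchronize()`), and accuracy should be recorded alongside tokens/s for every alpha.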

Optimization Features

Token Efficiency

  • Finalize high-confidence tokens early to avoid further denoising
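The idea of finalizing confident tokens early can be sketched as a per-step selection rule. The source names a base threshold τ0 and a parameter alpha but does not give the schedule, so the linear relaxation below, the default `tau0=0.9`, and both function names are assumptions made purely for illustration.

```python
def adaptive_threshold(tau0, alpha, finalized_fraction):
    """Hypothetical schedule: relax the threshold as more of the block is finalized."""
    return tau0 * (1.0 - alpha * finalized_fraction)


def select_tokens_to_finalize(confidences, is_masked, tau0=0.9, alpha=0.5):
    """Return indices of masked tokens to finalize in this denoising step."""
    n = len(confidences)
    masked = [i for i in range(n) if is_masked[i]]
    if not masked:
        return []
    finalized_fraction = 1.0 - len(masked) / n
    tau = adaptive_threshold(tau0, alpha, finalized_fraction)
    chosen = [i for i in masked if confidences[i] >= tau]
    if not chosen:
        # Always finalize the single most confident token so decoding progresses.
        chosen = [max(masked, key=lambda i: confidences[i])]
    return chosen


conf = [0.95, 0.5, 0.8, 0.7]
mask = [True, True, True, True]
step = 0
while any(mask):
    picked = select_tokens_to_finalize(conf, mask)
    for i in picked:
        mask[i] = False
    step += 1
    print("step", step, "finalized", picked)
```

Under this assumed schedule a larger alpha lowers the threshold faster, which finalizes more tokens per step at the cost of a higher risk of premature commitment.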

Infra Optimization

  • Reported on a single NVIDIA A800 80GB GPU; throughput measured in tokens/s

System Optimization

  • Reuse KV for prefix across block iterations

Inference Optimization

  • Suffix pruning (sliding window of nearby suffix blocks)
  • Trailing positional token to preserve global order
  • Adaptive confidence thresholding for parallel decoding
  • Early-exit on confident EOS
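The last item, early exit on a confident EOS, amounts to a guard in the block loop. The sketch below assumes a hypothetical `decode_block(b)` callable that returns the finalized tokens of block `b` with their confidences; the `eos_threshold` default of 0.9 is likewise an illustrative value, not one taken from the source.

```python
def decode_with_early_exit(decode_block, num_blocks, eos_id, eos_threshold=0.9):
    """Decode block by block; stop once EOS appears with high confidence.

    Returns (tokens, blocks_used).
    """
    output = []
    for b in range(num_blocks):
        tokens, confidences = decode_block(b)
        for tok, conf in zip(tokens, confidences):
            output.append(tok)
            if tok == eos_id and conf >= eos_threshold:
                # Confident EOS: skip all remaining blocks.
                return output, b + 1
    return output, num_blocks


# Toy decoder: EOS (id 0) appears with confidence 0.95 in the second block.
fake_blocks = {
    0: ([5, 6], [0.90, 0.90]),
    1: ([7, 0], [0.80, 0.95]),
    2: ([0, 0], [0.99, 0.99]),
}
tokens, used = decode_with_early_exit(lambda b: fake_blocks[b], 3, eos_id=0)
print(tokens, used)  # [5, 6, 7, 0] 2
```

A low-confidence EOS does not trigger the exit, which is the safeguard against truncated outputs when EOS confidence is poorly calibrated.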

Reproducibility

Data Urls

  • HumanEval, GSM8K, MBPP, MATH (standard public benchmarks)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Method targets block-wise diffusion LLMs; not applicable to standard autoregressive models without adaptation.
  • Requires task-specific hyperparameter tuning (window size w, base threshold τ0, alpha).
  • Quality may degrade if the retained suffix window omits distant tokens that are semantically important.
  • Reported experiments use a single A800 GPU; multi-GPU/distributed behavior is not evaluated here.

When Not To Use

  • When the suffix carries important fine-grained semantic cues that must be attended to at every step.
  • For short-generation tasks (short output sequences), where suffix pruning gives little benefit.
  • If you cannot afford the validation needed to tune adaptive thresholding for your domain.

Failure Modes

  • Over-aggressive window reduction (w too small) causes accuracy drop.
  • Setting alpha too high (aggressive parallelism) causes premature finalization and unstable decoding.
  • Early-exit triggered prematurely may truncate outputs if EOS confidence is miscalibrated.

Core Entities

Models

  • Dream-v0-7B-Base
  • LLaDA-8B-Instruct
  • LLaDA-1.5

Metrics

  • throughput (tokens/s)
  • inference latency (s per sample)
  • Accuracy

Datasets

  • HumanEval
  • GSM8K
  • MBPP
  • MATH

Benchmarks

  • HumanEval
  • GSM8K
  • MBPP
  • MATH

Context Entities

Models

  • Fast-dLLM
  • dKV-Cache
  • Prefix-Cache