Falcon: a semi‑autoregressive drafter + decoding tree that yields ~3× lossless LLM decoding speedup

December 17, 2024 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

0

Authors

Xiangxiang Gao, Weisheng Xie, Yiwei Xiang, Feng Ji

Why It Matters For Business

Falcon can speed up greedy LLM inference by about 3× using a small extra drafter, enabling faster real-time responses on constrained hardware without retraining the full LLM.

Summary TLDR

Falcon is a semi‑autoregressive speculative decoding system that trains a tiny drafter (roughly two Transformer layers) with Coupled Sequential Glancing Distillation (CSGD) and runs a custom decoding tree to propose several tokens per forward pass. On Vicuna and LLaMA2-Chat the authors report lossless speedups of about 2.91×–3.51× over standard autoregressive decoding while keeping acceptance rates high (~74–80%). Falcon adds a small, trainable drafter in exchange for fewer memory-bound reads of the target model's parameters, which cuts wall-clock latency for greedy decoding.

Problem Statement

Autoregressive decoding of large models is slow and memory-bandwidth bound. Existing speculative decoders use either sequential (autoregressive) drafters, which are accurate but slow, or parallel drafters, which are faster but less accurate because they fail to capture dependencies among tokens generated in the same block. The result is a tension between low drafting latency and high token acceptance by the main LLM.
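
To make the memory-bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes that at batch size 1 every decoding step must stream all model weights from GPU memory once, and it ignores compute and KV-cache traffic. The parameter count, bytes per parameter, and bandwidth below are illustrative placeholders, not figures from the paper.

```python
def min_step_latency_ms(num_params: float, bytes_per_param: int,
                        bandwidth_gb_s: float) -> float:
    """Lower bound on one decoding step if every weight is read once."""
    weight_bytes = num_params * bytes_per_param
    seconds = weight_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Hypothetical setup: 7e9 parameters in fp16 (2 bytes each),
# 2000 GB/s of memory bandwidth (an assumed round number).
latency = min_step_latency_ms(7e9, 2, 2000.0)
print(f"{latency:.1f} ms per decoding step")  # 7.0 ms per decoding step
```

Under this simplified model the cost of a step barely depends on how many tokens it scores, which is why verifying several drafted tokens in a single pass can pay off.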

Main Contribution

Falcon: an enhanced semi‑autoregressive (SAR) speculative decoding framework that raises drafter parallelism and draft quality.

Coupled Sequential Glancing Distillation (CSGD): a training approach that strengthens inter-token dependence inside each drafted block.

Custom-designed decoding tree: organizes multi-token drafts and supports multiple forward passes to raise acceptance and speed.
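
As a rough illustration of how a draft tree can be verified in one pass, the sketch below builds a tree attention mask from parent pointers: each node may attend only to itself and its ancestors. This is a generic construction, not Falcon's exact tree. Falcon organizes its tree around multi-token blocks, whereas this sketch treats every node as a single token, and the `parents` layout is a hypothetical example.

```python
from typing import List, Optional


def tree_attention_mask(parents: List[Optional[int]]) -> List[List[bool]]:
    """mask[i][j] is True when j is node i itself or one of its ancestors."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j: Optional[int] = i
        while j is not None:  # walk up to the root
            mask[i][j] = True
            j = parents[j]
    return mask


# Hypothetical tree: node 0 is the root, nodes 1 and 2 are its children,
# and node 3 is a child of node 1.
parents = [None, 0, 0, 1]
for row in tree_attention_mask(parents):
    print("".join("1" if allowed else "0" for allowed in row))
# 1000
# 1100
# 1010
# 1101
```

With such a mask, sibling branches never see each other, so several candidate continuations can share one forward pass of the target model.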

Empirical wins: lossless wall-time speedups of ~2.91×–3.51× on Vicuna / LLaMA2-Chat across MT-Bench, HumanEval, GSM8K with a compact drafter.

Key Findings

Falcon achieves a lossless wall-time speedup of roughly 2.91×–3.51× versus vanilla autoregressive decoding on evaluated models.

Numbers: speedup 2.91x–3.51x (Table 1; MT-Bench/HumanEval/GSM8K)
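
The "lossless" property under greedy decoding can be sketched as follows: a drafted token is kept only if it equals the target model's own greedy choice, so the output matches plain autoregressive decoding. This is a minimal single-chain sketch, not the authors' implementation. `target_next` and `toy_target` are hypothetical stand-ins for the target LLM, and a real system would score all draft positions in one batched forward pass rather than calling the model per token.

```python
from typing import Callable, List, Sequence


def verify_greedy(prefix: Sequence[int], draft: Sequence[int],
                  target_next: Callable[[List[int]], int]) -> List[int]:
    """Return the tokens emitted this round: the accepted draft prefix,
    followed by one token chosen by the target model itself."""
    emitted: List[int] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            emitted.append(expected)  # first mismatch: use the target's token
            return emitted
        emitted.append(tok)
        context.append(tok)
    emitted.append(target_next(context))  # whole draft accepted: bonus token
    return emitted


def toy_target(context: List[int]) -> int:
    """Hypothetical 'model': the next token is the last token plus one, mod 10."""
    return (context[-1] + 1) % 10


print(verify_greedy([1, 2], [3, 4, 9], toy_target))  # [3, 4, 5]
print(verify_greedy([1, 2], [3, 4], toy_target))     # [3, 4, 5]
```

In both calls the emitted tokens are exactly what the toy target would have produced on its own; the draft only changes how many target passes are needed.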

Falcon raises token acceptance rates to about 74%–80%, beating the SAR baseline Medusa by ~16 percentage points in the tested settings.

Numbers: Falcon α ≈ 74–80% vs. Medusa ≈ 61% (Table 2)

A compact drafter (size comparable to two Transformer layers) is sufficient to get the reported gains.

Numbers: drafter ≈ two Transformer layers (abstract & conclusion)

CSGD and tree attention both materially improve metrics: CSGD gave ~1.17× speedup uplift and +3.26% in acceptance; tree attention gave ~1.12× speedup uplift and +1.22% in acceptance (ablation).

Numbers: ablation: CSGD +1.17x speedup, +3.26% α; tree attention +1.12x speedup, +1.22% α (Table 4)

Results

Wall-time speedup (MT-Bench, greedy)

Value: V7B: 3.10x; V13B: 2.97x; LC13B: 3.11x

Baseline: vanilla AR decoding

Wall-time speedup (range across benchmarks)

Value: 2.91x–3.51x

Baseline: vanilla AR decoding

Acceptance rate (α)

Value: Falcon ≈ 74%–80% (varies by model/dataset)

Baseline: Medusa ≈ 61%; Eagle ≈ 72%–76%

Average acceptance length (τ)

Value: Falcon V7B τ = 3.34 (MT-Bench)

Baseline: Medusa V7B τ = 1.51
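
For readers who want to compute these two metrics from their own logs, here is a minimal sketch. Definitions vary between papers, so the ones used here are assumptions: α is taken as accepted draft tokens divided by proposed draft tokens, and τ as tokens emitted per target forward pass, counting the one token the target model contributes itself. The `rounds` log format is hypothetical.

```python
from typing import List, Tuple


def acceptance_stats(rounds: List[Tuple[int, int]]) -> Tuple[float, float]:
    """rounds holds one (num_drafted, num_accepted) pair per verification pass.

    Returns (alpha, tau) under the assumed definitions described above.
    """
    if not rounds:
        raise ValueError("need at least one verification round")
    drafted = sum(d for d, _ in rounds)
    accepted = sum(a for _, a in rounds)
    if drafted == 0:
        raise ValueError("no draft tokens were proposed")
    alpha = accepted / drafted
    tau = (accepted + len(rounds)) / len(rounds)  # +1 target token per pass
    return alpha, tau


# Hypothetical log: three passes, four draft tokens each.
alpha, tau = acceptance_stats([(4, 3), (4, 2), (4, 4)])
print(alpha, tau)  # 0.75 4.0
```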

Drafter size

Value: compact, roughly two Transformer layers plus hybrid LSTM head

Baseline: other drafters that use billions of parameters

Who Should Care

What To Try In 7 Days

Run the Falcon repo on a dev GPU and reproduce MT-Bench speedups with your LLM and batch size 1.

Train a compact SAR head (2-layer) on a small ShareGPT subset to test acceptance rate gains.

Tune k (the number of tokens drafted per forward pass) and tree shape: measure acceptance (α) and τ trade-offs for your task and latency budget.
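
Before running a full sweep, a simple analytic model can help reason about the trade-off. The sketch below uses the standard chain-draft estimate from the speculative decoding literature, which assumes each draft token is accepted independently with probability α; the speedup formula on top of it is a simplification introduced here. The names `draft_len`, `k`, and `drafter_cost` (drafter pass time relative to one target pass) are hypothetical, the model ignores the tree structure, and it does not capture the fact that larger k tends to lower α.

```python
import math


def expected_tokens_per_pass(alpha: float, draft_len: int) -> float:
    """Expected tokens emitted per target pass for a chain draft,
    assuming independent per-token acceptance probability alpha."""
    if alpha >= 1.0:
        return float(draft_len + 1)
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, draft_len: int, k: int,
                      drafter_cost: float) -> float:
    """Rough speedup over plain decoding when the drafter emits k tokens
    per pass and each drafter pass costs drafter_cost target passes."""
    drafter_passes = math.ceil(draft_len / k)
    return expected_tokens_per_pass(alpha, draft_len) / (1.0 + drafter_cost * drafter_passes)


print(expected_tokens_per_pass(0.5, 3))  # 1.875

# Sweep k at a fixed, assumed alpha and relative drafter cost.
for k in (1, 2, 4):
    print(k, round(estimated_speedup(alpha=0.75, draft_len=4, k=k, drafter_cost=0.05), 2))
```

Replacing the fixed `alpha` with values measured per k on your own workload turns this into a quick sanity check against observed wall-time numbers.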

Optimization Features

Token Efficiency

  • Higher token acceptance rate reduces wasted verification

Infra Optimization

  • Evaluated on an H800 server; designed for memory-bound GPUs

Model Optimization

  • Compact drafter design (two Transformer layers)

System Optimization

  • Reduces memory-bound parameter reads by batching draft/verify work

Training Optimization

  • Coupled Sequential Glancing Distillation (CSGD)
  • Data augmentation that adds uniform noise to features
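
The noise augmentation in the last bullet can be sketched as below. This is only an illustration: the summary does not state the noise range or where in the pipeline it is applied, so `scale` and the flat list-of-floats feature format are assumptions.

```python
import random
from typing import List, Optional


def add_uniform_noise(features: List[float], scale: float = 0.1,
                      rng: Optional[random.Random] = None) -> List[float]:
    """Return a copy of features with independent U(-scale, scale) noise added.

    scale=0.1 is an assumed placeholder, not a value taken from the paper.
    """
    if rng is None:
        rng = random.Random()
    return [x + rng.uniform(-scale, scale) for x in features]


# Hypothetical feature vector; seeding the generator makes runs repeatable.
clean = [0.5, -1.2, 3.0]
noisy = add_uniform_noise(clean, scale=0.1, rng=random.Random(0))
print(noisy)
```

Training the drafter on perturbed features is one way to make it more tolerant of the imperfect features it sees when it consumes its own predictions at inference time.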

Inference Optimization

  • Semi-autoregressive drafting (multi-token per forward)
  • Custom decoding tree with tree attention
  • Parallel draft verification (tree-based)
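
The first bullet can be illustrated with a toy drafting loop in which each drafter call returns a block of k tokens, so n tokens need about n/k drafter passes instead of n. This is a sketch of the general semi-autoregressive idea only. `draft_block` and `toy_block` are hypothetical stand-ins for a trained drafter, and the loop produces a single chain rather than Falcon's decoding tree.

```python
from typing import Callable, List, Sequence


def sar_draft(context: Sequence[int],
              draft_block: Callable[[List[int], int], List[int]],
              k: int, num_passes: int) -> List[int]:
    """Draft k * num_passes tokens, k at a time; each block is conditioned
    on the context plus all previously drafted blocks."""
    drafted: List[int] = []
    for _ in range(num_passes):
        block = draft_block(list(context) + drafted, k)
        if len(block) != k:
            raise ValueError("drafter must return exactly k tokens")
        drafted.extend(block)
    return drafted


def toy_block(ctx: List[int], k: int) -> List[int]:
    """Hypothetical drafter: continue counting upward from the last token."""
    last = ctx[-1]
    return [(last + i + 1) % 100 for i in range(k)]


# Six draft tokens from three drafter passes (k = 2).
print(sar_draft([7], toy_block, k=2, num_passes=3))  # [8, 9, 10, 11, 12, 13]
```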

Reproducibility

Code Available

Data Available

Open Source Status

  • Partial

Risks & Boundaries

Limitations

  • Evaluations used greedy decoding (temperature=0); gains may differ with sampling or higher temperatures.
  • Tree structure is hand-tuned per model family and may need task-specific engineering.
  • Reported results use batch size 1 and H800 GPU; multi-GPU or different hardware may change speedups.
  • Larger k reduces acceptance and τ; trade-offs require tuning for each workload.

When Not To Use

  • When you rely on non-greedy sampling or stochastic decoding (temperature>0) without further validation.
  • If you cannot afford to train a small drafter head or adjust tree structure for your model.
  • When outputs must match the base LLM exactly under sampling regimes other than greedy decoding.

Failure Modes

  • Large k values lower acceptance, increasing verification overhead and reducing net speedup.
  • Poorly chosen decoding tree can hurt acceptance and negate latency gains.
  • Drafter mistakes cause extra verification rounds and reduce throughput.

Core Entities

Models

  • Vicuna-7B
  • Vicuna-13B
  • LLaMA2-Chat-7B
  • LLaMA2-Chat-13B
  • Falcon drafter (two-layer hybrid Transformer + LSTM head)
  • Medusa
  • Eagle
  • Lookahead
  • SPS
  • PLD

Metrics

  • Wall-time speedup ratio
  • Acceptance rate (α)
  • Average acceptance length (τ)

Datasets

  • MT-Bench
  • HumanEval
  • GSM8K
  • ShareGPT (training data)

Benchmarks

  • MT-Bench
  • HumanEval
  • GSM8K