Overview
Production Readiness
0.8
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
Falcon can speed up greedy LLM inference by roughly 3× by adding a small drafter, enabling faster real-time responses on constrained hardware without retraining the full LLM.
Summary TLDR
Falcon is a semi‑autoregressive speculative decoding system that trains a tiny drafter (roughly two Transformer layers) using Coupled Sequential Glancing Distillation (CSGD) and runs a custom decoding tree to propose many tokens per forward pass. On Vicuna and LLaMA2-Chat the authors report lossless speedups of about 2.91×–3.51× versus standard autoregressive decoding while keeping high acceptance rates (~75–80%). In exchange for training a small drafter, Falcon reduces memory-bound reads of the main model's parameters and cuts wall-clock latency for greedy decoding.
Problem Statement
Autoregressive decoding of large models is slow and memory-bandwidth bound. Existing speculative decoders use either sequential drafters, which are slow, or parallel drafters, which are faster but less accurate because they fail to capture dependencies among tokens generated in the same block. The result is a tension between low drafting latency and high token acceptance by the main LLM.
Main Contribution
Falcon: an enhanced semi‑autoregressive (SAR) speculative decoding framework that raises drafter parallelism and draft quality.
Coupled Sequential Glancing Distillation (CSGD): a training approach that strengthens inter-token dependence inside each drafted block.
Custom-designed decoding tree: organizes multi-token drafts and supports multiple drafter forward passes to raise acceptance and speed.
Empirical wins: lossless wall-time speedups of ~2.91×–3.51× on Vicuna / LLaMA2-Chat across MT-Bench, HumanEval, GSM8K with a compact drafter.
Key Findings
Falcon achieves a lossless wall-time speedup of roughly 2.91×–3.51× versus vanilla autoregressive decoding on evaluated models.
Falcon raises token acceptance rates to about 74%–80%, beating SAR baseline Medusa by ~16 percentage points on tested settings.
A compact drafter (size comparable to two Transformer layers) is sufficient to get the reported gains.
CSGD and tree attention both materially improve metrics: CSGD gave ~1.17× speedup uplift and +3.26% in acceptance; tree attention gave ~1.12× speedup uplift and +1.22% in acceptance (ablation).
Results
Wall-time speedup (MT-Bench, greedy)
Wall-time speedup (range across benchmarks)
Acceptance rate (α)
Average acceptance length (τ)
Drafter size
Who Should Care
What To Try In 7 Days
Run the Falcon repo on a dev GPU and reproduce MT-Bench speedups with your LLM and batch size 1.
Train a compact SAR head (2-layer) on a small ShareGPT subset to test acceptance rate gains.
Tune k (the number of tokens drafted per forward pass) and the tree shape: measure the acceptance rate (α) and average acceptance length (τ) trade-offs for your task and latency budget.
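The measurements mentioned above can be sketched as a small helper. This is a minimal sketch, not code from the Falcon repo: the `Round` record and the function names are hypothetical, α is assumed to mean accepted drafted tokens divided by drafted tokens, and τ is assumed to mean tokens emitted per target-model verification pass, counting the one token the target model contributes itself.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Round:
    """One draft-then-verify round (hypothetical log record)."""
    drafted: int   # tokens proposed on the path that was checked
    accepted: int  # drafted tokens the target model confirmed


def summarize(rounds: List[Round]) -> Tuple[float, float]:
    """Return (alpha, tau) under the definitions assumed in the lead-in."""
    if not rounds:
        return 0.0, 0.0
    drafted = sum(r.drafted for r in rounds)
    accepted = sum(r.accepted for r in rounds)
    alpha = accepted / drafted if drafted else 0.0
    # Each verification pass also yields one token from the target model,
    # hence the extra len(rounds) in the numerator (assumed convention).
    tau = (accepted + len(rounds)) / len(rounds)
    return alpha, tau


def wall_time_speedup(baseline_seconds: float, speculative_seconds: float) -> float:
    """Ratio of autoregressive wall time to speculative wall time."""
    return baseline_seconds / speculative_seconds
```

Logging one `Round` per verification pass while sweeping k and the tree shape gives a comparable α and τ for each configuration; wall time should be measured on the same prompts with the same batch size.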
Optimization Features
Token Efficiency
- Higher token acceptance rate reduces wasted verification
Infra Optimization
- Evaluated on H800 server; designed for memory-bound GPUs
Model Optimization
- Compact drafter design (two Transformer layers)
System Optimization
- Reduces memory-bound parameter reads by batching draft/verify work
Training Optimization
- Coupled Sequential Glancing Distillation (CSGD)
- Data augmentation that adds uniform noise to features
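The noise augmentation listed above might look like the following. This is only an illustrative sketch: the summary does not state the noise range or where in the training pipeline it is applied, so the `scale` default and the function name are assumptions, and plain Python lists stand in for feature tensors.

```python
import random
from typing import List, Optional


def add_uniform_noise(
    features: List[List[float]],
    scale: float = 0.1,  # assumed range; the summary gives no value
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Return a copy of `features` with U(-scale, scale) noise added per element."""
    rng = rng or random.Random()
    return [[x + rng.uniform(-scale, scale) for x in row] for row in features]
```

Passing a seeded `random.Random` keeps the augmentation reproducible across training runs.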
Inference Optimization
- Semi-autoregressive drafting (multi-token per forward)
- Custom decoding tree with tree attention
- Parallel draft verification (tree-based)
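To make the verification step concrete, here is a minimal sketch of greedy verification over candidate paths from a draft tree. It is not the authors' implementation. In a real system the target model scores every tree node in one forward pass using a tree attention mask; here the hypothetical callable `target_next` stands in for that pass and returns the target model's greedy next token for a given accepted prefix.

```python
from typing import Callable, List, Sequence, Tuple


def verify_paths(
    paths: Sequence[Sequence[int]],
    target_next: Callable[[Tuple[int, ...]], int],
) -> List[int]:
    """Return the tokens to emit this round under greedy verification.

    paths: root-to-leaf token sequences proposed by the drafter.
    target_next: stand-in for the target model (assumption); maps the
        tokens accepted so far to the target's greedy next token.
    """
    best: List[int] = []
    for path in paths:
        accepted: List[int] = []
        for token in path:
            # A drafted token is kept only if it matches the target's greedy choice.
            if token != target_next(tuple(accepted)):
                break
            accepted.append(token)
        if len(accepted) > len(best):
            best = accepted
    # The target's own prediction after the accepted prefix is always emitted,
    # so every round makes progress even when nothing is accepted.
    return best + [target_next(tuple(best))]
```

Because every emitted token equals the target model's greedy choice, the output matches plain greedy decoding, which is what "lossless" refers to in this setting.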
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluations used greedy decoding (temperature=0); gains may differ with sampling or higher temperatures.
- Tree structure is hand-tuned per model family and may need task-specific engineering.
- Reported results use batch size 1 and H800 GPU; multi-GPU or different hardware may change speedups.
- Larger k reduces acceptance and τ; trade-offs require tuning for each workload.
When Not To Use
- When you rely on non-greedy sampling or stochastic decoding (temperature>0) without further validation.
- If you cannot afford to train a small drafter head or adjust tree structure for your model.
- When you need outputs that exactly match the base LLM under sampling regimes other than greedy decoding, which the evaluation did not cover.
Failure Modes
- Large k values lower acceptance, increasing verification overhead and reducing net speedup.
- Poorly chosen decoding tree can hurt acceptance and negate latency gains.
- Drafter mistakes cause extra verification rounds and reduce throughput.
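The trade-offs in these failure modes can be explored with a rough back-of-the-envelope model. This is a simplifying assumption rather than a formula from the paper: one round is assumed to cost one target-model forward pass plus some number of drafter passes, each costing a fixed fraction of a target pass, and the round is assumed to emit τ tokens on average. All parameter names are hypothetical.

```python
def estimated_speedup(tau: float, draft_passes: int, draft_cost_ratio: float) -> float:
    """Rough speedup estimate versus one-token-per-pass autoregressive decoding.

    tau: average tokens emitted per verification round.
    draft_passes: drafter forward passes per round.
    draft_cost_ratio: cost of one drafter pass relative to one target pass.
    Ignores tree-attention overhead, kernel launch costs, and batching effects.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    round_cost = 1.0 + draft_passes * draft_cost_ratio
    return tau / round_cost


# Hypothetical configurations, for illustration only (not measured values).
small_k = estimated_speedup(tau=3.5, draft_passes=2, draft_cost_ratio=0.05)
large_k = estimated_speedup(tau=3.0, draft_passes=2, draft_cost_ratio=0.05)
```

Under this model, any change that lowers τ without lowering the drafting cost reduces the estimate, which is the pattern the failure modes above describe. Measured wall time on the target hardware remains the deciding number.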
Core Entities
Models
- Vicuna-7B
- Vicuna-13B
- LLaMA2-Chat-7B
- LLaMA2-Chat-13B
- Falcon drafter (two-layer hybrid Transformer + LSTM head)
- Medusa
- Eagle
- Lookahead
- SPS
- PLD
Metrics
- Wall-time speedup ratio
- Acceptance rate (α)
- Average acceptance length (τ)
Datasets
- MT-Bench
- HumanEval
- GSM8K
- ShareGPT (training data)
Benchmarks
- MT-Bench
- HumanEval
- GSM8K

