Speed up LLM serving by aggregating small models, adapting speculation length, and pipelining verification

February 24, 2024 | 8 min

Overview

Decision Snapshot: Ready For Pilot

Minions is implemented and evaluated on realistic GPUs and datasets; results show strong speedups but rely on careful SSM finetuning, KVCache management, and integration with the serving stack.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 65%

Authors

Siqi Wang, Hailong Yang, Xuezhu Wang, Tongxuan Liu, Pengbo Wang, Xuning Liang, Kejie Ma, Tianyu Feng, Xin You, Yongjun Bao, Yi Liu, Zhongzhi Luan, Depei Qian

Why It Matters For Business

Minions can materially reduce serving latency and multiply throughput for conversational LLMs without retraining large models, lowering operational cost and improving user responsiveness on evaluated workloads.

Summary TLDR

Minions is a serving system that speeds up autoregressive LLM inference without retraining large models. It runs multiple small speculative models (SSMs) in parallel, picks majority-approved token streams, dynamically adapts how many tokens to speculate, and pipelines SSM decoding with LLM verification. On evaluated Llama2-70B and OPT-13B setups, Minions raises acceptance rates of speculative outputs and gives roughly 2–3× average throughput and latency speedups versus a strong baseline (vLLM), with overheads under ~2.4%.

Problem Statement

Autoregressive LLMs generate tokens one at a time, which limits parallelism and puts heavy pressure on the KVCache. Existing speedups (pruning, quantization, retraining) require model changes or tuning. Speculative decoding uses small models to draft tokens but suffers from low acceptance rates, high verification cost, and idle time caused by tightly coupled SSM+LLM execution. The paper asks how to raise acceptance, pick the speculation length efficiently, and remove idle time.
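For readers new to speculative decoding, the loop below is a minimal sketch of one draft-then-verify round under greedy decoding. It is not the Minions implementation: `draft_next`, `target_next`, and the toy integer "models" are hypothetical stand-ins for the SSM and the LLM.

```python
from typing import Callable, List

# Hypothetical interface: given a token sequence, return the greedy next token.
NextFn = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextFn,
                     target_next: NextFn, s: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The small model drafts s tokens autoregressively.
    draft: List[int] = []
    for _ in range(s):
        draft.append(draft_next(prefix + draft))

    # 2. The large model checks the draft position by position.
    #    (A real system scores all positions in one batched forward pass;
    #    the per-position calls here are only for readability.)
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # On the first mismatch, keep the large model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every drafted token matched, so the large model adds one more token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models (assumptions for illustration): the target counts upward mod 10,
# and the draft agrees except right after token 3.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


print(speculative_step([0], toy_draft, toy_target, s=4))  # [1, 2, 3, 4]
```

In this toy run three of the four drafted tokens are accepted, so the round yields four tokens for one verification pass. A low acceptance rate shrinks that payoff, which is the problem the paper targets.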

Main Contribution

Majority-voted speculator: combine multiple small speculative models (SSMs) with weighted, tree-based voting to raise token acceptance without extra LLM work.

Adaptive speculation-length selector: online heuristic that tunes how many tokens SSMs predict to balance verification cost and verified token length.

Key Findings

Majority voting raises acceptance rates of SSM outputs, improving throughput.

Numbers: OPT-13B acceptance rates up to 0.87/0.89/0.78 (finance/chatbot/dialogue); Llama2-70B-chat ~0.54/0.49/0.55

Practical Use: Use a small ensemble of SSMs with runtime weights to increase accepted speculative tokens and boost serving throughput without extra LLM verification cost.

Evidence Ref: Section 5.4, Figure 11
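One way to picture weighted voting over several SSM drafts is the sketch below. It is a simplified assumption, not the paper's exact tree-based scheme: each prefix is scored by the summed weight of the SSMs that proposed it, and the walk follows the heaviest branch only while it holds a strict majority of the total weight. The function name and the example weights are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def majority_vote(proposals: Sequence[Sequence[int]],
                  weights: Sequence[float]) -> List[int]:
    """Return the longest token prefix backed by a strict weighted majority."""
    total = sum(weights)

    # Score every prefix by the summed weight of the SSMs that proposed it.
    prefix_weight: Dict[Tuple[int, ...], float] = defaultdict(float)
    for seq, w in zip(proposals, weights):
        for i in range(1, len(seq) + 1):
            prefix_weight[tuple(seq[:i])] += w

    # Walk down the implicit prefix tree, following the heaviest child
    # only while it carries more than half of the total weight.
    path: Tuple[int, ...] = ()
    while True:
        children = {p: w for p, w in prefix_weight.items()
                    if len(p) == len(path) + 1 and p[:-1] == path}
        if not children:
            break
        best, best_w = max(children.items(), key=lambda kv: kv[1])
        if best_w <= total / 2:
            break
        path = best
    return list(path)


drafts = [[5, 7, 9], [5, 7, 2], [5, 1, 1]]
print(majority_vote(drafts, [1.0, 1.0, 1.0]))  # [5, 7]
print(majority_vote(drafts, [3.0, 1.0, 1.0]))  # [5, 7, 9]
```

With equal weights the ensemble agrees only on the first two tokens. Giving the first SSM a higher weight, for example because it has been accepted more often at runtime, lets its full draft through.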

Adaptive selection of speculation length gives large throughput gains compared to a fixed choice.

Numbers: Selector added +35.07% throughput (OPT-13B) and +5.91% (Llama2-70B-chat) in ablation

Practical Use: Monitor LLM verification timing and verified token length and tune speculation length online instead of using a fixed s.

Evidence Ref: Section 5.3 (Ablation)
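The online tuning described above could look roughly like the hill-climbing sketch below. This is an assumed heuristic for illustration, not the selector from the paper: it tracks verified tokens per second of verification time and reverses direction whenever that rate drops. The class name, bounds, and step size are all hypothetical.

```python
from typing import Optional


class SpeculationLengthSelector:
    """Hill-climb the speculation length s on verified tokens per second."""

    def __init__(self, s_init: int = 4, s_min: int = 1, s_max: int = 16) -> None:
        self.s = s_init
        self.s_min = s_min
        self.s_max = s_max
        self.direction = 1                      # +1 grows s, -1 shrinks s
        self.prev_rate: Optional[float] = None  # rate seen in the previous round

    def update(self, verified_tokens: int, verify_seconds: float) -> int:
        """Record one verification round and return the next s to use."""
        if verify_seconds <= 0:
            raise ValueError("verify_seconds must be positive")
        rate = verified_tokens / verify_seconds
        # If the last change made things worse, turn around.
        if self.prev_rate is not None and rate < self.prev_rate:
            self.direction = -self.direction
        self.prev_rate = rate
        # Take one step and clamp to the allowed range.
        self.s = min(self.s_max, max(self.s_min, self.s + self.direction))
        return self.s


selector = SpeculationLengthSelector(s_init=4)
history = [selector.update(v, t) for v, t in [(3, 0.1), (4, 0.1), (4, 0.2), (4, 0.1)]]
print(history)  # [5, 6, 5, 4]
```

In the example, the third round verifies the same number of tokens in twice the time, so the selector starts shrinking s.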

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Throughput vs vLLM | ≈3.02× (Llama2-70B finance, avg) | vLLM | ≈+202% | finance (Llama2-70B) | Section 5.2, Figure 9; paper reports ~3.02× throughput vs vLLM | Section 5.2, Figure 9
Normalized latency vs vLLM | ≈3.10× speedup (Llama2-70B finance, avg) | vLLM | ≈+210% | finance (Llama2-70B) | Section 5.2, Figure 8; paper reports average speedup 3.10× vs vLLM | Section 5.2, Figure 8
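The Delta values follow directly from the reported multipliers. The short sketch below shows the arithmetic and also expresses the latency speedup as a reduction in latency, which is often the easier figure to reason about. The helper names are illustrative.

```python
def speedup_to_delta_pct(speedup: float) -> float:
    """Convert an N-times speedup into a percentage gain over the baseline."""
    return (speedup - 1.0) * 100.0


def speedup_to_latency_reduction_pct(speedup: float) -> float:
    """Convert an N-times latency speedup into the percentage of latency removed."""
    return (1.0 - 1.0 / speedup) * 100.0


print(round(speedup_to_delta_pct(3.02)))                 # 202
print(round(speedup_to_delta_pct(3.10)))                 # 210
print(round(speedup_to_latency_reduction_pct(3.10), 1))  # 67.7
```

So a 3.10× latency speedup means normalized latency falls to about a third of the vLLM baseline, a reduction of roughly 68%.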

What To Try In 7 Days

Prototype Minions on top of vLLM for one LLM (e.g., OPT-13B or Llama2-70B) using 2–3 fine-tuned SSMs.

Fine-tune or distill small SSMs on your domain data to raise acceptance rate.

Enable an intermediate result pool and add simple throttling so SSMs do not overflow KVCache memory. For GPU serving trials, share the GPU between SSMs and the LLM with NVIDIA MPS or an equivalent sharing technology.
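A bounded queue is one simple way to prototype the intermediate result pool with throttling. The sketch below is an assumption-laden toy, not the Minions design: `speculate` and `verify` are hypothetical placeholders for SSM drafting and LLM verification, and the queue capacity stands in for a KVCache budget.

```python
import queue
import threading
from typing import List


def speculate(request_id: int) -> List[int]:
    """Hypothetical SSM draft: two placeholder tokens per request."""
    return [request_id, request_id + 1]


def verify(draft: List[int]) -> List[int]:
    """Hypothetical LLM verification: accept only the first drafted token."""
    return draft[:1]


def run_pipeline(num_requests: int, pool_capacity: int = 2) -> List[List[int]]:
    """Run drafting and verification concurrently through a bounded pool."""
    # The bounded pool decouples the two stages; put() blocks when it is full,
    # which throttles the SSM side instead of letting drafts pile up.
    pool: queue.Queue = queue.Queue(maxsize=pool_capacity)
    verified: List[List[int]] = []

    def ssm_worker() -> None:
        for request_id in range(num_requests):
            pool.put(speculate(request_id))
        pool.put(None)  # sentinel: no more drafts

    def llm_worker() -> None:
        while True:
            draft = pool.get()
            if draft is None:
                break
            verified.append(verify(draft))

    threads = [threading.Thread(target=ssm_worker),
               threading.Thread(target=llm_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return verified


print(run_pipeline(5, pool_capacity=2))  # [[0], [1], [2], [3], [4]]
```

In a real trial the capacity would be derived from free KVCache memory rather than a fixed count, and the two workers would be separate GPU processes rather than Python threads.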

Optimization Features

Token Efficiency: Speculation acceptance improvement
Infra Optimization: NVIDIA MPS for concurrent GPU sharing; Batch-size-aware speculation control
System Optimization: KVCache-aware throttling; Intermediate result pool (decoupling)
Training Optimization: SSM distillation (fine-tuning)
Inference Optimization: Speculative Decoding; Majority Voting; Adaptive Speculation Length; Pipelined Execution

Risks & Boundaries

Limitations

Intermediate result pool can accumulate KVCache and trigger swapping or recomputation if not throttled carefully (memory pressure).

Requires a set of effective SSMs; poor SSM alignment lowers acceptance and gains.

When Not To Use

If you already run highly optimized, compiled inference stacks (e.g., TensorRT-LLM) that outperform speculative pipelines in your config.

When you cannot host additional SSM instances due to strict memory limits.

Failure Modes

Low acceptance rate from SSM ensemble reduces net speedup or increases LLM verification overhead.

Uncontrolled SSM speculation fills GPU memory with KVCache, causing swapping and slower end-to-end performance.

Core Entities

Models

Llama2-70B-chat, OPT-13B, Llama-160M (SSM), OPT-125M (SSM)

Metrics

normalized latency (mean latency / output length); throughput (requests/sec); acceptance rate (verified tokens / speculated tokens)
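The three metrics can be computed from per-request logs as in the sketch below. It takes one reading of "mean latency / output length", namely the mean over requests of each request's latency divided by its output length; the function names and sample numbers are illustrative only.

```python
from typing import Sequence


def normalized_latency(latencies_s: Sequence[float],
                       output_lens: Sequence[int]) -> float:
    """Mean over requests of end-to-end latency divided by output length (s/token)."""
    ratios = [lat / n for lat, n in zip(latencies_s, output_lens)]
    return sum(ratios) / len(ratios)


def throughput(num_requests: int, wall_clock_s: float) -> float:
    """Completed requests per second of wall-clock time."""
    return num_requests / wall_clock_s


def acceptance_rate(verified_tokens: int, speculated_tokens: int) -> float:
    """Fraction of speculated tokens that the LLM verified."""
    return verified_tokens / speculated_tokens


print(normalized_latency([2.0, 4.0], [10, 20]))  # 0.2
print(throughput(2, 4.0))                        # 0.5
print(acceptance_rate(54, 100))                  # 0.54
```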

Datasets

Empathetic_Dialogues (dialogue), Chatbot Instruction Prompts (chatbot), Finance Alpaca (finance)