Speed up LLM serving by aggregating small models, adapting speculation length, and pipelining verification

February 24, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

0

Authors

Siqi Wang, Hailong Yang, Xuezhu Wang, Tongxuan Liu, Pengbo Wang, Xuning Liang, Kejie Ma, Tianyu Feng, Xin You, Yongjun Bao, Yi Liu, Zhongzhi Luan, Depei Qian

Links

Abstract / PDF

Why It Matters For Business

Minions can materially reduce serving latency and multiply throughput for conversational LLMs without retraining large models, lowering operational cost and improving user responsiveness on evaluated workloads.

Summary TLDR

Minions is a serving system that speeds up autoregressive LLM inference without retraining large models. It runs multiple small speculative models (SSMs) in parallel, picks majority-approved token streams, dynamically adapts how many tokens to speculate, and pipelines SSM decoding with LLM verification. On evaluated Llama2-70B and OPT-13B setups, Minions raises acceptance rates of speculative outputs and gives roughly 2–3× average throughput and latency speedups versus a strong baseline (vLLM), with overheads under ~2.4%.

Problem Statement

Autoregressive LLMs generate tokens one at a time, which limits parallelism and puts heavy pressure on the KVCache. Existing speedups (pruning, quantization, retraining) require changing or tuning the model. Speculative decoding drafts tokens with small models instead, but it suffers from low acceptance rates, high verification cost, and idle time caused by tightly coupled SSM and LLM execution. The paper asks how to raise acceptance, choose the speculation length efficiently, and remove that idle time.
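To make the acceptance problem concrete, the following is a minimal sketch of one verification step in greedy speculative decoding. It is a simplification, not the paper's tree-based verifier: it assumes the LLM has already produced its own greedy token for every speculated position plus one extra position in a single forward pass. The function names and the token-ID lists are hypothetical.

```python
from typing import List, Sequence


def accepted_prefix_len(speculated: Sequence[int], llm_tokens: Sequence[int]) -> int:
    """Count how many leading speculated tokens match the LLM's own choices."""
    n = 0
    for spec_tok, llm_tok in zip(speculated, llm_tokens):
        if spec_tok != llm_tok:
            break
        n += 1
    return n


def verify_step(speculated: Sequence[int], llm_tokens: Sequence[int]) -> List[int]:
    """Return the tokens to commit after one verification pass.

    The accepted prefix is kept. The LLM's token at the first mismatch
    (or the extra token after a fully accepted draft) is appended, so
    every step commits at least one token when llm_tokens is non-empty.
    """
    n = accepted_prefix_len(speculated, llm_tokens)
    committed = list(speculated[:n])
    if n < len(llm_tokens):
        committed.append(llm_tokens[n])
    return committed


if __name__ == "__main__":
    # Two of three drafted tokens are accepted, then the LLM corrects the third.
    print(verify_step([5, 7, 9], [5, 7, 8, 2]))
```

A draft that diverges early still costs a full LLM pass but commits only one token, which is why a low acceptance rate erodes the benefit of speculation.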

Main Contribution

Majority-voted speculator: combine multiple small speculative models (SSMs) with weighted, tree-based voting to raise token acceptance without extra LLM work.

Adaptive speculation-length selector: online heuristic that tunes how many tokens SSMs predict to balance verification cost and verified token length.

Speculative generation pipeline: decouple and pipeline SSM decoding and LLM verification to overlap work and reduce idle GPU time.

Prototype implementation (Minions) built on vLLM and empirical evaluation on Llama2-70B/OPT-13B showing substantial latency and throughput gains.
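The majority-voted speculator can be illustrated with a small sketch. This is a flattened, position-by-position approximation of weighted voting, not the paper's tree-based procedure: at each position, only the SSMs whose drafts still agree with the chosen prefix get to vote. The function name, SSM names, and weights are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_vote(
    proposals: Dict[str, Sequence[int]],
    weights: Dict[str, float],
) -> List[int]:
    """Merge several SSM drafts into one token stream by weighted voting."""
    chosen: List[int] = []
    active = set(proposals)  # SSMs whose drafts match the chosen prefix so far
    max_len = max((len(p) for p in proposals.values()), default=0)

    for pos in range(max_len):
        scores: Dict[int, float] = defaultdict(float)
        for name in active:
            draft = proposals[name]
            if pos < len(draft):
                scores[draft[pos]] += weights.get(name, 1.0)
        if not scores:
            break
        # Highest total weight wins; ties go to the smaller token ID
        # so the result is deterministic.
        best = max(scores.items(), key=lambda kv: (kv[1], -kv[0]))[0]
        chosen.append(best)
        active = {
            name for name in active
            if pos < len(proposals[name]) and proposals[name][pos] == best
        }
    return chosen


if __name__ == "__main__":
    drafts = {"ssm_a": [1, 2, 3], "ssm_b": [1, 2, 4], "ssm_c": [1, 5, 6]}
    print(weighted_vote(drafts, {"ssm_a": 1.0, "ssm_b": 1.0, "ssm_c": 1.5}))
```

In this sketch, raising one SSM's weight far enough lets it override the other two, which is the lever a weight-update rule would adjust as acceptance statistics accumulate.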

Key Findings

Majority voting raises acceptance rates of SSM outputs, improving throughput.

Numbers: OPT-13B acceptance rates up to 0.87/0.89/0.78 (finance/chatbot/dialogue); Llama2-70B-chat ~0.54/0.49/0.55

Adaptive selection of speculation length gives large throughput gains compared to a fixed choice.

Numbers: Selector added +35.07% throughput (OPT-13B) and +5.91% (Llama2-70B-chat) in ablation

Pipelining SSM decoding and LLM verification markedly reduces idle time and increases throughput.

Numbers: Pipeline added +34.27% (OPT-13B) and +55.37% (Llama2-70B-chat) throughput in ablation

End-to-end system speedups over a strong baseline (vLLM) are substantial on evaluated workloads.

Numbers: Average speedups vs vLLM ~3.10× (finance), 2.86× (chatbot), 2.70× (dialogue) for Llama2-70B; similar ~3× gains on OPT-13B

System overhead for monitoring, tree voting, and weight updates is small.

Numbers: Overhead ≤ 2.37% of total inference time
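To see why acceptance rate and speculation length interact, a common back-of-the-envelope model for speculative decoding can help. It assumes each speculated token is accepted independently with probability `alpha`, which is a simplification; the acceptance rates reported above are aggregate ratios, so plugging them in gives only a rough estimate. The function name is hypothetical, and this is not the paper's selector.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens committed per LLM verification pass.

    Assumes each of the k speculated tokens is accepted independently
    with probability alpha, and that the LLM always contributes one
    token of its own (the correction or the extra token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.9):
        print(alpha, [round(expected_tokens_per_step(alpha, k), 2) for k in (2, 4, 8)])
```

Under this model, longer speculation pays off much more at high acceptance than at low acceptance, which is consistent with the selector helping most when the right length depends on the workload.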

Results

Throughput vs vLLM

Value: ≈3.02× (Llama2-70B finance, avg)

Baseline: vLLM

Normalized latency vs vLLM

Value: ≈3.10× speedup (Llama2-70B finance, avg)

Baseline: vLLM

Acceptance rate (majority-voted SSMs)

Value: OPT-13B: up to 0.89; Llama2-70B-chat: ~0.55

Baseline: single SSM/previous speculative decoding

Control logic overhead

Value: ≤2.37% of inference time

Baseline: total inference time

Who Should Care

What To Try In 7 Days

Prototype Minions on top of vLLM for one LLM (e.g., OPT-13B or Llama2-70B) using 2–3 fine-tuned SSMs.

Fine-tune or distill small SSMs on your domain data to raise acceptance rate.

Enable an intermediate result pool and add simple throttling so SSMs don't overflow KVCache memory. Run the GPU serving trials with NVIDIA MPS or an equivalent sharing technology.
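The intermediate result pool with throttling can be prototyped as a bounded producer-consumer queue. The sketch below uses Python threads and placeholder arithmetic in place of real SSM decoding and LLM verification. The capacity is counted in drafts rather than KVCache bytes, which is an assumption made for brevity, and `run_pipeline` and its parameters are hypothetical names.

```python
import queue
import threading
from typing import List, Optional


def run_pipeline(num_batches: int, pool_capacity: int) -> List[int]:
    """Overlap drafting and verification through a bounded pool.

    pool_capacity must be >= 1; a full pool blocks the SSM worker,
    which acts as the throttle on speculative work in flight.
    """
    if pool_capacity < 1:
        raise ValueError("pool_capacity must be at least 1")

    pool: "queue.Queue[Optional[List[int]]]" = queue.Queue(maxsize=pool_capacity)
    verified: List[int] = []

    def ssm_worker() -> None:
        for i in range(num_batches):
            draft = [i, i + 1]   # placeholder for SSM speculation
            pool.put(draft)      # blocks while the pool is full
        pool.put(None)           # sentinel: no more drafts

    def llm_worker() -> None:
        while True:
            draft = pool.get()
            if draft is None:
                break
            verified.append(sum(draft))  # placeholder for LLM verification

    workers = [threading.Thread(target=ssm_worker), threading.Thread(target=llm_worker)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return verified


if __name__ == "__main__":
    print(run_pipeline(num_batches=5, pool_capacity=2))
```

A small capacity keeps memory bounded at the cost of occasionally stalling the SSMs, while a large one risks the KVCache buildup listed under Limitations, so the value is worth tuning per deployment.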

Optimization Features

Token Efficiency

  • Speculation acceptance improvement

Infra Optimization

  • NVIDIA MPS for concurrent GPU sharing
  • Batch-size-aware speculation control

System Optimization

  • KVCache-aware throttling
  • Intermediate result pool (decoupling)

Training Optimization

  • SSM distillation (fine-tuning)

Inference Optimization

  • Speculative Decoding
  • Majority Voting
  • Adaptive Speculation Length
  • Pipelined Execution

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Intermediate result pool can accumulate KVCache and trigger swapping or recomputation if not throttled carefully (memory pressure).
  • Requires a set of effective SSMs; poor SSM alignment lowers acceptance and gains.
  • Prototype built on vLLM; adapting to highly optimized frameworks (TensorRT) needs engineering to match peak throughput.

When Not To Use

  • If you already run highly optimized, compiled inference stacks (e.g., TensorRT-LLM) that outperform speculative pipelines in your config.
  • When you cannot host additional SSM instances due to strict memory limits.
  • If domain data prevents effective SSM distillation / fine-tuning and acceptance stays low.

Failure Modes

  • Low acceptance rate from SSM ensemble reduces net speedup or increases LLM verification overhead.
  • Uncontrolled SSM speculation fills GPU memory with KVCache, causing swapping and slower end-to-end performance.
  • Adaptive selector oscillates if decision thresholds are poorly set, causing unstable throughput.
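One way to guard against the oscillation failure mode is to smooth the observed acceptance rate and leave a gap between the raise and lower thresholds. The sketch below is an illustrative heuristic, not the paper's selector, which also weighs verification cost and batch size. The class name and every default value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpecLengthSelector:
    """Adjust speculation length from a smoothed acceptance rate."""

    length: int = 4
    min_length: int = 1
    max_length: int = 16
    raise_above: float = 0.8   # grow only when smoothed acceptance is high
    lower_below: float = 0.5   # shrink only when it is clearly low
    smoothing: float = 0.3     # weight given to the newest observation
    ema: float = 0.65          # starts inside the dead band

    def update(self, accepted: int, speculated: int) -> int:
        """Record one verification result and return the new length."""
        if speculated <= 0:
            return self.length
        rate = accepted / speculated
        self.ema = (1.0 - self.smoothing) * self.ema + self.smoothing * rate
        if self.ema > self.raise_above:
            self.length = min(self.length + 1, self.max_length)
        elif self.ema < self.lower_below:
            self.length = max(self.length - 1, self.min_length)
        # Between the two thresholds the length is left unchanged.
        return self.length


if __name__ == "__main__":
    selector = SpecLengthSelector()
    for accepted in (4, 4, 4, 1, 0, 0):
        print(selector.update(accepted, 4), round(selector.ema, 3))
```

The dead band between `lower_below` and `raise_above` means a single noisy step does not flip the decision, at the cost of reacting a few steps late to a real shift in acceptance.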

Core Entities

Models

  • Llama2-70B-chat
  • OPT-13B
  • Llama-160M (SSM)
  • OPT-125M (SSM)

Metrics

  • normalized latency (mean latency / output length)
  • throughput (requests/sec)
  • acceptance rate (verified tokens / speculated tokens)
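The three metrics above can be computed from per-request logs as in the sketch below. It reads "normalized latency" as the mean over requests of each request's latency divided by its output length, which is one plausible interpretation of the definition given here. The function and parameter names are hypothetical.

```python
from typing import Sequence


def normalized_latency(latencies_s: Sequence[float], output_lengths: Sequence[int]) -> float:
    """Mean of per-request latency divided by output length (seconds per token)."""
    if len(latencies_s) != len(output_lengths) or not latencies_s:
        raise ValueError("need equal-length, non-empty inputs")
    per_request = [lat / n for lat, n in zip(latencies_s, output_lengths)]
    return sum(per_request) / len(per_request)


def throughput(num_requests: int, wall_time_s: float) -> float:
    """Completed requests per second over the measurement window."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return num_requests / wall_time_s


def acceptance_rate(verified_tokens: int, speculated_tokens: int) -> float:
    """Fraction of speculated tokens that the LLM verified."""
    if speculated_tokens <= 0:
        raise ValueError("speculated_tokens must be positive")
    return verified_tokens / speculated_tokens


if __name__ == "__main__":
    print(normalized_latency([2.0, 3.0], [10, 30]))
    print(throughput(120, 60.0))
    print(acceptance_rate(87, 100))
```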

Datasets

  • Empathetic_Dialogues (dialogue)
  • Chatbot Instruction Prompts (chatbot)
  • Finance Alpaca (finance)