Route inputs by ensemble agreement to cut inference cost (2–25×) while matching or improving accuracy

Overview

Decision Snapshot: Needs Validation

ABC is practical when smaller models are much cheaper or ensembles can be run in parallel; calibration needs ~100 samples and empirical gains are shown across multiple realistic cost models.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 85%

Production readiness: 75%

Novelty: 50%

Authors

Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith

Links

Abstract / PDF / Data

Why It Matters For Business

ABC can cut real inference costs quickly by routing easy inputs to cheap models, reducing cloud, network, and API bills while keeping quality.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Data Scientist Founder

Summary TLDR

Agreement-Based Cascading (ABC) is a training-free way to build model cascades that routes examples based on agreement among small ensembles. When small models are much cheaper or can be run in parallel, ABC keeps or improves accuracy (usually +1–2 points) while cutting costs: up to 14× communication savings (edge-to-cloud), ~3× GPU rental savings, and 2–25× lower API cost per request on evaluated tasks. ABC needs only ~100 validation samples to set voting thresholds and works as a drop-in replacement for many deployments, but it is not suited to open-ended generation or cases where small/large models cost nearly the same.

Problem Statement

Large models are expensive to run. We need a simple, general way to avoid calling the largest model on easy inputs while preserving accuracy. The paper asks: can we use agreement among small pretrained models as a cheap, reliable deferral signal so fewer inputs reach expensive models?

Main Contribution

Agreement-Based Cascading (ABC): a training-free cascade that defers when ensemble members disagree.

Theoretical analysis: defines "safe deferral rules" that guarantee no accuracy loss and characterizes the cost trade-offs in terms of γ (the cost of the small models relative to the large one) and ρ (the degree of parallelism).
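The deferral idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `agreement_vote`, `abc_predict`, the `Model` type, and the toy models are hypothetical names, and the sketch assumes each model returns a discrete label and that agreement is measured as the fraction of ensemble members voting for the majority label.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# Hypothetical model type: maps an input to a discrete label.
Model = Callable[[object], Hashable]


def agreement_vote(predictions: Sequence[Hashable]) -> tuple[Hashable, float]:
    """Return the majority label and the fraction of members that voted for it."""
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)


def abc_predict(x, tiers: Sequence[Sequence[Model]], thresholds: Sequence[float]):
    """Walk the tiers from cheapest to most expensive.

    A tier answers when its agreement reaches that tier's threshold;
    otherwise the input is deferred. The last tier always answers.
    Returns (label, index of the tier that answered).
    """
    for level, (ensemble, theta) in enumerate(zip(tiers[:-1], thresholds)):
        label, agreement = agreement_vote([m(x) for m in ensemble])
        if agreement >= theta:
            return label, level
    label, _ = agreement_vote([m(x) for m in tiers[-1]])
    return label, len(tiers) - 1


# Toy stand-ins (assumptions): two small models compute parity, one always says 0.
small = [lambda x: x % 2, lambda x: x % 2, lambda x: 0]
large = [lambda x: x % 2]

easy = abc_predict(4, [small, large], [1.0])  # all three small models agree on 0
hard = abc_predict(3, [small, large], [1.0])  # votes are 1, 1, 0, so the input is deferred
```

With a threshold of 1.0 the small tier answers only on unanimous votes; lowering it trades fewer deferrals for more risk of accepting a wrong majority.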

Key Findings

ABC matches or improves accuracy over the single best model while lowering compute.

Numbers: Accuracy +1–2 percentage points on Pareto frontier (Figure 2).

Practical Use: Try ABC as a drop-in: it can raise accuracy slightly while reducing average compute if you have ensembleable small models.

Evidence Ref: §5.1.1, Figure 2

ABC cuts edge-to-cloud communication costs up to 14× on some language tasks.

Numbers: Up to 14× reduction in communication latency/cost (SST-2); 5–8× for ImageNet/CIFAR (Figure 4).

Practical Use: Place small tiers on-device and only send ambiguous inputs to cloud models to reduce network cost and latency.

Evidence Ref: §5.2.1, Figure 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	+1–2 pp	best single model at same FLOPs	+1–2 pp	various vision and language benchmarks	ABC shifts Pareto frontier; ensembles add small accuracy gains (Figure 2).	§5.1.1 Figure 2
Communication cost reduction (edge-to-cloud)	up to 14×	single cloud model	×14 (max)	SST-2; ImageNet-1K; CIFAR-10	Model placement reduces network transfers; 14× reported on SST-2 (Figure 4).	§5.2.1 Figure 4

What To Try In 7 Days

Inventory existing pretrained models by size and accuracy and pick a two-level cascade.

Estimate voting threshold θ on ~100 held-out samples to set a safe deferral rule.

Simulate costs with your γ (relative cost) and ρ (parallelism) to predict savings using Table 5 style breakdowns from the paper.
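The threshold step above might look like the sketch below. The source does not spell out the calibration procedure, so this assumes one plausible rule: pick the smallest agreement threshold whose accepted validation samples reach a target accuracy (for example, the large model's accuracy). `calibrate_threshold` and its arguments are hypothetical names, and the six-sample data is made up purely for illustration.

```python
from typing import Sequence


def calibrate_threshold(
    agreements: Sequence[float],
    ensemble_correct: Sequence[bool],
    target_accuracy: float,
) -> float:
    """Pick the smallest agreement threshold whose accepted subset meets the target.

    agreements[i]: fraction of ensemble members voting for the majority label on sample i.
    ensemble_correct[i]: whether that majority label was correct.
    Returns infinity (defer everything) if no threshold meets the target.
    """
    for theta in sorted(set(agreements)):
        accepted = [ok for a, ok in zip(agreements, ensemble_correct) if a >= theta]
        if accepted and sum(accepted) / len(accepted) >= target_accuracy:
            return theta
    return float("inf")


# Illustrative held-out data (assumption): agreement level and correctness per sample.
val_agreements = [1.0, 1.0, 1.0, 2 / 3, 2 / 3, 1 / 3]
val_correct = [True, True, True, True, False, False]

theta_loose = calibrate_threshold(val_agreements, val_correct, 0.75)
theta_strict = calibrate_threshold(val_agreements, val_correct, 0.90)
```

Here the looser target accepts samples with at least two of three votes, while the stricter target accepts only unanimous ones. In practice the held-out set should resemble production traffic, since a threshold fitted on unrepresentative samples will defer too much or too little.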

Optimization Features

Token Efficiency

API cost-aware routing

Infra Optimization

Map tiers to different GPU generations to lower rental cost

Model Optimization

Model Cascades

Ensembling for robustness

System Optimization

Parallel ensemble execution (ρ)

Edge placement to reduce communication

Training Optimization

No additional training required

Inference Optimization

Data-dependent routing

Heterogeneous placement (cheap GPUs vs. expensive GPUs)

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

ImageNet-1K, CIFAR-10, SST-2, SWAG, GSM8K, CoQA

Risks & Boundaries

Limitations

Not applicable to open-ended generation tasks in current form.

Saves cost only when relative cost γ is small or parallelism ρ is available.

When Not To Use

Open-ended generative tasks without a fixed discrete output.

Deployments where small and large models have similar cost (γ ≥ 1/5) and no parallelism.
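A back-of-the-envelope cost model shows why these boundaries matter. The paper's exact cost formulas are not reproduced here; this sketch assumes a two-level cascade, normalizes the large model's cost to 1, and treats ρ as a value in [0, 1] where 0 means ensemble members run one after another and 1 means they run fully in parallel. `expected_cost` and its parameters are hypothetical names.

```python
def expected_cost(gamma: float, k: int, rho: float, deferral_rate: float) -> float:
    """Expected per-input cost of a two-level cascade, relative to always calling the large model.

    gamma: cost of one small model relative to the large model (large = 1).
    k: number of small models in the ensemble.
    rho: assumed parallelism in [0, 1]; 0 = sequential, 1 = fully parallel.
    deferral_rate: fraction of inputs passed on to the large model.
    """
    # Sequential ensembles pay for all k members; fully parallel ones pay for one.
    ensemble_cost = gamma * (k - rho * (k - 1))
    return ensemble_cost + deferral_rate * 1.0


# Cheap small models run in parallel, 20% of inputs deferred.
favorable = expected_cost(gamma=0.02, k=3, rho=1.0, deferral_rate=0.2)

# Small models at 1/5 the large model's cost, run sequentially, 50% deferred.
unfavorable = expected_cost(gamma=0.2, k=3, rho=0.0, deferral_rate=0.5)
```

Under these assumptions the favorable setting costs roughly 0.22 of the large-model baseline, while the unfavorable one costs more than calling the large model directly.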

Failure Modes

Over-deferral or under-deferral if threshold θ is misestimated on non-representative validation data.

High sequential cost if ensemble members run serially and γ is not tiny.

Core Entities

Models

LLaMA 3.1, Gemma 2, Qwen 2, ResNet, ViT, CLIP, BERT, RoBERTa, XLNet, ELECTRA

Metrics

Accuracy, FLOPs, selection rate, latency, GPU cost ($/hour), API price ($/million tokens), F1

Datasets

ImageNet-1K, CIFAR-10, SST-2, Twitter Financial News, SWAG, GSM8K, CoQA, OVERRULING, HEADLINES

