Route inputs by ensemble agreement to cut inference cost (2–25×) while matching or improving accuracy

July 2, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

ABC is practical when smaller models are much cheaper or ensembles can be run in parallel; calibration needs ~100 samples and empirical gains are shown across multiple realistic cost models.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 85%

Production readiness: 75%

Novelty: 50%

Authors

Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith


Why It Matters For Business

ABC can cut real inference costs quickly by routing easy inputs to cheap models, reducing cloud, network, and API bills while keeping quality.

Summary TLDR

Agreement-Based Cascading (ABC) is a training-free way to build model cascades that routes examples based on agreement among small ensembles. When small models are much cheaper or can be run in parallel, ABC keeps or improves accuracy (usually +1–2 points) while cutting costs: up to 14× communication savings (edge-to-cloud), ~3× GPU rental savings, and 2–25× lower API cost per request on evaluated tasks. ABC needs only ~100 validation samples to set voting thresholds and works as a drop-in replacement for many deployments, but it is not suited to open-ended generation or cases where small/large models cost nearly the same.

Problem Statement

Large models are expensive to run. We need a simple, general way to avoid calling the largest model on easy inputs while preserving accuracy. The paper asks: can we use agreement among small pretrained models as a cheap, reliable deferral signal so fewer inputs reach expensive models?

Main Contribution

Agreement-Based Cascading (ABC): a training-free cascade that defers when ensemble members disagree.

Theoretical analysis: defines "safe deferral rules" that guarantee no accuracy loss and characterizes the cost trade-off in terms of the relative cost γ and the degree of parallelism ρ.
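The routing idea can be illustrated with a minimal sketch. It assumes each model is a plain callable that returns a discrete label; the names `agreement` and `abc_predict`, and the use of majority vote share as the agreement score, are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# Hypothetical model interface: maps an input to a discrete label.
Model = Callable[[object], Hashable]


def agreement(votes: Sequence[Hashable]) -> tuple[Hashable, float]:
    """Return the majority label and the fraction of members that voted for it."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def abc_predict(x, tiers: Sequence[Sequence[Model]], thetas: Sequence[float]):
    """Walk the tiers from cheapest to most expensive.

    A tier answers when its vote share reaches that tier's threshold;
    otherwise the input is deferred to the next tier. The last tier
    always answers. Returns (label, index of the tier that answered).
    """
    for i, (ensemble, theta) in enumerate(zip(tiers[:-1], thetas)):
        label, share = agreement([m(x) for m in ensemble])
        if share >= theta:
            return label, i
    label, _ = agreement([m(x) for m in tiers[-1]])
    return label, len(tiers) - 1
```

With a two-tier cascade, `thetas` holds a single threshold for the small ensemble; a value of 1.0 means the small tier answers only when all of its members agree.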

Key Findings

ABC matches or improves accuracy over the single best model while lowering compute.

Numbers: Accuracy +1–2 percentage points on the Pareto frontier (Figure 2).

Practical Use: Try ABC as a drop-in; it can raise accuracy slightly while reducing average compute if you have small models that can be ensembled.

Evidence Ref: §5.1.1, Figure 2

ABC cuts edge-to-cloud communication costs up to 14× on some language tasks.

Numbers: Up to 14× reduction in communication latency/cost (SST-2); 5× for ImageNet/CIFAR (Figure 4).

Practical Use: Place small tiers on-device and send only ambiguous inputs to cloud models to reduce network cost and latency.

Evidence Ref: §5.2.1, Figure 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | +1–2 pp | best single model at same FLOPs | +1–2 pp | various vision and language benchmarks | ABC shifts Pareto frontier; ensembles add small accuracy gains (Figure 2). | §5.1.1, Figure 2
Communication cost reduction (edge-to-cloud) | up to 14× | single cloud model | ×14 (max) | SST-2; ImageNet-1K; CIFAR-10 | Model placement reduces network transfers; 14× reported on SST-2 (Figure 4). | §5.2.1, Figure 4

What To Try In 7 Days

Inventory existing pretrained models by size and accuracy and pick a two-level cascade.

Estimate voting threshold θ on ~100 held-out samples to set a safe deferral rule.

Simulate costs with your γ (relative cost) and ρ (parallelism) to predict savings using Table 5 style breakdowns from the paper.
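The threshold step above might be prototyped along the following lines. This is a sketch under stated assumptions, not the paper's calibration procedure: it assumes you have already recorded, for each held-out example, the small ensemble's majority vote share and whether that majority label was correct, and it picks the smallest threshold whose non-deferred examples meet a target accuracy (for instance, the large model's validation accuracy). The function name `calibrate_theta` and its parameters are hypothetical.

```python
def calibrate_theta(vote_shares, small_correct, target_acc, candidates=None):
    """Pick the smallest threshold whose non-deferred examples reach target_acc.

    vote_shares:   per-example majority vote share of the small ensemble
    small_correct: per-example bool, True when the majority label was correct
    target_acc:    accuracy the small tier must reach on the inputs it keeps
    Returns None when no candidate meets the target (i.e. defer everything).
    """
    if candidates is None:
        # Assumption: only vote shares observed on the validation set are tried.
        candidates = sorted(set(vote_shares))
    for theta in candidates:
        kept = [c for s, c in zip(vote_shares, small_correct) if s >= theta]
        if kept and sum(kept) / len(kept) >= target_acc:
            return theta
    return None
```

Choosing the smallest qualifying threshold keeps as many inputs as possible on the cheap tier; with only ~100 samples the estimate is noisy, so it is worth re-checking on fresh data before relying on it.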

Optimization Features

Token Efficiency: API cost-aware routing
Infra Optimization: Map tiers to different GPU generations to lower rental cost
Model Optimization: Model cascades; Ensembling for robustness
System Optimization: Parallel ensemble execution (ρ); Edge placement to reduce communication
Training Optimization: No additional training required
Inference Optimization: Data-dependent routing; Heterogeneous placement (cheap GPUs vs. expensive GPUs)

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

ImageNet-1K, CIFAR-10, SST-2, SWAG, GSM8K, CoQA

Risks & Boundaries

Limitations

Not applicable to open-ended generation tasks in current form.

Saves cost only when relative cost γ is small or parallelism ρ is available.
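To see why γ and ρ matter, a back-of-the-envelope cost model can help. The one below is a toy model assumed for illustration and is not taken from the paper: one large-model call costs 1, each small model costs `gamma`, the `k` ensemble members run `rho` at a time, and every deferred input pays one extra large-model call. The names `expected_cost` and `savings_factor` are hypothetical.

```python
import math


def expected_cost(gamma, k, rho, deferral_rate):
    """Expected per-input cost of a two-tier cascade, in large-model calls.

    ASSUMPTION (toy model, not the paper's): each small model costs gamma,
    the k ensemble members run in batches of rho at a time, and every
    deferred input pays one additional large-model call.
    """
    ensemble_cost = gamma * math.ceil(k / rho)
    return ensemble_cost + deferral_rate * 1.0


def savings_factor(gamma, k, rho, deferral_rate):
    """Cost of always calling the large model divided by the cascade's cost."""
    return 1.0 / expected_cost(gamma, k, rho, deferral_rate)
```

Under these assumptions, a three-member ensemble with `gamma = 0.02` run fully in parallel and a 20% deferral rate costs about 0.22 large-model calls per input, while the same ensemble run serially with `gamma = 0.3` costs about 1.1, which is more than calling the large model directly.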

When Not To Use

Open-ended generative tasks without a fixed discrete output.

Deployments where small and large models have similar cost (γ ≥ 1/5) and no parallelism.

Failure Modes

Over-deferral or under-deferral if threshold θ is misestimated on non-representative validation data.

High sequential cost if ensemble members run serially and γ is not tiny.

Core Entities

Models

LLaMA 3.1, Gemma 2, Qwen 2, ResNet, ViT, CLIP, BERT, RoBERTa, XLNet, ELECTRA

Metrics

Accuracy, FLOPs, selection rate, latency, GPU cost ($/hour), API price ($/million tokens), F1

Datasets

ImageNet-1K, CIFAR-10, SST-2, Twitter Financial News, SWAG, GSM8K, CoQA, OVERRULING, HEADLINES