Route inputs by ensemble agreement to cut inference cost (2–25×) while matching or improving accuracy

July 2, 2024 · 8 min

Overview

Production Readiness

0.75

Novelty Score

0.5

Cost Impact Score

0.85

Citation Count

0

Authors

Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith

Links

Abstract / PDF

Why It Matters For Business

ABC can cut real inference costs quickly by routing easy inputs to cheap models, reducing cloud, network, and API bills while keeping quality.

Summary TLDR

Agreement-Based Cascading (ABC) is a training-free way to build model cascades that routes examples based on agreement among small ensembles. When small models are much cheaper or can be run in parallel, ABC keeps or improves accuracy (usually +1–2 points) while cutting costs: up to 14× communication savings (edge-to-cloud), ~3× GPU rental savings, and 2–25× lower API cost per request on evaluated tasks. ABC needs only ~100 validation samples to set voting thresholds and works as a drop-in replacement for many deployments, but it is not suited to open-ended generation or cases where small/large models cost nearly the same.

Problem Statement

Large models are expensive to run. We need a simple, general way to avoid calling the largest model on easy inputs while preserving accuracy. The paper asks: can we use agreement among small pretrained models as a cheap, reliable deferral signal so fewer inputs reach expensive models?

Main Contribution

Agreement-Based Cascading (ABC): a training-free cascade that defers when ensemble members disagree.

Theoretical analysis: defines "safe deferral rules" that guarantee no accuracy loss and characterizes the cost trade-offs in terms of the relative cost γ of the small models and the available parallelism ρ.

Extensive empirical study across vision and language tasks showing accuracy improvements and real-world cost reductions in edge, cloud, and API settings.

Practical calibration method: estimate agreement threshold with ≈100 validation samples.
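The deferral idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each model is a plain callable that returns a discrete label, that tiers are ordered from cheapest to most expensive, and that agreement is measured as the fraction of ensemble members voting for the majority label. The names `vote` and `abc_predict` are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

# Hypothetical model type: any callable mapping an input to a discrete label.
Model = Callable[[object], Hashable]


def vote(ensemble: Sequence[Model], x: object) -> Tuple[Hashable, float]:
    """Return the majority label and the fraction of members that chose it."""
    preds = [model(x) for model in ensemble]
    label, count = Counter(preds).most_common(1)[0]
    return label, count / len(preds)


def abc_predict(
    tiers: List[Sequence[Model]], thresholds: List[float], x: object
) -> Tuple[Hashable, int]:
    """Walk the tiers from cheapest to most expensive.

    `thresholds` holds one agreement threshold per tier except the last.
    A tier answers when its agreement reaches the threshold; otherwise
    the input is deferred. The last tier always answers.
    Returns (label, index of the tier that answered).
    """
    for i, ensemble in enumerate(tiers[:-1]):
        label, agreement = vote(ensemble, x)
        if agreement >= thresholds[i]:
            return label, i
    label, _ = vote(tiers[-1], x)
    return label, len(tiers) - 1


# Toy usage: three cheap "models" (one is biased), one expensive model.
small = [lambda x: x % 2, lambda x: x % 2, lambda x: 0]
large = [lambda x: x % 2]
easy = abc_predict([small, large], [1.0], 4)  # all small models agree
hard = abc_predict([small, large], [1.0], 3)  # disagreement, so defer
```

With a threshold of 1.0 the cheap tier answers only on unanimous votes; lowering the threshold keeps more traffic at the cheap tier at the risk of accepting more wrong majority votes.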

Key Findings

ABC matches or improves accuracy over the single best model while lowering compute.

Numbers: Accuracy +1–2 percentage points on the Pareto frontier (Figure 2).

ABC cuts edge-to-cloud communication costs up to 14× on some language tasks.

Numbers: Up to 14× reduction in communication latency/cost (SST-2); 5–8× for ImageNet/CIFAR (Figure 4).

ABC reduces cloud GPU rental costs by roughly 3× for image tasks under a heterogeneous placement strategy.

Numbers: ~3× cost reduction for image tasks and 10–30% for language tasks (Section 5.2.2, Table 5).

For black-box LLM APIs, voting-based ABC achieves 2–25× reductions in average price per request/token versus SOTA cascades.

Numbers: 2–25× cost reductions versus baselines in API experiments (Figure 5; Table 1 tier pricing).

Most traffic exits early: a majority of inputs are handled by cheap tiers in practice.

Numbers: 52–93% of samples processed at early tiers across datasets (Table 5).

Threshold calibration is cheap and stable.

Numbers: Voting threshold estimates stabilize using ≈100 validation samples (Figure 6).
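One way to picture threshold calibration on a small validation set is shown below. The selection rule here is an assumption made for illustration, not the paper's exact procedure: it picks the smallest agreement threshold at which the samples kept by the small ensemble reach a target accuracy (for example, the large model's validation accuracy). The function name `calibrate_threshold` and its arguments are hypothetical.

```python
import math
from typing import Sequence


def calibrate_threshold(
    agreements: Sequence[float],
    correct: Sequence[bool],
    target_accuracy: float,
) -> float:
    """Pick the smallest threshold whose retained samples meet the target.

    agreements: per-sample agreement fraction of the small ensemble.
    correct: per-sample flag, True if the majority vote was right.
    Returns math.inf when no threshold qualifies, which means
    "defer every input" under an `agreement >= theta` rule.
    """
    for theta in sorted(set(agreements)):
        kept = [c for a, c in zip(agreements, correct) if a >= theta]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return theta
    return math.inf


# Toy validation set of five samples from a 3-member ensemble.
agreements = [1.0, 1.0, 2 / 3, 2 / 3, 1 / 3]
correct = [True, True, True, False, False]
strict = calibrate_threshold(agreements, correct, 0.9)
loose = calibrate_threshold(agreements, correct, 0.7)
```

Accuracy among retained samples is not guaranteed to rise monotonically with the threshold, so a scan like this should be rechecked whenever the validation data changes.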

Results

Accuracy

Value: +1–2 pp

Baseline: best single model at same FLOPs

Communication cost reduction (edge-to-cloud)

Value: up to 14×

Baseline: single cloud model

GPU rental cost reduction

Value: ~3× (image tasks)

Baseline: best single model on A100/H100

API price reduction (black-box LLMs)

Value: 2–25×

Baseline: state-of-the-art cascade baselines

Fraction processed at cheap tiers

Value: 52–93%

Baseline: all samples to big model

Who Should Care

What To Try In 7 Days

Inventory existing pretrained models by size and accuracy and pick a two-level cascade.

Estimate voting threshold θ on ~100 held-out samples to set a safe deferral rule.

Simulate costs with your own γ (relative cost) and ρ (parallelism) to predict savings, using cost breakdowns in the style of the paper's Table 5.
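A back-of-the-envelope cost simulation for a two-level cascade might look like the following. The cost model is a simplifying assumption, not the paper's formula: the large model costs 1.0 per input, each small model costs γ, and ρ is interpreted here as the number of ensemble members that can run concurrently, so the ensemble is charged for ceil(k / ρ) sequential rounds. The paper's own definitions of γ and ρ may differ, and `expected_relative_cost` is a hypothetical helper.

```python
import math


def expected_relative_cost(
    k: int, gamma: float, rho: int, defer_rate: float
) -> float:
    """Expected cost per input relative to always calling the large model.

    k: number of small models in the ensemble.
    gamma: cost of one small model relative to the large model (large = 1.0).
    rho: ASSUMED to be how many members run concurrently (1 = fully serial).
    defer_rate: fraction of inputs forwarded to the large model.
    """
    sequential_rounds = math.ceil(k / rho)
    ensemble_cost = sequential_rounds * gamma  # paid on every input
    return ensemble_cost + defer_rate * 1.0    # large model paid only on deferrals


# Three small models at 5% of the large model's cost, 20% of inputs deferred.
parallel = expected_relative_cost(k=3, gamma=0.05, rho=3, defer_rate=0.2)
serial = expected_relative_cost(k=3, gamma=0.05, rho=1, defer_rate=0.2)
savings_parallel = 1.0 / parallel
```

Since 52–93% of samples are reported to exit at early tiers, deferral rates between roughly 0.07 and 0.48 are a reasonable range to sweep. Values of the result above 1.0 indicate the cascade costs more than calling the large model directly.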

Optimization Features

Token Efficiency

  • API cost-aware routing

Infra Optimization

  • Map tiers to different GPU generations to lower rental cost

Model Optimization

  • Model Cascades
  • Ensembling for robustness

System Optimization

  • Parallel ensemble execution (ρ)
  • Edge placement to reduce communication

Training Optimization

  • No additional training required

Inference Optimization

  • Data-dependent routing
  • Heterogeneous placement (cheap GPUs vs. expensive GPUs)

Reproducibility

Data Urls

  • ImageNet-1K
  • CIFAR-10
  • SST-2
  • SWAG
  • GSM8K
  • CoQA

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Not applicable to open-ended generation tasks in current form.
  • Saves cost only when relative cost γ is small or parallelism ρ is available.
  • Voting can amplify shared biases across models.
  • Sequential ensembles increase latency/cost if small models are not substantially cheaper.

When Not To Use

  • Open-ended generative tasks without a fixed discrete output.
  • Deployments where small and large models have similar cost (γ ≥ 1/5) and no parallelism.
  • Settings with severe distribution shift where validation samples cannot approximate test-time data.

Failure Modes

  • Over-deferral or under-deferral if threshold θ is misestimated on non-representative validation data.
  • High sequential cost if ensemble members run serially and γ is not tiny.
  • Majority voting fails when all small models share correlated errors.
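The correlated-error failure mode can be probed on held-out data before deployment. The diagnostic below is an illustrative sketch rather than something taken from the paper: it measures how often every small model outputs the same wrong label, which is exactly the case where an agreement rule would confidently return a wrong answer. The function name `unanimous_error_rate` is hypothetical.

```python
from typing import Hashable, Sequence


def unanimous_error_rate(
    predictions: Sequence[Sequence[Hashable]], labels: Sequence[Hashable]
) -> float:
    """Fraction of samples where all models agree on the same wrong label.

    predictions[m][i] is model m's label for sample i.
    labels[i] is the ground-truth label for sample i.
    """
    bad = 0
    for i, truth in enumerate(labels):
        votes = {model_preds[i] for model_preds in predictions}
        if len(votes) == 1 and truth not in votes:
            bad += 1
    return bad / len(labels)


# Three models, four validation samples.
preds = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]
truth = [0, 0, 1, 1]
rate = unanimous_error_rate(preds, truth)
```

A high value suggests the ensemble members are too similar (for example, trained on the same data or derived from the same base model), so adding more of them is unlikely to make agreement a more reliable signal.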

Core Entities

Models

  • LLaMA 3.1
  • Gemma 2
  • Qwen 2
  • ResNet
  • ViT
  • CLIP
  • BERT
  • RoBERTa
  • XLNet
  • ELECTRA

Metrics

  • Accuracy
  • FLOPs
  • Selection rate
  • Latency
  • GPU cost ($/hour)
  • API price ($/million tokens)
  • F1

Datasets

  • ImageNet-1K
  • CIFAR-10
  • SST-2
  • Twitter Financial News
  • SWAG
  • GSM8K
  • CoQA
  • OVERRULING
  • HEADLINES