Use a small-to-large model cascade plus self-generated tests to cut code-completion cost while keeping accuracy.

May 24, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Boyuan Chen, Mingzhi Zhu, Brendan Dolan-Gavitt, Muhammad Shafique, Siddharth Garg

Why It Matters For Business

If you host code-completion services, cascading can cut inference cost (about 26% on average in the paper's experiments) while holding accuracy steady. It is a black-box add-on: it needs no access to model weights and uses a held-out validation split to pick cost-aware plans.

Summary TLDR

The authors introduce a black-box model-cascading pipeline for code completion that uses self-generated test cases to decide when to escalate from a smaller model to a larger one. On a validation split they search for Pareto-optimal plans, where each plan fixes the model, the number of solutions k, the number of test lines l, and the acceptance threshold θ; the selected plans are then deployed on the test data. Across three open-source model families and three code benchmarks, cascading cuts cost (the paper reports 26% average savings and up to 70% in the best case on the evaluated setups) while matching or improving pass@1 accuracy. The method is black-box (no model weights needed) and geared toward budget-sensitive production servers.

Problem Statement

Self-testing (models generate code and tests and pick the best passing solution) raises code accuracy but multiplies inference cost. Servers need a practical, black-box way to trade off accuracy and compute across available model sizes. The paper asks: can we cascade from cheaper models to larger ones and use self-tests to stop early while preserving accuracy and cutting cost?
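The self-testing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each generated test is a single `assert` line and that a solution's score is the fraction of test lines it passes. The helper names `passes`, `score`, and `pick_best` are hypothetical.

```python
from typing import List, Tuple


def passes(solution_src: str, test_line: str) -> bool:
    """Return True if one test line runs without error against a solution.

    WARNING: exec() runs arbitrary code. Model-generated code should only
    be executed inside a sandbox; this sketch omits that for brevity.
    """
    env: dict = {}
    try:
        exec(solution_src, env)  # define the candidate function(s)
        exec(test_line, env)     # run a single assert line
        return True
    except Exception:
        return False


def score(solution_src: str, test_lines: List[str]) -> float:
    """Fraction of test lines the solution passes (assumed scoring rule)."""
    if not test_lines:
        return 0.0
    return sum(passes(solution_src, t) for t in test_lines) / len(test_lines)


def pick_best(solutions: List[str], test_lines: List[str]) -> Tuple[str, float]:
    """Return the highest-scoring solution and its score (first wins ties)."""
    scored = [(score(s, test_lines), s) for s in solutions]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

Every call to `score` executes the candidate once per test line, which is where the extra cost of self-testing comes from: k solutions and l test lines mean more generated tokens and more executions per query.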

Main Contribution

A black-box cascading pipeline that queries models from small to large, uses self-generated tests to score candidate solutions, and escalates only when quality falls below a learned threshold.

A validation-driven search that selects Pareto-optimal (cost, accuracy) plans over parameter choices k (answers), l (test lines), and θ (acceptance threshold).

Empirical demonstration across three open-source model families and three code datasets showing large cost savings at equal or better pass@1 accuracy.
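The escalation logic in the first contribution can be sketched as a short loop. This is a hedged illustration: the `Stage` type and the `cascade` function are hypothetical names, each stage is assumed to return its best solution together with a self-test score in [0, 1], and returning the largest model's answer when no stage clears its threshold is an assumption of this sketch.

```python
from typing import Callable, List, Tuple

# A stage wraps one model: given a prompt, it returns (best_solution, score),
# where score is the self-test score of that solution. Hypothetical interface.
Stage = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, stages: List[Tuple[Stage, float]]) -> Tuple[str, int]:
    """Query stages from smallest to largest model.

    stages: list of (stage, theta) ordered small to large.
    Returns (solution, index of the stage whose answer was returned).
    """
    if not stages:
        raise ValueError("cascade needs at least one stage")
    solution = ""
    for index, (stage, theta) in enumerate(stages):
        solution, stage_score = stage(prompt)
        if stage_score >= theta:
            return solution, index  # good enough: stop early
    # No stage cleared its threshold: fall back to the last (largest) answer.
    return solution, len(stages) - 1
```

Because each stage is just a callable, the same loop works with any model behind an API, which matches the black-box framing.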

Key Findings

Cascading reduces inference cost on evaluated benchmarks, with average savings reported at 26% and up to 70% in the best case.

Numbers: avg 26% cost reduction; best-case 70% (paper abstract)

Savings vary by model family: Codegen shows ~70% avg savings on HumanEval, while WizardCoder-Python shows 17.4% on HumanEval and 30.8% on MBPP.

Numbers: Codegen: 70.0% (HumanEval); Wizard-Python: 17.4% (HumanEval), 30.8% (MBPP)

A high acceptance threshold (θ ≈ 0.9 or 1.0) is commonly optimal, but θ=0.8 produced peak accuracy (76.7%) on one setup.

Numbers: θ commonly 0.9–1.0; example best accuracy 76.7% at θ=0.8, cost $8.03 per 1k queries

Pareto plans were selected on a validation split of 30% of the available examples; the remaining 70% served as the test split, where the reported trade-offs were similar.

Numbers: validation fraction 30%, test fraction 70%

Cascade is black-box and model-agnostic: works with open-source families (Codegen-mono, WizardCoder-V1.0, WizardCoder-Python-V1.0).

Numbers: evaluated 3 families; various sizes listed (e.g., Codegen 350M–16B)

Results

Average cost saving vs random single-model self-testing

Value: avg 26% (paper overall)

Baseline: random single-model self-testing at same accuracy

Cost saving by family (HumanEval / MBPP / APPS-Intro)

Value:

  • Wizard-Python: 17.4% / 30.8% / 16.2%
  • Wizard-V1.0: 11.5% / 11.4% / -1.6%
  • Codegen: 70.0% / 39.5% / -

Baseline: random single-model self-testing

Accuracy

Value: 76.7% accuracy at θ=0.8, cost $8.03 per 1k queries

Baseline: same model family at different θ

Model greedy pass@1 examples (Wizard-Python)

Value: 7B: 56.7%; 34B: 72.6%

Baseline: greedy single-model

Who Should Care

What To Try In 7 Days

Reserve 20–30% of your dev prompts as a validation slice and compute Pareto plans over k (answers), l (test lines) and θ.

Implement a small-to-large cascade: query smallest model first, accept if top solution score ≥ θ, otherwise escalate.

Instrument cost per token on your infra (or cloud price) and compare per-query spend of cascade vs single-model baseline.
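For the first step above, the Pareto selection over candidate plans can be sketched as below. It assumes you have already measured each plan's cost and accuracy on the validation slice; the dictionary keys (`name`, `cost`, `accuracy`) and both function names are hypothetical.

```python
from typing import Dict, List, Optional

Plan = Dict[str, object]  # expects numeric "cost" and "accuracy" entries


def pareto_front(plans: List[Plan]) -> List[Plan]:
    """Keep plans that no other plan beats on both cost and accuracy.

    A plan is dropped if some other plan is at most as expensive and at
    least as accurate. The result is sorted by ascending cost.
    """
    ordered = sorted(plans, key=lambda p: (p["cost"], -p["accuracy"]))
    front: List[Plan] = []
    best_accuracy = float("-inf")
    for plan in ordered:
        if plan["accuracy"] > best_accuracy:
            front.append(plan)
            best_accuracy = plan["accuracy"]
    return front


def cheapest_plan_at_least(plans: List[Plan], target_accuracy: float) -> Optional[Plan]:
    """Cheapest Pareto plan whose validation accuracy meets the target."""
    for plan in pareto_front(plans):
        if plan["accuracy"] >= target_accuracy:
            return plan
    return None
```

A typical use is to enumerate every (model, k, l, θ) combination you can afford to evaluate, record its validation cost and accuracy, and then call `cheapest_plan_at_least` with the accuracy of your current single-model baseline.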

Optimization Features

Token Efficiency

  • cost per token measurement

Infra Optimization

  • batching to maximize GPU utilization

System Optimization

  • validation-driven Pareto selection
  • threshold-based escalation

Inference Optimization

  • model routing
  • model cascades
  • token budgeting

Reproducibility

Data Urls

  • HumanEval
  • MBPP-sanitized
  • APPS (public datasets referenced in paper)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • If a small model's accuracy is below ~10%, cascading can waste compute and hurt results; exclude such models.
  • Validation must match test difficulty; the method depends on a representative validation split.
  • Cost numbers derive from RTX 3090 timings and a $0.44/hr price; cloud pricing will change savings estimates.
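To re-derive cost figures for your own hardware, a simple conversion from measured GPU time to dollars is enough. The sketch below assumes cost scales linearly with GPU-seconds at a fixed hourly price, which is a simplification; the function name and the example timing are hypothetical, and only the $0.44/hr price comes from the text above.

```python
def cost_per_1k_queries(gpu_seconds_per_query: float, gpu_price_per_hour: float) -> float:
    """Dollar cost of 1,000 queries, assuming linear GPU-time pricing."""
    if gpu_seconds_per_query < 0 or gpu_price_per_hour < 0:
        raise ValueError("inputs must be non-negative")
    cost_per_query = gpu_seconds_per_query / 3600.0 * gpu_price_per_hour
    return cost_per_query * 1000.0


# Example with a made-up timing of 3.6 GPU-seconds per query at $0.44/hr.
example_cost = cost_per_1k_queries(3.6, 0.44)
```

Swapping in your own hourly price and measured timings shows how far your savings estimate would move from the paper's.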

When Not To Use

  • When your smallest available models have very low accuracy (<10%) on target tasks.
  • When latency constraints demand the lowest possible round-trip time (a cascade can add extra model calls before an answer is accepted).
  • If you cannot hold a representative validation set or measure per-token cost on your infra.

Failure Modes

  • False-positive test lines can cause incorrect acceptance and reduce real accuracy.
  • Overly strict θ causes frequent escalation and increases cost; overly loose θ accepts bad solutions.
  • Validation Pareto points may not generalize if distribution shifts or dataset difficulty changes.
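The cost side of the θ trade-off above can be made concrete with a two-model example. This is a simplified model, not the paper's cost accounting: it assumes a fixed per-query cost for each model, that every query pays for the small model, and that escalated queries additionally pay for the large one. The function names are hypothetical.

```python
def expected_cascade_cost(cost_small: float, cost_large: float, escalation_rate: float) -> float:
    """Expected per-query cost of a small-then-large cascade."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    return cost_small + escalation_rate * cost_large


def break_even_escalation_rate(cost_small: float, cost_large: float) -> float:
    """Escalation rate at which the cascade costs the same as always
    calling the large model: cost_small + r * cost_large == cost_large."""
    if cost_large <= 0:
        raise ValueError("cost_large must be positive")
    return 1.0 - cost_small / cost_large
```

Under these assumptions, a stricter θ raises the escalation rate, and once that rate exceeds the break-even value the cascade is more expensive than calling the large model directly.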

Core Entities

Models

  • Codegen-mono (350M, 2B, 6B, 16B)
  • WizardCoder-V1.0 (1B, 3B, 15B)
  • WizardCoder-Python-V1.0 (7B, 13B, 34B)

Metrics

  • cost per token ($/1M tokens or $/1k queries)
  • Accuracy

Datasets

  • HumanEval
  • MBPP-sanitized
  • APPS-Intro (introductory subset)

Benchmarks

  • pass@1