10 papers found

Pick subsets of open-source LLMs per query to improve quality while cutting inference cost

0.50
0.50
0.80
1

You can cut ensemble inference cost by roughly 4× while improving automatic quality, making LLM deployment cheaper and more scalable for high-throughput services.

Key finding

MODI scores higher on automatic quality metrics than prior ensembling methods on MixInstruct.

Numbers: BARTScore: MODI −2.14 vs LLM-BLENDER −2.77 (+0.63)

Route inputs by ensemble agreement to cut inference cost (2–25×) while matching or improving accuracy

0.75
0.50
0.85
0

ABC can cut real inference costs quickly by routing easy inputs to cheap models, reducing cloud, network, and API bills while keeping quality.

Key finding

ABC matches or improves accuracy over the single best model while lowering compute.

Numbers: Accuracy +12 percentage points on Pareto frontier (Figure 2).
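
A minimal sketch of the agreement-based routing idea summarized above, not the ABC paper's implementation. The function name `agreement_route`, the `min_agreement` threshold, and the use of exact-match voting among cheap models are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence

Model = Callable[[str], str]  # hypothetical interface: input text -> answer


def agreement_route(
    x: str,
    cheap_models: Sequence[Model],
    expensive_model: Model,
    min_agreement: float = 1.0,
) -> tuple[str, bool]:
    """Return (answer, escalated).

    The cheap ensemble answers first. If the share of cheap models voting
    for the most common answer reaches `min_agreement`, that answer is kept;
    otherwise the input is escalated to the expensive model.
    """
    if not cheap_models:
        return expensive_model(x), True
    votes = Counter(m(x) for m in cheap_models)
    answer, count = votes.most_common(1)[0]
    if count / len(cheap_models) >= min_agreement:
        return answer, False
    return expensive_model(x), True
```

Lowering `min_agreement` keeps more inputs on the cheap models, trading some accuracy for cost; the right value would have to be chosen on held-out data.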

Preference-conditioned bandit routing that picks the most cost-effective LLM per query

0.70
0.70
0.80
0

You can cut model invocation spend substantially (up to ~27% on benchmarks) without retraining routing logic, onboard new models with 20–50 tests, and add routing with only milliseconds of overhead.

Key finding

Routing reduces inference cost on evaluated benchmarks while keeping similar accuracy.

Numbers: Up to 27% cost reduction (e.g., MMLU) and 11% reduction in the AlpacaEval GPT4/Mixtral setting

Radial Networks: token-level routing that skips whole layers to cut compute and latency

0.60
0.70
0.70
0

Routing tokens around low-impact layers can cut average compute and latency per request, reducing serving costs and allowing larger models for the same budget.

Key finding

Per-layer residual contributions shrink as model size grows.

Numbers: OPT-125M median residual ratio ≈ 20%; OPT-66B ≈ 5.9%

Use a small RL router to pick model sizes per request and keep LLM services fast and cheap under bursty load

0.70
0.60
0.80
0

A small learned router can cut GPU costs or delay scaling while keeping user-facing LLM services responsive during bursts, increasing quality-per-GPU and availability.

Key finding

Learned router preserves availability at much higher arrival rates than serving only the large model.

Numbers: Remains available at arrival rates more than 10× higher than serving OPT-6.7B alone (stable workload).

Cut the wasted work: use a big model for intent, a small model for the call, and inject the fixed syntax.

0.75
0.75
0.80
0

HyFunc reduces latency and compute for live API-style agents, enabling faster, cheaper, and more responsive assistants without sacrificing accuracy.

Key finding

HyFunc reduces end-to-end inference latency to 0.828 seconds per case.

Numbers: Latency = 0.828s (HyFunc ♣, Table 1)

Predict problem difficulty from LLM mid-layer embeddings and route each query to the smallest model likely to solve it, cutting compute with

0.60
0.50
0.70
0

Routing saves inference cost by sending easy queries to cheaper models. That lowers cloud bills and lets you scale reasoning services while keeping top-model accuracy.

Key finding

Middle layers of a strong reasoning model carry the most signal for difficulty and correctness prediction.
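
A rough sketch of the routing step only, assuming a separate probe has already turned mid-layer embeddings into a per-model probability of solving the query. The probe itself is not shown, and `route_smallest`, `p_solve`, and `min_p` are hypothetical names, not taken from the paper.

```python
from typing import Mapping, Sequence


def route_smallest(
    p_solve: Mapping[str, float],
    models_small_to_large: Sequence[str],
    min_p: float = 0.7,
) -> str:
    """Pick the smallest model whose predicted solve probability is >= min_p.

    Falls back to the largest model when no model clears the threshold.
    """
    if not models_small_to_large:
        raise ValueError("need at least one model")
    for name in models_small_to_large:
        if p_solve[name] >= min_p:
            return name
    return models_small_to_large[-1]
```

For example, `route_smallest({"s": 0.4, "m": 0.8, "l": 0.95}, ["s", "m", "l"])` selects `"m"`, since `"s"` falls below the assumed 0.7 threshold.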

Use a small-to-large model cascade plus self-generated tests to cut code-completion cost while keeping accuracy.

0.70
0.60
0.80
0

If you host code-completion services, cascading can cut inference costs substantially while holding accuracy steady. It is a low-risk, black-box add-on that uses validation to pick cost-aware plans.

Key finding

Cascading reduces inference cost on evaluated benchmarks, with average savings reported at 26% and up to 70% in the best case.

Numbers: avg 26% cost reduction; best-case 70% (paper abstract)
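
A simplified sketch of a small-to-large cascade gated by tests, under stated assumptions: the tests here are plain callables passed in by the caller, whereas the summary describes tests generated by the model itself, and the threshold is fixed rather than chosen on validation data. All names (`cascade_with_tests`, `pass_threshold`) are illustrative.

```python
from typing import Callable, Sequence

Generate = Callable[[str], str]      # prompt -> completion
TestFn = Callable[[str], bool]       # completion -> passed?


def cascade_with_tests(
    prompt: str,
    models: Sequence[tuple[str, Generate, float]],  # (name, generate, cost), cheapest first
    tests: Sequence[TestFn],
    pass_threshold: float = 1.0,
) -> tuple[str, str, float]:
    """Return (completion, model_name, total_cost).

    Tries models from cheapest to most expensive and stops at the first
    completion whose test pass rate reaches `pass_threshold`. If none does,
    the last (largest) model's completion is returned.
    """
    if not models:
        raise ValueError("need at least one model")
    total_cost = 0.0
    completion, name = "", ""
    for name, generate, cost in models:
        completion = generate(prompt)
        total_cost += cost
        pass_rate = sum(t(completion) for t in tests) / len(tests) if tests else 0.0
        if pass_rate >= pass_threshold:
            break
    return completion, name, total_cost
```

The cost accounting makes the trade-off visible: a query that falls through to the largest model pays for every smaller attempt as well, so savings depend on how often the small models pass.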

A large, open benchmark (400K+ instances) that re-evaluates LLM routing and finds many routers match each other while leaving a big gap to a

0.70
0.40
0.70
0

Routing can improve accuracy or cut API cost, but many published routers give similar gains; practical wins come from curated model pools and simple, cheap routers. Always test routers against your Best Single baseline and measure cost/latency together.

Key finding

Models show clear complementarity: no single model dominates all tasks.

Numbers: Table 11: dataset-level bests; many datasets led by different models
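
The advice to compare a router against the Best Single baseline while tracking cost can be sketched as below. This assumes you already have per-query correctness for each model; the data layout and the function names `best_single` and `router_score` are hypothetical, not the benchmark's API.

```python
from typing import Mapping, Sequence


def best_single(correct: Mapping[str, Sequence[bool]]) -> tuple[str, float]:
    """Name and accuracy of the single model with the most correct answers."""
    name = max(correct, key=lambda m: sum(correct[m]))
    return name, sum(correct[name]) / len(correct[name])


def router_score(
    correct: Mapping[str, Sequence[bool]],
    cost: Mapping[str, float],
    choices: Sequence[str],
) -> tuple[float, float]:
    """Accuracy and average per-query cost when query i is sent to choices[i]."""
    n = len(choices)
    acc = sum(correct[m][i] for i, m in enumerate(choices)) / n
    avg_cost = sum(cost[m] for m in choices) / n
    return acc, avg_cost
```

Reporting the two numbers together matters because a router can beat Best Single on accuracy while costing more, or undercut it on cost while losing accuracy.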

Use token-level and hidden-state confidence to route queries to smaller models and cut inference cost with little accuracy loss

0.70
0.50
0.80
0

You can cut API and compute bills substantially by asking a small model if it 'knows' and how confident its token choice is, then only escalate uncertain queries to a bigger model.

Key finding

8B→70B cascade matches 70B accuracy with much less compute

Numbers: Acc: 8B→70B 83.22% vs 70B 83.57%; Reduced CC 36.46%; PD -0.35%
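
A minimal sketch of a confidence-gated cascade, assuming the small model returns its answer together with per-token log-probabilities. It uses only a token-level signal (the geometric mean of token probabilities); the hidden-state signal mentioned in the title is not modeled, and `threshold` and the function names are illustrative assumptions.

```python
import math
from typing import Callable, Sequence

SmallModel = Callable[[str], tuple[str, Sequence[float]]]  # -> (answer, token log-probs)
LargeModel = Callable[[str], str]


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities, i.e. exp(mean log-prob)."""
    if not token_logprobs:
        return 0.0  # no tokens: treat as no confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def cascade(
    query: str,
    small: SmallModel,
    large: LargeModel,
    threshold: float = 0.8,
) -> tuple[str, str]:
    """Return (answer, which_model). Escalate only when the small model is unsure."""
    answer, logprobs = small(query)
    if mean_token_confidence(logprobs) >= threshold:
        return answer, "small"
    return large(query), "large"
```

The threshold controls the trade-off reported above: raising it sends more queries to the large model, which narrows the accuracy gap but shrinks the compute saving.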