63 papers found

Use past benchmark results to learn cheap routers that pick the best LLM for a new task

0.70
0.50
0.70
8

You can often get similar or better task performance while spending less on inference by routing to smaller LLMs selected from past benchmark outputs, and you only need a few labeled examples to improve reliability.

Key finding

OOD-aware score S3 improves selection over best-model-on-average (BMA) on HELM

Numbers: S3 acc=0.694 vs BMA (llama-2-70b) acc=0.688 (Table 1 averages)

Train a cheap router to send 'easy' queries to small models and save cloud cost while keeping quality

0.70
0.60
0.70
6

Route cheap queries to local or smaller models to cut cloud inference costs while keeping user-facing quality high; thresholds let operators trade cost vs quality on demand.

Key finding

The router can route many queries to the small model and keep quality nearly unchanged.

Numbers: 22% fewer large-model calls with <1% BART score drop (Llama-2 13b vs GPT-3.5-turbo)
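
The threshold trade-off in this entry can be sketched roughly as below. This is a minimal illustration, not the paper's router: the `route` function, the per-query `difficulty` scores, and the threshold values are all hypothetical, and a real router would predict difficulty with a trained model.

```python
from typing import Sequence


def route(difficulty: float, threshold: float) -> str:
    """Send a query to the small model when its (hypothetical) predicted
    difficulty is below the operator-chosen threshold."""
    return "small" if difficulty < threshold else "large"


def large_call_fraction(difficulties: Sequence[float], threshold: float) -> float:
    """Fraction of queries that still reach the large model at this threshold."""
    decisions = [route(d, threshold) for d in difficulties]
    return decisions.count("large") / len(decisions)


# Hypothetical difficulty scores for four queries.
scores = [0.1, 0.4, 0.6, 0.9]
for t in (0.0, 0.5, 1.0):
    print(t, large_call_fraction(scores, t))
```

Raising the threshold sends more traffic to the small model, which lowers cost and puts more quality at risk; lowering it does the opposite.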

Train small router models on human preferences to halve expensive LLM calls while keeping near-GPT‑4 quality.

0.70
0.50
0.80
4

Train a cheap router on preference data to cut expensive LLM calls >2× while keeping near top-model quality; this reduces cloud bills and lets you scale high-quality features.

Key finding

Routing cuts expensive model calls and reduces cost up to 3.66× on MT Bench while preserving quality.

Numbers: Cost saving ratio MT Bench CPT(50%) = 3.66× (Table 6)

Practical training recipes for MoE LLMs: when to upcycle, and how to diversify experts

0.60
0.60
0.70
3

MoE models can cut compute per-token while keeping high capacity; pick upcycling when you have a good dense checkpoint and limited budget, otherwise invest in from-scratch MoE to maximize final quality.

Key finding

Upcycling vs scratch depends on training budget: with ~2× dense training budget, from-scratch MoE wins; with smaller budgets, upcycling helps.

Numbers: 100B tokens ≈ 2/3 C, 300B tokens ≈ 2 C; scratch outperforms upcycled at 300B budget
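
As a quick check on the arithmetic, the two budget points can be inverted to see what they imply for C. This sketch assumes C is the dense model's training budget expressed in token-equivalent units, which the listing does not state explicitly.

```python
from fractions import Fraction

# Budget points from the entry: tokens (in billions) -> multiple of C.
budget_points = {100: Fraction(2, 3), 300: Fraction(2)}

# If tokens = multiple * C, then C = tokens / multiple.
implied_c = {tokens: tokens / multiple for tokens, multiple in budget_points.items()}

for tokens, c in implied_c.items():
    print(f"{tokens}B tokens implies C of about {c}B tokens")
```

Both points agree on the same C under that assumption, so the two figures are mutually consistent.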

One-shot clustering of MLP subunits that preserves NTK to speed up fine-tuning of dense and MoE models

0.70
0.60
0.70
2

MLP Fusion reduces GPU memory and fine-tuning time while preserving training dynamics and near-original accuracy, making low-cost SFT and smaller deployed models feasible for companies running many custom fine-tunes or deploying large MoE models.

Key finding

MLP Fusion yields the lowest NTK approximation error among tested one-shot methods.

Numbers: NTK error on SST2 (RoBERTa first layer): 2826.6 ±155.1 vs SVD 4423.4

Swap transformer blocks for small ordered 'learners' and run only as many as each token needs to cut inference cost with minimal accuracy, n

0.60
0.60
0.70
2

ACMs let you reduce inference compute and GPU latency while retaining model accuracy, enabling cheaper, faster deployment of pretrained transformers in latency- or energy-constrained settings.

Key finding

ACMized ViT-B achieves the Pareto frontier of FLOPs vs accuracy on ImageNet-1k.

Numbers: Advantage especially below 12.5 GFLOPs (Fig.3)

MetaLLM: route each query to the cheapest LLM likely to be correct, cutting cost up to 60% while keeping or improving accuracy

0.70
0.60
0.80
2

MetaLLM reduces API spend by routing easy queries to cheaper models and hard queries to stronger ones, giving modest accuracy gains and up to ~60% cost reductions versus always using the priciest API.

Key finding

MetaLLM can outperform a mid-tier baseline (text-babbage-001) while keeping the same budget.

Numbers: SST-2: 84.06% vs 82.80% (text-babbage); cost 0.12 per 10k
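
The idea of picking "the cheapest LLM likely to be correct" can be illustrated with a simple selection rule. This is a sketch under stated assumptions, not MetaLLM's actual method: the `Candidate` type, the costs, the predicted success probabilities, and the `min_p` cutoff are all hypothetical.

```python
from typing import NamedTuple, Sequence


class Candidate(NamedTuple):
    name: str
    cost: float        # hypothetical cost per query
    p_correct: float   # hypothetical predicted probability of a correct answer


def pick_model(candidates: Sequence[Candidate], min_p: float = 0.8) -> Candidate:
    """Return the cheapest candidate whose predicted success meets min_p.
    If none qualifies, fall back to the candidate most likely to be correct."""
    qualified = [c for c in candidates if c.p_correct >= min_p]
    if qualified:
        return min(qualified, key=lambda c: c.cost)
    return max(candidates, key=lambda c: c.p_correct)


pool = [
    Candidate("small", 1.0, 0.60),
    Candidate("mid", 3.0, 0.85),
    Candidate("large", 10.0, 0.95),
]
print(pick_model(pool).name)
```

In practice the success probabilities would come from a learned predictor, and the cutoff would be tuned against a budget.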

Train once, assemble many: flexible compression for seq2seq Transformers

0.70
0.70
0.70
1

Train once and ship one compact artifact that can be reconfigured at inference for different memory or latency targets, saving repeated retraining and storage costs while maintaining task quality.

Key finding

A single Modular Transformers training run supports flexible compression from small to large ratios.

Numbers: claimed flexible compression ratios 1.1×–6× (abstract)

Pick subsets of open-source LLMs per query to improve quality while cutting inference cost

0.50
0.50
0.80
1

You can cut ensemble inference cost by roughly 4× while improving automatic quality, making LLM deployment cheaper and more scalable for high-throughput services.

Key finding

MODI achieves higher automatic quality scores than prior ensembling on MixInstruct.

Numbers: BARTScore: MODI −2.14 vs LLM-BLENDER −2.77 (+0.63)

Automatically pick a cheapest mix of GPU types for an LLM service using profiling + an ILP bin-packing solver

0.65
0.50
0.80
1

Picking the right mix of GPU types can cut cloud GPU costs up to ~77% for conversational LLMs while keeping latency targets, lowering monthly infrastructure bills without modifying models or inference logic.

Key finding

GPU cost efficiency (tokens per dollar) varies with request size; no single GPU is best for all sizes.

Numbers: A10G up to 2.6× T/$ over A100 for small requests; A100 up to 1.5× for large requests
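
The tokens-per-dollar comparison behind this finding can be sketched as follows. The GPU names, throughputs, and hourly prices below are made-up placeholders rather than measurements from the paper, and the helper names are hypothetical.

```python
from typing import Dict, Tuple


def tokens_per_dollar(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Tokens generated per dollar of rental cost at a steady throughput."""
    return tokens_per_second * 3600.0 / dollars_per_hour


def best_gpu(profile: Dict[str, Tuple[float, float]]) -> str:
    """Pick the GPU with the highest tokens per dollar.
    profile maps GPU name -> (tokens_per_second, dollars_per_hour)."""
    return max(profile, key=lambda gpu: tokens_per_dollar(*profile[gpu]))


# Hypothetical profiling results per request-size bucket.
PROFILES = {
    "small_requests": {"gpu_a": (900.0, 1.0), "gpu_b": (1800.0, 4.0)},
    "large_requests": {"gpu_a": (200.0, 1.0), "gpu_b": (1200.0, 4.0)},
}

for bucket, profile in PROFILES.items():
    print(bucket, best_gpu(profile))
```

With these placeholder numbers the cheaper GPU wins for small requests and the larger GPU wins for large ones, which is the shape of result the entry reports.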

Split FFNs into sparse experts + a teacher-guided router to cut FLOPs and adapt LLMs with tiny data

0.50
0.60
0.60
1

FactorLLM can cut FFN compute and lower inference costs significantly while enabling fast domain adaptation with tiny datasets, enabling cheaper, faster deployment for task-specific LLMs.

Key finding

Large FFN FLOPs can be cut heavily by activating fewer experts.

Numbers: FFN GFLOPs reduced ~75% for 1R4E1K
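
A rough way to see where a ~75% figure could come from: if the FFN is split into equally sized experts and only some are active per token, the FFN compute saved is one minus the active fraction. Reading "1R4E1K" as one active expert out of four is an assumption here (the listing does not define the label), and the sketch ignores router overhead.

```python
def ffn_flops_reduction(active_experts: int, total_experts: int) -> float:
    """Fraction of FFN FLOPs saved when only `active_experts` of
    `total_experts` equally sized experts run per token (router cost ignored)."""
    if not 0 < active_experts <= total_experts:
        raise ValueError("active_experts must be between 1 and total_experts")
    return 1.0 - active_experts / total_experts


# Assumed configuration: 1 of 4 experts active per token.
print(ffn_flops_reduction(1, 4))
```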

SelectLLM routes each query to a small subset of LLMs to keep accuracy high while cutting inference latency.

0.60
0.50
0.70
1

You can get close-to-ensemble accuracy while calling far fewer models per query, which reduces GPU time and latency and cuts inference costs for reasoning-heavy applications.

Key finding

SELECTLLM (WEIGHTEDMAXCONF) improves accuracy vs. All-LLMs ensembles on two reasoning benchmarks.

Numbers: GSM8K: 76.04 → 77.94 (+1.90); MMLU: 60.92 → 65.81 (+4.89)

UNCURL: cluster-and-merge pruning for Mixture-of-Experts that cuts experts at inference while keeping task accuracy

0.60
0.60
0.70
1

If you plan to deploy SMoE models on multi-GPU setups, pretraining with many experts can improve accuracy but raises inference latency and memory costs; UNCURL gives a practical offline path to shrink experts for tasks while often keeping accuracy.

Key finding

Naïve one-shot pruning by expert activation frequency hurts performance across tasks.

Numbers: Pruned 354M+(32e→8e) scores lower than 354M+32e on many tasks (Tables 1 and 2)

RouterBench: dataset + math to measure routing choices that trade cost vs. quality across many LLMs

0.60
0.40
0.80
1

Routing can cut serving bills and keep or improve quality by choosing cheaper models per input; RouterBench gives a practical way to measure those trade-offs offline.

Key finding

ROUTERBENCH contains 405,467 labeled LLM outputs across multiple tasks and models.

Numbers: 405,467 samples; 11+ models; 8 datasets

Pick cheaper or stronger solvers per question to cut inference cost while keeping or improving reasoning accuracy.

0.70
0.60
0.80
1

Adaptive per-question solver selection cuts cloud API bills and lets teams trade a small latency increase for big cost or accuracy gains on reasoning workloads.

Key finding

The Adaptive-Solver can cut API costs by a large margin while keeping GPT-4-level accuracy.

Numbers: 46%–85% cost reduction vs GPT-4

Preference-conditioned bandit routing that picks the most cost-effective LLM per query

0.70
0.70
0.80
0

You can cut model invocation spend substantially (up to ~27% on benchmarks) without retraining routing logic, onboard new models with 20–50 tests, and add routing with only milliseconds of overhead.

Key finding

Routing reduces inference cost on evaluated benchmarks while keeping similar accuracy.

Numbers: Up to 27% cost reduction (e.g., MMLU) and 11% reduction on AlpacaEval GPT4/Mixtral setting

Radial Networks: token-level routing that skips whole layers to cut compute and latency

0.60
0.70
0.70
0

Routing tokens around low-impact layers can cut average compute and latency per request, reducing serving costs and allowing larger models for the same budget.

Key finding

Per-layer residual contributions shrink as model size grows.

Numbers: OPT-125M median residual ratio ≈ 20%; OPT-66B ≈ 5.9%

CMoE: turn dense FFNs into MoE in minutes to get ~1.4–1.6× end-to-end speedups

0.70
0.60
0.70
0

CMoE lets teams cut LLM inference cost quickly by turning FFNs sparse without long retraining, enabling faster deployment and cheaper serving while allowing small targeted fine-tunes to regain quality.

Key finding

Training-free CMoE is usable immediately: 25% activation gives practical perplexity without training.

Numbers: WikiText-2 PPL: Dense 5.27, CMoE 25% TF 62.30; CMoE 25% FT 12.73 (Table 1)

Train one hybrid reasoning model, get many deployable sizes for free

0.70
0.60
0.80
0

You can train one large reasoning model and ship multiple quality/latency variants without per‑size retraining. That cuts token costs and storage needs, simplifies model ops, and makes offering multiple service tiers cheaper.

Key finding

Derive 6B and 9B models from a single 12B run using 110B training tokens.

Numbers: 110B tokens total (Table 2)

Use a small RL router to pick model sizes per request and keep LLM services fast and cheap under bursty load

0.70
0.60
0.80
0

A small learned router can cut GPU costs or delay scaling while keeping user-facing LLM services responsive during bursts, increasing quality-per-GPU and availability.

Key finding

Learned router preserves availability at much higher arrival rates than serving only the large model.

Numbers: Remains available for >10× faster arrival rates than OPT-6.7B (stable workload).

Mix GPU types, tune deployments, and route workloads to cut LLM serving cost and boost throughput

0.70
0.50
0.80
0

Mixing GPU types and jointly optimizing how models are deployed and routed can process more requests or cut tail latency for the same hourly cloud spend, making LLM products cheaper and more scalable.

Key finding

Picking the right mix of GPU types improves cost-efficiency versus a homogeneous fleet.

Numbers: up to 2.27× improvement in throughput-per-cost (benchmarking)

Make large Mixture-of-Experts models run faster on edge GPUs by prefetching experts using adjacent-layer gate inputs

0.60
0.60
0.70
0

Fate cuts MoE inference latency on edge-class GPUs by combining cross-layer prefetch, targeted caching, and hybrid INT2/INT4 transfers—enabling richer, privacy-friendly on-device LLM features with small accuracy trade-offs.

Key finding

Cross-layer prefetch achieves very high prediction accuracy without retraining.

Numbers: 97.15% prefetch accuracy (by transferring experts above 75th-confidence percentile)

ShardMemo: budgeted, scope-correct sharded memory using masked MoE routing

0.70
0.60
0.70
0

ShardMemo reduces retrieval cost and tail latency while improving accuracy for agent workflows, making LLM-based agents faster and more reliable under budgeted memory access.

Key finding

ShardMemo improves LoCoMo QA F1 over the strongest baseline (GAM).

Numbers: Single-hop F1 64.08 vs 58.38 (+5.70) (Table 1)

MMR-Bench: measure and optimize per-query model selection for multimodal LLMs under cost budgets

0.70
0.60
0.85
0

Routing multimodal queries to different models by complexity cuts average inference cost substantially while preserving top-model accuracy, letting teams save compute or serve more queries under fixed budgets.

Key finding

Multimodal routing matches or exceeds the accuracy of the strongest single model at roughly one-third of its cost on evaluated benchmarks.

Numbers: Routed system ≈ same accuracy as best single model at ~33% cost (Sec.5.3, Fig.4)