Pick subsets of open-source LLMs per query to improve quality while cutting inference cost

Overview

Decision Snapshot: Needs Validation

Good practical idea with empirical gains on MixInstruct; evidence is limited to one dataset and an automatic metric, and no code release is provided.

Citations: 1

Evidence Strength: 0.50

Confidence: 0.70

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 50%

Novelty: 50%

Authors

Aditi Singla, Aditya Singh, Kanishk Kukreja


Why It Matters For Business

You can cut ensemble inference cost to roughly one fifth of LLM-BLENDER's (about a 5× reduction) while improving automatic quality on MixInstruct, making LLM deployment cheaper and more scalable for high-throughput services.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The paper introduces MODI, a practical pipeline that picks a cost-aware subset of open-source LLMs per query. It frames quality vs cost as a bi-objective problem, converts it via an ε budget constraint into a 0/1 knapsack, and uses a DeBERTa regressor to predict per-model quality. On the MixInstruct benchmark, MODI improves automatic quality (BARTScore −2.14, where higher is better) over the prior ensembling method LLM-BLENDER (−2.77) while using about 20% of LLM-BLENDER's FLOP cost.

Problem Statement

Naive ensembling of multiple LLMs raises quality but makes inference expensive and slow. The paper asks: how to select a small subset of diverse open-source models per query to maximize response quality under a user budget?

Main Contribution

Formulate LLM ensembling as a bi-objective quality-vs-cost combinatorial problem.

Apply an ε-constraint to reduce the bi-objective problem to a 0/1 knapsack solved by dynamic programming.
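
The two contributions above can be written out as a short derivation. This is a sketch under assumed notation, not the paper's own: $x_i$ marks whether model $i$ is selected, $\hat{q}_i$ is its predicted quality for the current query, $c_i$ is its cost, and $V(i, b)$ is the dynamic-programming table. Treating subset quality as a sum of per-model predictions is also an assumption made here for illustration.

```latex
% Assumed notation (not from the source): x_i, \hat{q}_i, c_i, N, V(i, b)
\begin{align*}
\text{Bi-objective:}\quad
  & \max_{x \in \{0,1\}^N} \sum_{i=1}^{N} \hat{q}_i x_i
    \quad\text{and}\quad
    \min_{x \in \{0,1\}^N} \sum_{i=1}^{N} c_i x_i \\
\text{$\varepsilon$-constraint:}\quad
  & \max_{x \in \{0,1\}^N} \sum_{i=1}^{N} \hat{q}_i x_i
    \quad\text{s.t.}\quad \sum_{i=1}^{N} c_i x_i \le \varepsilon \\
\text{DP recurrence:}\quad
  & V(i, b) =
    \begin{cases}
      V(i-1, b) & \text{if } c_i > b \\
      \max\bigl( V(i-1, b),\; V(i-1, b - c_i) + \hat{q}_i \bigr) & \text{otherwise}
    \end{cases} \\
  & V(0, b) = 0 \quad \text{for all } 0 \le b \le \varepsilon
\end{align*}
```

With integer costs, $V(N, \varepsilon)$ gives the best achievable predicted quality under the budget, and the selected subset is recovered by backtracking through the table.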

Key Findings

MODI achieves higher automatic quality than prior ensembling on MixInstruct.

Numbers: BARTScore: MODI −2.14 vs LLM-BLENDER −2.77 (Δ +0.63)

Practical Use: You can get measurable quality gains on MixInstruct-style tasks by selecting models per query instead of always fusing all models.

Evidence Ref: Table 1

MODI reduces inference FLOP cost substantially compared to LLM-BLENDER.

Numbers: MODI uses about 20% of LLM-BLENDER's cost

Practical Use: If inference cost limits deployment, applying MODI's budgeted selection can cut costs roughly 5× (to about 20% of LLM-BLENDER's cost) while maintaining or improving quality.

Evidence Ref: Table 1, Section 2.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BARTScore	MODI −2.14	LLM-BLENDER −2.77	+0.63	MixInstruct (test)	Table 1 reports BARTScore values for MODI and baselines	Table 1
Inference cost (relative)	MODI ≈ 20% of LLM-BLENDER cost	LLM-BLENDER = 100%	≈ 80% reduction	MixInstruct (per-query FLOP budget)	Table 1 and Section 2.3 state MODI runs at ~20% cost	Table 1, Section 2.3
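
The deltas in the table follow from simple arithmetic on the reported figures. The sketch below recomputes them; the function name `summarize_gains` and its return keys are hypothetical, and the inputs are the values quoted in the table.

```python
def summarize_gains(method_score: float, baseline_score: float,
                    relative_cost: float) -> dict:
    """Derive the table's delta columns from the reported raw numbers.

    relative_cost is the method's cost as a fraction of the baseline's
    (0.20 means "20% of the baseline cost").
    """
    return {
        # BARTScore is a log-likelihood, so a less negative value is better.
        "score_delta": round(method_score - baseline_score, 2),
        # Fraction of the baseline cost that is saved.
        "cost_reduction": round(1.0 - relative_cost, 2),
        # How many times cheaper the method is than the baseline.
        "cost_factor": round(1.0 / relative_cost, 2),
    }


# Values quoted in the Results table (MODI vs LLM-BLENDER on MixInstruct).
gains = summarize_gains(method_score=-2.14, baseline_score=-2.77,
                        relative_cost=0.20)
print(gains)
```

Running at 20% of the baseline cost corresponds to an 80% reduction, which is a factor of 5 rather than 4.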

What To Try In 7 Days

Collect 5–10k representative queries and score candidate model outputs with BARTScore.

Train a small DeBERTa regressor to predict per-model quality from queries.

Implement a per-query knapsack solver that selects models under a FLOP budget and fuse outputs with an existing generator.
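
The selection step in the list above can be prototyped as follows. This is a minimal sketch, not the authors' implementation: the function name `select_models`, the model names, and the cost units are hypothetical. It assumes costs are positive integers (for example, FLOPs rounded to a coarse unit) and that predicted qualities have been shifted to non-negative utilities, since raw BARTScore values are negative and a knapsack would otherwise select nothing.

```python
from typing import Dict, List, Tuple


def select_models(utility: Dict[str, float], cost: Dict[str, int],
                  budget: int) -> Tuple[List[str], float]:
    """0/1 knapsack by dynamic programming.

    Maximizes the summed (non-negative) utility of the chosen models
    subject to their summed integer cost staying within `budget`.
    """
    names = list(utility)
    n = len(names)
    # best[i][b] = best total utility using the first i models with budget b.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, name in enumerate(names, start=1):
        c, q = cost[name], utility[name]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]          # skip model i
            if c <= b and best[i - 1][b - c] + q > best[i][b]:
                best[i][b] = best[i - 1][b - c] + q  # take model i
    # Backtrack to recover which models were taken.
    chosen: List[str] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(names[i - 1])
            b -= cost[names[i - 1]]
    chosen.reverse()
    return chosen, best[n][budget]


# Hypothetical per-query predictions and costs (arbitrary integer units).
utility = {"model_a": 0.6, "model_b": 0.5, "model_c": 0.4}
cost = {"model_a": 13, "model_b": 7, "model_c": 7}
print(select_models(utility, cost, budget=14))
```

In this toy case the two cheaper models together beat the single strongest model under the same budget, which is the behavior per-query selection is meant to exploit. The selected models' outputs would then be passed to the fusion step.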

Agent Features

Tool Use

DeBERTa regression for quality prediction

GEN-FUSER for response fusion

Knapsack DP for selection

Frameworks

DeBERTa

GEN-FUSER

Optimization Features

Token Efficiency

budget defined in FLOPs/token per query

Infra Optimization

cuts total FLOPs by selecting fewer models

Model Optimization

select subset of models per query

System Optimization

regressor guides selection to trade quality and cost

Training Optimization

train DeBERTa regressor on 10k examples

Inference Optimization

budgeted model selection via 0/1 knapsack

reduce number of model invocations per query

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Evaluated only on MixInstruct and BARTScore, not on human judgments.

No code release, so engineering effort is needed to reproduce pipeline.

When Not To Use

When you need human-rated quality as the primary objective.

When your query distribution differs strongly from MixInstruct.

Failure Modes

Regressor mispredictions select low-quality models and hurt final output.

Too-small budgets exclude helpful models, lowering quality.
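
The budget failure mode can be checked before deployment by sweeping the budget and inspecting which models survive. The sketch below uses exhaustive search over subsets rather than the knapsack DP, which is only practical for a handful of candidates; the function name `best_subset`, the model names, utilities, and costs are all hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def best_subset(utility: Dict[str, float], cost: Dict[str, int],
                budget: int) -> Tuple[str, ...]:
    """Exhaustively pick the affordable subset with the highest total utility.

    Returns an empty tuple when no single model fits the budget.
    """
    names = sorted(utility)
    best: Tuple[str, ...] = ()
    best_value = 0.0
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            if sum(cost[m] for m in subset) > budget:
                continue  # over budget, not a candidate
            value = sum(utility[m] for m in subset)
            if value > best_value:
                best, best_value = subset, value
    return best


# Hypothetical utilities and integer costs for three candidate models.
utility = {"model_a": 0.6, "model_b": 0.5, "model_c": 0.4}
cost = {"model_a": 13, "model_b": 7, "model_c": 7}

for budget in (5, 7, 14, 27):
    print(budget, best_subset(utility, cost, budget))
```

A budget below the cheapest model's cost selects nothing at all, so a deployment would need some fallback for that case, and the strongest single model only enters once the budget is large enough to afford it.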

Core Entities

Models

alpaca-native

vicuna-13b-1.1

dolly-v2-12b

stablelm-tuned-alpha-7b

SFT

koala-7B-HF

flan-t5-xxl

mpt-7b-instruct

Metrics

BARTScore

Datasets

MixInstruct

Context Entities

Models

LLM-BLENDER
