Pick subsets of open-source LLMs per query to improve quality while cutting inference cost

December 26, 2023 · 5 min

Overview

Decision Snapshot: Needs Validation

Good practical idea with empirical gains on MixInstruct; evidence is limited to one dataset and an automatic metric, and no code release is provided.

Citations: 1

Evidence Strength: 0.50

Confidence: 0.70

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 50%

Novelty: 50%

Authors

Aditi Singla, Aditya Singh, Kanishk Kukreja

Links

Abstract / PDF

Why It Matters For Business

You can cut ensemble inference cost to about 20% of LLM-BLENDER's (roughly 5×) while improving automatic quality as measured by BARTScore, making LLM deployment cheaper and more scalable for high-throughput services.

Summary TLDR

The paper introduces MODI, a practical pipeline that picks a cost-aware subset of open-source LLMs per query. It frames quality vs cost as a bi-objective problem, converts it via an ε budget constraint into a 0/1 knapsack, and uses a DeBERTa regressor to predict per-model quality. On the MixInstruct benchmark, MODI improves automatic quality (BARTScore −2.14) vs prior ensembling (−2.77) while using about 20% of the prior method's FLOP cost.
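The framing above can be written compactly. The notation here is assumed for illustration and is not taken from the paper: $x_i \in \{0,1\}$ marks whether model $i$ is selected for query $q$, $\hat{s}_i(q)$ is the regressor's predicted quality for that model, $c_i$ is its cost in FLOPs, and $\varepsilon$ is the user budget.

```latex
\begin{aligned}
\text{Bi-objective:}\quad & \max_{x \in \{0,1\}^n} \sum_{i=1}^{n} \hat{s}_i(q)\, x_i
\quad\text{and}\quad \min_{x \in \{0,1\}^n} \sum_{i=1}^{n} c_i\, x_i \\
\text{$\varepsilon$-constraint:}\quad & \max_{x \in \{0,1\}^n} \sum_{i=1}^{n} \hat{s}_i(q)\, x_i
\quad\text{s.t.}\quad \sum_{i=1}^{n} c_i\, x_i \le \varepsilon
\end{aligned}
```

The second line is a 0/1 knapsack with values $\hat{s}_i(q)$, weights $c_i$, and capacity $\varepsilon$. Treating subset quality as a sum of per-model scores is what makes the knapsack form apply; whether the paper uses exactly this additive objective is an assumption here.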

Problem Statement

Naive ensembling of multiple LLMs raises quality but makes inference expensive and slow. The paper asks: how to select a small subset of diverse open-source models per query to maximize response quality under a user budget?

Main Contribution

Formulate LLM ensembling as a bi-objective quality-vs-cost combinatorial problem.

Apply an ε-constraint to reduce the bi-objective problem to a 0/1 knapsack solved by dynamic programming.
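The knapsack step can be sketched with a textbook dynamic program. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_models` is a hypothetical helper, costs are assumed to be rounded to integers (for example, parameter counts in billions as a rough FLOPs/token proxy), and utilities are assumed non-negative. Raw BARTScore values are negative, so they would need shifting or rescaling before use.

```python
from typing import List, Sequence, Tuple


def select_models(
    values: Sequence[float], costs: Sequence[int], budget: int
) -> Tuple[float, List[int]]:
    """0/1 knapsack by dynamic programming.

    values[i]: predicted utility of model i (assumed non-negative)
    costs[i]:  integer cost of model i (assumed pre-rounded to a coarse unit)
    budget:    integer cost cap (the epsilon of the constraint)
    Returns (best total value, sorted indices of the chosen models).
    """
    n = len(values)
    # best[i][b] = best value using the first i models with total cost <= b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, c = values[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip model i-1
            if c <= b and best[i - 1][b - c] + v > best[i][b]:
                best[i][b] = best[i - 1][b - c] + v  # option 2: take it

    # Walk back through the table to recover which models were taken.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return best[n][budget], sorted(chosen)


# Hypothetical numbers: one 13B model and two 7B models, budget of 14 units.
total, picked = select_models([0.9, 0.6, 0.5], [13, 7, 7], budget=14)
print(picked)  # [1, 2]
```

In this toy case the two smaller models together beat the single larger one within the same budget. The table has `(n + 1) * (budget + 1)` entries, so the cost unit should be coarse enough to keep `budget` small.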

Key Findings

MODI achieves higher automatic-quality than prior ensembling on MixInstruct.

Numbers: BARTScore, MODI −2.14 vs LLM-BLENDER −2.77 (+0.63)

Practical Use: You can get measurable quality gains on MixInstruct-style tasks by selecting models per query instead of always fusing all models.

Evidence Ref: Table 1

MODI reduces inference FLOP cost substantially compared to LLM-BLENDER.

Numbers: MODI uses about 20% of LLM-BLENDER's cost

Practical Use: If inference cost limits deployment, applying MODI's budgeted selection can cut costs roughly 5× (to about 20% of LLM-BLENDER's cost) while maintaining or improving quality.

Evidence Ref: Table 1, Section 2.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
BARTScore | MODI −2.14 | LLM-BLENDER −2.77 | +0.63 | MixInstruct (test) | Table 1 reports BARTScore values for MODI and baselines | Table 1
Inference cost (relative) | MODI ≈ 20% of LLM-BLENDER cost | LLM-BLENDER = 100% | ≈ 80% reduction | MixInstruct (per-query FLOP budget) | Table 1 and Section 2.3 state MODI runs at ~20% cost | Table 1, Section 2.3
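The two deltas in the table follow directly from the reported values. BARTScore is a log-likelihood-based score, so a less negative value is better:

```latex
\begin{aligned}
\Delta_{\text{BARTScore}} &= (-2.14) - (-2.77) = +0.63 \\
\text{cost reduction} &= 1 - 0.20 = 0.80 \quad (80\%) \\
\text{cost ratio} &= \frac{1}{0.20} = 5\times
\end{aligned}
```

The 20% figure is approximate in the source, so the ratio is roughly 5× rather than exact.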

What To Try In 7 Days

Collect 5–10k representative queries and score candidate model outputs with BARTScore.

Train a small DeBERTa regressor to predict per-model quality from queries.

Implement a per-query knapsack solver that selects models under a FLOP budget and fuse outputs with an existing generator.
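The first two steps amount to building a supervised dataset: each query becomes one training row whose target is a vector of per-model quality scores. A minimal sketch of that bookkeeping follows. The model names, the `build_regression_targets` helper, and the example scores are hypothetical, and the scores are assumed to have been computed beforehand (for example with BARTScore against reference answers).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical model identifiers; replace with your own candidate pool.
MODELS = ["model_a", "model_b", "model_c"]


def build_regression_targets(
    scored: Iterable[Tuple[str, str, float]], models: Sequence[str]
) -> List[Tuple[str, List[float]]]:
    """Group (query, model, score) rows into one target vector per query.

    Queries that lack a score for any model are dropped, so every target
    vector has exactly len(models) entries in a fixed order.
    """
    by_query: Dict[str, Dict[str, float]] = defaultdict(dict)
    for query, model, score in scored:
        by_query[query][model] = score

    rows: List[Tuple[str, List[float]]] = []
    for query, scores in by_query.items():
        if all(m in scores for m in models):
            rows.append((query, [scores[m] for m in models]))
    return rows


# Made-up scores for illustration; "q2" is incomplete and gets dropped.
scored = [
    ("q1", "model_a", -2.1),
    ("q1", "model_b", -2.5),
    ("q1", "model_c", -3.0),
    ("q2", "model_a", -1.9),
    ("q2", "model_b", -2.2),
]
rows = build_regression_targets(scored, MODELS)
print(rows)  # [('q1', [-2.1, -2.5, -3.0])]
```

Each row can then be fed to a DeBERTa encoder with a regression head that has one output per candidate model; the training setup itself is left out here.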

Agent Features

Tool Use
DeBERTa regression for quality prediction; GEN-FUSER for response fusion; Knapsack DP for selection
Frameworks
DeBERTa; GEN-FUSER

Optimization Features

Token Efficiency
budget defined in FLOPs/token per query
Infra Optimization
cuts total FLOPs by selecting fewer models
Model Optimization
select subset of models per query
System Optimization
regressor guides selection to trade quality and cost
Training Optimization
train DeBERTa regressor on 10k examples
Inference Optimization
budgeted model selection via 0/1 knapsack; reduce number of model invocations per query

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluated only on MixInstruct, using BARTScore as the sole quality metric; no human judgments are reported.

No code release, so engineering effort is needed to reproduce the pipeline.

When Not To Use

When you need human-rated quality as the primary objective.

When your query distribution differs strongly from MixInstruct.

Failure Modes

Regressor mispredictions select low-quality models and hurt final output.

Too-small budgets exclude helpful models, lowering quality.
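Both failure modes can be reproduced on a toy pool. The sketch below uses exhaustive search, which is only practical for a small number of models, in place of the knapsack DP. All numbers and the `best_subset` and `realized_quality` helpers are hypothetical, and utilities are assumed non-negative and additive.

```python
from itertools import combinations
from typing import Sequence, Tuple


def best_subset(
    values: Sequence[float], costs: Sequence[float], budget: float
) -> Tuple[int, ...]:
    """Exhaustively pick the subset with the highest summed value within budget.

    Enumerates all non-empty subsets (2**n - 1 of them); returns () if
    no model fits the budget.
    """
    best_value, best_set = 0.0, ()
    for r in range(1, len(values) + 1):
        for subset in combinations(range(len(values)), r):
            if sum(costs[i] for i in subset) > budget:
                continue
            value = sum(values[i] for i in subset)
            if value > best_value:
                best_value, best_set = value, subset
    return best_set


def realized_quality(subset: Sequence[int], true_values: Sequence[float]) -> float:
    """Utility actually obtained from a selection, under the additive assumption."""
    return sum(true_values[i] for i in subset)


# Hypothetical numbers, for illustration only.
true_values = [0.9, 0.6, 0.5]  # utility each model would actually deliver
predicted = [0.4, 0.6, 0.5]    # regressor underrates model 0
costs = [13, 7, 7]

# Failure mode 1: a misprediction steers selection away from the best model.
picked = best_subset(predicted, costs, budget=13)
oracle = best_subset(true_values, costs, budget=13)
regret = realized_quality(oracle, true_values) - realized_quality(picked, true_values)

# Failure mode 2: shrinking the budget removes models from the feasible set.
sweep = {b: best_subset(true_values, costs, b) for b in (5, 7, 13, 14, 27)}
```

With these numbers, the predicted scores select model 1 alone while the oracle picks model 0, and a budget of 5 selects nothing at all. Sweeping the budget this way on held-out queries is one way to see where quality starts to drop.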

Core Entities

Models

alpaca-native, vicuna-13b-1.1, dolly-v2-12b, stablelm-tuned-alpha-7b, SFT, koala-7B-HF, flan-t5-xxl, mpt-7b-instruct

Metrics

BARTScore

Datasets

MixInstruct

Context Entities

Models

LLM-BLENDER