Overview
Good practical idea with empirical gains on MixInstruct; evidence is limited to one dataset and an automatic metric, and no code release is provided.
Citations: 1
Evidence Strength: 0.50
Confidence: 0.70
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 2/3
Findings with evidence refs: 3/3
Results with explicit delta: 2/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 50%
Novelty: 50%
Why It Matters For Business
You can cut ensemble inference cost by roughly 5× (to about 20% of LLM-BLENDER's FLOP cost) while improving automatic quality, making LLM deployment cheaper and more scalable for high-throughput services.
Summary TLDR
The paper introduces MODI, a practical pipeline that picks a cost-aware subset of open-source LLMs per query. It frames quality vs cost as a bi-objective problem, converts it via an ε budget constraint into a 0/1 knapsack, and uses a DeBERTa regressor to predict per-model quality. On the MixInstruct benchmark, MODI improves automatic quality (BARTScore −2.14) vs prior ensembling (−2.77) while using about 20% of the prior method's FLOP cost.
Problem Statement
Naive ensembling of multiple LLMs raises quality but makes inference expensive and slow. The paper asks: how to select a small subset of diverse open-source models per query to maximize response quality under a user budget?
Main Contribution
Formulate LLM ensembling as a bi-objective quality-vs-cost combinatorial problem.
Apply an ε-constraint to reduce the bi-objective problem to a 0/1 knapsack solved by dynamic programming.
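The reduction described above can be written out as a short sketch. The notation is assumed for illustration and is not taken from the paper: $x_i \in \{0,1\}$ marks whether model $i$ is selected for query $p$, $\hat{q}_i(p)$ is the regressor's predicted quality, $c_i$ is the model's cost, and $\varepsilon$ is the user budget. Treating total quality as a sum over selected models is also an assumption made so that the problem takes the standard knapsack form.

```latex
% Bi-objective form: maximize quality, minimize cost
\max_{x \in \{0,1\}^n} \; \sum_{i=1}^{n} \hat{q}_i(p)\, x_i
\qquad\text{and}\qquad
\min_{x \in \{0,1\}^n} \; \sum_{i=1}^{n} c_i\, x_i

% \varepsilon-constraint form: cost becomes a budget constraint (0/1 knapsack)
\max_{x \in \{0,1\}^n} \; \sum_{i=1}^{n} \hat{q}_i(p)\, x_i
\quad\text{s.t.}\quad \sum_{i=1}^{n} c_i\, x_i \le \varepsilon

% Dynamic-programming recurrence over the first i models and remaining budget b
V(i, b) = \max\bigl\{\, V(i-1,\, b),\;\; V(i-1,\, b - c_i) + \hat{q}_i(p) \,\bigr\},
\qquad V(0, b) = 0
```

The second branch of the recurrence applies only when $c_i \le b$; with integer cost units, the table has $n \cdot (\varepsilon + 1)$ entries.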
Key Findings
MODI achieves higher automatic quality than prior ensembling on MixInstruct (BARTScore −2.14 vs −2.77 for LLM-BLENDER).
MODI cuts inference FLOP cost to about 20% of LLM-BLENDER's, roughly an 80% reduction.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| BARTScore | MODI −2.14 | LLM-BLENDER −2.77 | +0.63 | MixInstruct (test) | Table 1 reports BARTScore values for MODI and baselines | Table 1 |
| Inference cost (relative) | MODI ≈ 20% of LLM-BLENDER cost | LLM-BLENDER = 100% | ≈ 80% reduction | MixInstruct (per-query FLOP budget) | Table 1 and Section 2.3 state MODI runs at ~20% cost | Table 1, Section 2.3 |
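The deltas in the table can be checked with a few lines of arithmetic. The values are the ones reported above; the variable names are only illustrative.

```python
# Values as reported in the results table (BARTScore: less negative is better).
modi_bartscore = -2.14
blender_bartscore = -2.77

# Quality delta: MODI minus baseline, rounded to the table's precision.
delta = round(modi_bartscore - blender_bartscore, 2)

# Relative cost: MODI runs at about 20% of LLM-BLENDER's FLOP cost.
relative_cost = 0.20
cost_reduction = 1.0 - relative_cost   # fraction of cost saved
cost_factor = 1.0 / relative_cost      # how many times cheaper

print(f"BARTScore delta: +{delta:.2f}")
print(f"Cost reduction: {cost_reduction:.0%}, i.e. about {cost_factor:.0f}x cheaper")
```

A 20% relative cost corresponds to an 80% reduction, or about a 5× saving.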
What To Try In 7 Days
Collect 5–10k representative queries and score candidate model outputs with BARTScore.
Train a small DeBERTa regressor to predict per-model quality from queries.
Implement a per-query knapsack solver that selects models under a FLOP budget and fuse outputs with an existing generator.
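The knapsack step in the last item could look roughly like the sketch below. It is a generic 0/1 knapsack by dynamic programming, not the paper's implementation. The function name `select_models`, the integer cost units, and the non-negative quality scores are all assumptions. Raw BARTScore values are negative, so in practice the regressor's predictions would need shifting (for example by subtracting a floor value) before use here.

```python
from typing import List, Sequence, Tuple


def select_models(
    quality: Sequence[float],  # assumed: predicted per-model quality, shifted to be >= 0
    cost: Sequence[int],       # assumed: per-model cost in integer units (e.g. bucketed FLOPs)
    budget: int,               # assumed: per-query budget in the same units
) -> Tuple[float, List[int]]:
    """Pick the subset of models with the highest summed quality within the budget."""
    n = len(quality)
    # best[i][b] = best total quality using only the first i models with total cost <= b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        q, c = quality[i - 1], cost[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]          # option 1: skip model i-1
            if c <= b:
                take = best[i - 1][b - c] + q    # option 2: include model i-1
                if take > best[i][b]:
                    best[i][b] = take

    # Backtrack through the table to recover which models were included.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= cost[i - 1]
    chosen.reverse()
    return best[n][budget], chosen


# Hypothetical scores and costs for three candidate models.
total, chosen = select_models([3.0, 4.0, 5.0], [2, 3, 4], budget=5)
print(total, chosen)
```

With a budget smaller than the cheapest model, the function returns an empty selection. A deployment would need a fallback for that case, such as always running the cheapest model.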
Risks & Boundaries
Limitations
Evaluated only on MixInstruct with BARTScore; no human judgments are reported.
No code release, so engineering effort is needed to reproduce the pipeline.
When Not To Use
When you need human-rated quality as the primary objective.
When your query distribution differs strongly from MixInstruct.
Failure Modes
Regressor mispredictions select low-quality models and hurt final output.
Too-small budgets exclude helpful models, lowering quality.

