Overview
Production Readiness
0.5
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
You can cut ensemble inference cost to about 20% of a prior ensembling method's FLOP cost (roughly a 5× reduction) while improving automatic quality, making LLM deployment cheaper and more scalable for high-throughput services.
Summary TLDR
The paper introduces MODI, a practical pipeline that picks a cost-aware subset of open-source LLMs per query. It frames quality vs. cost as a bi-objective problem, uses an ε budget constraint to convert it into a 0/1 knapsack, and uses a DeBERTa regressor to predict per-model quality. On the MixInstruct benchmark, MODI reaches a BARTScore of −2.14 versus −2.77 for prior ensembling (higher is better) while using about 20% of the prior method's FLOP cost.
Problem Statement
Naive ensembling of multiple LLMs raises quality but makes inference expensive and slow. The paper asks how to select, for each query, a small subset of diverse open-source models that maximizes response quality within a user-specified budget.
Main Contribution
Formulate LLM ensembling as a bi-objective quality-vs-cost combinatorial problem.
Apply an ε-constraint to reduce the bi-objective problem to a 0/1 knapsack solved by dynamic programming.
Build MODI: predict per-model quality with a DeBERTa regressor and run knapsack selection, then fuse selected outputs with an existing GEN-FUSER.
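The selection step described above can be sketched as a textbook 0/1 knapsack solved by dynamic programming. This is a minimal illustration, not the paper's implementation: the function name `select_models`, the integer cost units, and the assumption that predicted qualities have been shifted to non-negative utilities (raw BARTScore values are negative) are all hypothetical choices made for the example.

```python
from typing import List, Tuple


def select_models(pred_quality: List[float],
                  costs: List[int],
                  budget: int) -> Tuple[float, List[int]]:
    """Pick the subset of models with maximum total predicted quality
    whose total cost stays within `budget` (classic 0/1 knapsack DP).

    Assumptions (not from the paper): costs are positive integers in some
    fixed unit, and pred_quality holds non-negative utilities.
    """
    n = len(pred_quality)
    # best[i][b] = best total quality using only the first i models
    # with total cost at most b.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        q, c = pred_quality[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]              # skip model i-1
            if c <= b and best[i - 1][b - c] + q > best[i][b]:
                best[i][b] = best[i - 1][b - c] + q  # take model i-1

    # Backtrack to recover which models were taken.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    chosen.reverse()
    return best[n][budget], chosen


# Hypothetical numbers: one large model vs. two smaller ones.
total, picked = select_models([0.6, 0.5, 0.4], [13, 7, 7], budget=14)
print(picked)  # indices of the selected models
```

In this toy case the two smaller models together fit the budget and beat the single large model, which is the kind of trade-off the budgeted selection is meant to capture. The selected models' outputs would then be passed to the fusion step.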
Key Findings
MODI achieves higher automatic quality than prior ensembling on MixInstruct (BARTScore −2.14 vs. −2.77).
MODI uses about 20% of the inference FLOP cost of LLM-BLENDER.
Framing as an ε-constrained problem turns the selection into a classic knapsack.
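One way to write the reduction down is given here as a sketch. The notation is an assumption introduced for illustration, not taken from the paper: $x_i \in \{0,1\}$ indicates whether model $i$ is selected for a query, $\hat{q}_i$ is its predicted quality, $c_i$ its cost, and $\varepsilon$ the user budget. The additive quality objective is also an assumption, implied by the knapsack framing.

```latex
% Bi-objective form: maximize quality, minimize cost
\max_{x \in \{0,1\}^n} \sum_{i=1}^{n} \hat{q}_i \, x_i
\qquad \text{and} \qquad
\min_{x \in \{0,1\}^n} \sum_{i=1}^{n} c_i \, x_i

% \varepsilon-constraint form: keep quality as the objective,
% move cost into a budget constraint (a 0/1 knapsack)
\max_{x \in \{0,1\}^n} \sum_{i=1}^{n} \hat{q}_i \, x_i
\quad \text{subject to} \quad
\sum_{i=1}^{n} c_i \, x_i \le \varepsilon
```

Sweeping $\varepsilon$ over a range of budgets traces out different quality-cost trade-offs, each solved as a separate knapsack instance.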
Results
BARTScore
−2.14 (MODI) vs. −2.77 (prior ensembling)
Inference cost (relative)
About 20% of the prior method's FLOP cost
Regression training data
10k examples
Who Should Care
What To Try In 7 Days
Collect 5–10k representative queries and score candidate model outputs with BARTScore.
Train a small DeBERTa regressor to predict per-model quality from queries.
Implement a per-query knapsack solver that selects models under a FLOP budget and fuse outputs with an existing generator.
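A FLOP budget needs a per-model cost estimate before any selection can run. The sketch below uses the common rule of thumb of about 2 FLOPs per parameter per generated token for a forward pass. This is an assumption for illustration, and the paper's own cost accounting may differ. Parameter counts are the nominal sizes read off the model names; models whose names carry no size are left out rather than guessed. The helper names and the 1 GFLOP unit are hypothetical.

```python
import math

# Nominal parameter counts taken from the model names (assumption:
# the name reflects the actual size closely enough for budgeting).
NOMINAL_PARAMS = {
    "vicuna-13b-1.1": 13e9,
    "dolly-v2-12b": 12e9,
    "stablelm-tuned-alpha-7b": 7e9,
    "koala-7B-HF": 7e9,
    "mpt-7b-instruct": 7e9,
}


def flops_per_token(n_params: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs per parameter per token."""
    return 2.0 * n_params


def to_weight(n_params: float, unit: float = 1e9) -> int:
    """Convert a cost estimate to an integer weight (in `unit` FLOPs),
    rounding up so the budget is never underestimated."""
    return math.ceil(flops_per_token(n_params) / unit)


weights = {name: to_weight(p) for name, p in NOMINAL_PARAMS.items()}
for name, w in weights.items():
    print(f"{name}: {w} GFLOPs/token")
```

Integer weights like these are what a dynamic-programming knapsack solver expects. A coarser unit shrinks the DP table at the price of less precise budgeting.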
Agent Features
Tool Use
- DeBERTa regression for quality prediction
- GEN-FUSER for response fusion
- Knapsack DP for selection
Frameworks
- DeBERTa
- GEN-FUSER
Optimization Features
Token Efficiency
- budget defined in FLOPs/token per query
Infra Optimization
- cuts total FLOPs by selecting fewer models
Model Optimization
- select subset of models per query
System Optimization
- regressor guides selection to trade quality and cost
Training Optimization
- train DeBERTa regressor on 10k examples
Inference Optimization
- budgeted model selection via 0/1 knapsack
- reduce number of model invocations per query
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluated only on MixInstruct and BARTScore, not on human judgments.
- No code release, so engineering effort is needed to reproduce pipeline.
- Regressor trained on 10k samples may not generalize to other domains.
- Assumes accurate per-model FLOP/token cost estimates.
When Not To Use
- When you need human-rated quality as the primary objective.
- When your query distribution differs strongly from MixInstruct.
- When closed-source models are mandatory or per-model cost estimates are unavailable.
Failure Modes
- Regressor mispredictions select low-quality models and hurt final output.
- Too-small budgets exclude helpful models, lowering quality.
- Fusion errors (GEN-FUSER) can degrade ensemble gains.
Core Entities
Models
- alpaca-native
- vicuna-13b-1.1
- dolly-v2-12b
- stablelm-tuned-alpha-7b
- SFT
- koala-7B-HF
- flan-t5-xxl
- mpt-7b-instruct
Metrics
- BARTScore
Datasets
- MixInstruct
Context Entities
Models
- LLM-BLENDER

