Pick subsets of open-source LLMs per query to improve quality while cutting inference cost

December 26, 2023 · 5 min

Overview

Production Readiness

0.5

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

1

Authors

Aditi Singla, Aditya Singh, Kanishk Kukreja

Links

Abstract / PDF

Why It Matters For Business

You can cut ensemble inference cost to about 20% of a prior ensembling baseline (roughly 5× cheaper) while improving automatic quality scores, making LLM deployment cheaper and more scalable for high-throughput services.

Summary TLDR

The paper introduces MODI, a practical pipeline that picks a cost-aware subset of open-source LLMs per query. It frames quality vs cost as a bi-objective problem, uses an ε-constraint (a cost budget) to reduce it to a 0/1 knapsack, and uses a DeBERTa regressor to predict per-model quality. On the MixInstruct benchmark, MODI improves automatic quality (BARTScore −2.14, higher is better) over prior ensembling with LLM-BLENDER (−2.77) while using about 20% of its FLOP cost.

Problem Statement

Naive ensembling of multiple LLMs raises quality but makes inference expensive and slow. The paper asks how to select a small subset of diverse open-source models per query so as to maximize response quality under a user-specified budget.

Main Contribution

Formulate LLM ensembling as a bi-objective quality-vs-cost combinatorial problem.

Apply an ε-constraint to reduce the bi-objective problem to a 0/1 knapsack solved by dynamic programming.

Build MODI: predict per-model quality with a DeBERTa regressor and run knapsack selection, then fuse selected outputs with an existing GEN-FUSER.
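The three contributions above can be written compactly. The following is a sketch of the formulation, not the paper's own notation: the symbols $x_i$ (1 if model $i$ is selected), $\hat{q}_i$ (predicted quality of model $i$ for the current query), $c_i$ (cost of model $i$), and $\varepsilon$ (the budget) are assumptions introduced here, as is the additive form of the quality objective, which is what the knapsack reduction implies.

```latex
% Notation is illustrative, not taken from the paper.
\begin{align*}
\text{Bi-objective:}\quad
  & \max_{x \in \{0,1\}^n} \sum_{i=1}^{n} \hat{q}_i\, x_i
    \quad\text{and}\quad
    \min_{x \in \{0,1\}^n} \sum_{i=1}^{n} c_i\, x_i \\
\text{$\varepsilon$-constraint:}\quad
  & \max_{x \in \{0,1\}^n} \sum_{i=1}^{n} \hat{q}_i\, x_i
    \quad\text{s.t.}\quad \sum_{i=1}^{n} c_i\, x_i \le \varepsilon \\
\text{DP recurrence:}\quad
  & V(i, b) = \max\bigl( V(i-1, b),\; V(i-1, b - c_i) + \hat{q}_i \bigr)
    \quad\text{for } c_i \le b, \qquad V(0, b) = 0
\end{align*}
```

When $c_i > b$ the recurrence reduces to $V(i, b) = V(i-1, b)$, and $V(n, \varepsilon)$ is the best achievable predicted quality within the budget. The recurrence assumes integer costs, so real-valued FLOP figures would have to be rounded to a common unit first.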

Key Findings

MODI achieves a higher automatic quality score than prior ensembling on MixInstruct.

Numbers: BARTScore of −2.14 for MODI vs −2.77 for LLM-BLENDER (Δ +0.63)

MODI reduces inference FLOP cost substantially compared to LLM-BLENDER.

Numbers: MODI uses about 20% of LLM-BLENDER's cost

Framing as an ε-constrained problem turns the selection into a classic knapsack.
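To make the knapsack framing concrete, here is a minimal sketch of a 0/1 knapsack solved by dynamic programming. It is not the paper's implementation. The function name `select_models`, the toy numbers, and the use of "billions of parameters" as an integer cost unit are all assumptions for illustration. The sketch also assumes the predicted quality values have been shifted to be non-negative, since raw BARTScore values are negative and a sum-maximizing knapsack would otherwise select nothing.

```python
from typing import List, Sequence, Tuple


def select_models(
    quality: Sequence[float], cost: Sequence[int], budget: int
) -> Tuple[float, List[int]]:
    """0/1 knapsack by dynamic programming.

    quality[i]: predicted quality of model i (assumed shifted to be >= 0).
    cost[i]:    integer cost of model i (e.g. FLOPs/token rounded to a unit).
    budget:     integer budget (the epsilon in the constraint).
    Returns (best total quality, sorted indices of the chosen models).
    """
    n = len(quality)
    # best[i][b] = max total quality using only the first i models, total cost <= b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        q, c = quality[i - 1], cost[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip model i-1
            if c <= b and best[i - 1][b - c] + q > best[i][b]:
                best[i][b] = best[i - 1][b - c] + q  # option 2: take model i-1

    # Walk the table backwards to recover which models were taken.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= cost[i - 1]
    return best[n][budget], sorted(chosen)


# Toy example (hypothetical numbers): four models, budget of 20 cost units.
total, picked = select_models([3.0, 2.0, 1.5, 1.0], [13, 7, 7, 11], budget=20)
print(total, picked)
```

The table has `(n + 1) * (budget + 1)` entries, so the cost unit controls both precision and running time: a coarser unit gives a smaller table at the price of rounding error in the budget.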

Results

BARTScore

Value: MODI −2.14

Baseline: LLM-BLENDER −2.77

Inference cost (relative)

Value: MODI ≈ 20% of LLM-BLENDER cost

Baseline: LLM-BLENDER = 100%

Regression training data

Value: 10k queries

Who Should Care

What To Try In 7 Days

Collect 5–10k representative queries and score candidate model outputs with BARTScore.

Train a small DeBERTa regressor to predict per-model quality from queries.

Implement a per-query knapsack solver that selects models under a FLOP budget and fuse outputs with an existing generator.
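The three steps above can be wired together as in the sketch below. It only shows how the pieces connect and makes several assumptions: the regressor, the candidate models, and the fuser are passed in as plain callables and stubbed with lambdas, every name (`answer_query`, `exhaustive_select`, `model_a`, and so on) is hypothetical, and selection uses brute-force enumeration as a stand-in for a knapsack DP. Brute force is exact but exponential in the pool size, so it is only reasonable for a handful of models.

```python
from itertools import combinations
from typing import Callable, Dict, List


def exhaustive_select(
    scores: Dict[str, float], costs: Dict[str, int], budget: int
) -> List[str]:
    """Exact subset selection by brute force (a stand-in for a knapsack DP)."""
    names = sorted(scores)
    best, best_value = [], 0.0
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if sum(costs[m] for m in subset) <= budget:
                value = sum(scores[m] for m in subset)
                if value > best_value:
                    best, best_value = list(subset), value
    return best


def answer_query(
    query: str,
    predict_quality: Callable[[str], Dict[str, float]],
    generate: Callable[[str, str], str],
    fuse: Callable[[str, List[str]], str],
    costs: Dict[str, int],
    budget: int,
    select: Callable[..., List[str]] = exhaustive_select,
) -> str:
    scores = predict_quality(query)                     # regressor: one score per model
    chosen = select(scores, costs, budget)              # budgeted selection
    candidates = [generate(m, query) for m in chosen]   # run only the chosen models
    return fuse(query, candidates)                      # fuse their outputs


# Stubbed components with hypothetical names and numbers.
reply = answer_query(
    "What is a knapsack problem?",
    predict_quality=lambda q: {"model_a": 3.0, "model_b": 2.0, "model_c": 1.5},
    generate=lambda m, q: f"[{m}] draft answer",
    fuse=lambda q, cands: " | ".join(cands),
    costs={"model_a": 13, "model_b": 7, "model_c": 7},
    budget=20,
)
print(reply)
```

Keeping each stage behind a callable makes it straightforward to swap the stubs for a trained regressor, real model endpoints, and a fusion model without touching the control flow.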

Agent Features

Tool Use

  • DeBERTa regression for quality prediction
  • GEN-FUSER for response fusion
  • Knapsack DP for selection

Frameworks

  • DeBERTa
  • GEN-FUSER

Optimization Features

Token Efficiency

  • budget defined in FLOPs/token per query

Infra Optimization

  • cuts total FLOPs by selecting fewer models

Model Optimization

  • select subset of models per query

System Optimization

  • regressor guides selection to trade quality and cost

Training Optimization

  • train DeBERTa regressor on 10k examples

Inference Optimization

  • budgeted model selection via 0/1 knapsack
  • reduce number of model invocations per query

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluated only on MixInstruct and BARTScore, not on human judgments.
  • No code release, so engineering effort is needed to reproduce pipeline.
  • Regressor trained on 10k samples may not generalize to other domains.
  • Assumes accurate per-model FLOP/token cost estimates.
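On the last limitation: if measured costs are not available, one rough way to get per-model cost estimates is the common rule of thumb that a transformer forward pass costs about 2 FLOPs per parameter per token. The sketch below uses that approximation, which is an assumption here and not necessarily the paper's cost model. It ignores the attention term that grows with context length and differences between decoder-only and encoder-decoder models. The parameter counts are nominal sizes read off the model names, and the helper names are hypothetical.

```python
def forward_flops_per_token(n_params: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2.0 * n_params


def cost_units(n_params: float, unit: float = 1e9) -> int:
    """Round per-token cost to a whole number of `unit` FLOPs (integer, for a DP)."""
    return max(1, round(forward_flops_per_token(n_params) / unit))


# Nominal sizes taken from the model names; real parameter counts differ slightly.
NOMINAL_PARAMS = {
    "vicuna-13b-1.1": 13e9,
    "dolly-v2-12b": 12e9,
    "stablelm-tuned-alpha-7b": 7e9,
    "koala-7B-HF": 7e9,
    "mpt-7b-instruct": 7e9,
}

# Integer cost per model, in GFLOPs per generated token.
costs = {name: cost_units(p) for name, p in NOMINAL_PARAMS.items()}
print(costs)
```

Estimates like these are only as good as the approximation behind them, so it is worth checking them against measured latency or profiler output before relying on the resulting budgets.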

When Not To Use

  • When you need human-rated quality as the primary objective.
  • When your query distribution differs strongly from MixInstruct.
  • When closed-source models are mandatory or cost estimations are unavailable.

Failure Modes

  • Regressor mispredictions select low-quality models and hurt final output.
  • Too-small budgets exclude helpful models, lowering quality.
  • Fusion errors (GEN-FUSER) can degrade ensemble gains.

Core Entities

Models

  • alpaca-native
  • vicuna-13b-1.1
  • dolly-v2-12b
  • stablelm-tuned-alpha-7b
  • SFT
  • koala-7B-HF
  • flan-t5-xxl
  • mpt-7b-instruct

Metrics

  • BARTScore

Datasets

  • MixInstruct

Context Entities

Models

  • LLM-BLENDER