Overview
Promising single-LLM results show large storage savings and small quality loss, but evaluation is limited to one model (OPT-350m), one dataset (SQuAD), and a static network; multi-LLM scaling is unaddressed.
Citations: 0
Evidence Strength: 0.50
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/4
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
If you serve LLMs from edge nodes, joint per-layer placement and quantization can cut weight storage and network load by about 87.5% while keeping task loss almost unchanged (0.0591 → 0.0605 on SQuAD). That lowers serving cost and helps latency-sensitive apps.
Summary TLDR
DILEMMA is a framework that jointly decides (a) which edge server runs each LLM layer and (b) how many bits to use per layer. The paper formulates this as an Integer Linear Program (ILP) that minimizes total inference completion time under resource limits and a weight-difference performance budget. On OPT-350m with SQuAD, the method reduces layer precision to about 12.5% of the original bit usage (≈87.5% size reduction) with a very small change in loss (0.0591 → 0.0605 in the worst tested settings). The ILP is solved centrally for the single-LLM case; the paper shows that the multi-LLM case is NP-hard.
Problem Statement
Edge servers are resource-limited, yet smart-city apps need low-latency LLM inference. How do you split an LLM across heterogeneous edge servers and choose per-layer quantization so that end-to-end, token-by-token completion time is minimized while model performance stays within an error budget?
Main Contribution
Formulate joint per-layer placement and per-layer quantization on heterogeneous edge servers as an ILP (called DILEMMA) that minimizes completion time under resource and performance constraints.
Introduce a practical performance constraint using weight-difference (teacher-student) instead of brute-force metric evaluation; linearize it for the ILP.
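The weight-difference idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's definition: the quantizer is a simple uniform truncation over a fixed range, the difference measure is a mean squared difference between original (teacher) and quantized (student) weights, and the function names, bit choices, and sample weights are hypothetical.

```python
def truncate_quantize(weights, bits, w_min=-1.0, w_max=1.0):
    """Map each weight to the lower edge of its bucket among 2**bits
    uniform buckets over [w_min, w_max] (truncation, not rounding)."""
    step = (w_max - w_min) / (2 ** bits)
    quantized = []
    for w in weights:
        # Clip into the representable range before bucketing.
        clipped = min(max(w, w_min), w_max - 1e-12)
        index = int((clipped - w_min) / step)
        quantized.append(w_min + index * step)
    return quantized


def weight_difference(teacher, student):
    """Assumed proxy: mean squared difference between the two weight lists."""
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)


def smallest_bits_within_budget(weights, delta, candidates=(2, 4, 8, 16)):
    """Return the fewest bits whose weight difference stays within delta,
    or None if no candidate satisfies the budget."""
    for bits in sorted(candidates):
        student = truncate_quantize(weights, bits)
        if weight_difference(weights, student) <= delta:
            return bits
    return None


# Hypothetical weights for one layer.
layer_weights = [-0.9, -0.3, 0.0, 0.25, 0.7]
chosen_bits = smallest_bits_within_budget(layer_weights, delta=0.01)
```

The point of such a proxy is that it can be checked per layer from the weights alone, without running a task evaluation for every candidate bit width.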
Key Findings
DILEMMA can reduce total parameter-bit usage to about 12.5% of the original (i.e., ~87.5% reduction).
Model quality changes very little after quantization at the evaluated settings: SQuAD loss rises from 0.0591 to 0.0605 (+0.0014) at the 12.5% quantization ratio.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Loss (original) | 0.0591 | — | — | SQuAD | Table 1 row: original model loss = 0.0591 | Table 1 |
| Loss (quantized, aggressive / 12.5% ratio) | 0.0605 | original 0.0591 | +0.0014 | SQuAD | Table 1 rows for δ ≥ 0.01 show loss = 0.0605 at quant. ratio 12.50% | Table 1 |
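As a quick check of the arithmetic behind these rows, the snippet below recomputes the size reduction and the loss delta from the reported values. The variable names are illustrative; the numbers are the ones in the table.

```python
original_loss = 0.0591
quantized_loss = 0.0605
quant_ratio = 0.125  # quantized bit usage as a fraction of the original

size_reduction = 1.0 - quant_ratio
loss_delta = quantized_loss - original_loss
relative_loss_increase = loss_delta / original_loss

print(f"size reduction: {size_reduction:.1%}")                   # 87.5%
print(f"absolute loss delta: {loss_delta:+.4f}")                 # +0.0014
print(f"relative loss increase: {relative_loss_increase:.2%}")   # 2.37%
```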
What To Try In 7 Days
Profile an LLM (per-layer FLOPs and output tensor sizes) to estimate weight storage and per-layer transfer sizes.
Run a small ILP (PuLP) for a single-model split across local edge machines to see latency vs storage trade-offs.
Try per-layer truncation quantization (4–8 bits) and measure loss/perplexity on a representative dev set.
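Before setting up an ILP solver, the trade-off can be explored on a toy instance with exhaustive search. The sketch below is not the paper's formulation: the layer and server numbers are made up, compute time is assumed to scale linearly with bit width, and the per-layer quality penalty `2 ** -bits` is a placeholder for a real weight-difference measurement. Brute force only works for a handful of layers; a solver such as PuLP would replace the loop in practice.

```python
from itertools import product

FULL_BITS = 16
BIT_CHOICES = (4, 8, 16)

# Hypothetical profile per layer: (flops, parameter_count, activation_bytes)
LAYERS = [
    (4e9, 2e6, 1e6),
    (8e9, 4e6, 1e6),
    (4e9, 2e6, 1e6),
]
# Hypothetical servers: (flops_per_second, memory_capacity_bits)
SERVERS = [
    (2e12, 6e7),  # fast but small
    (5e11, 2e8),  # slow but large
]
BANDWIDTH = 1e8  # bytes per second between any two servers (assumed uniform)


def penalty(bits):
    """Placeholder quality penalty; smaller bit widths cost more."""
    return 2.0 ** -bits


def completion_time(assign):
    """Compute time plus transfer time whenever consecutive layers
    sit on different servers."""
    total = 0.0
    for i, (server, bits) in enumerate(assign):
        flops, _, activation_bytes = LAYERS[i]
        total += flops * (bits / FULL_BITS) / SERVERS[server][0]
        if i + 1 < len(assign) and assign[i + 1][0] != server:
            total += activation_bytes / BANDWIDTH
    return total


def feasible(assign, budget):
    """Check per-server memory limits and the total quality budget."""
    used = [0.0] * len(SERVERS)
    for i, (server, bits) in enumerate(assign):
        used[server] += LAYERS[i][1] * bits
    if any(u > SERVERS[s][1] for s, u in enumerate(used)):
        return False
    return sum(penalty(bits) for _, bits in assign) <= budget


def solve(budget):
    """Return (time, assignment) for the fastest feasible plan, or None."""
    options = list(product(range(len(SERVERS)), BIT_CHOICES))
    best = None
    for assign in product(options, repeat=len(LAYERS)):
        if feasible(assign, budget):
            t = completion_time(assign)
            if best is None or t < best[0]:
                best = (t, assign)
    return best


loose = solve(budget=1.0)              # aggressive quantization allowed
tight = solve(budget=3 * 2.0 ** -16)   # forces every layer to 16 bits
```

In this toy setup the loose budget lets all layers fit on the fast server at 4 bits, while the tight budget pushes the full-precision model onto the slow, larger server, which is the kind of latency versus storage interaction the joint formulation is meant to capture.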
Risks & Boundaries
Limitations
Evaluated only on OPT-350m and SQuAD; generalization to larger LLMs and tasks is untested.
The solver approach targets single-LLM setups; the paper proves that the multi-LLM case is NP-hard.
When Not To Use
When you must serve multiple LLMs concurrently (scales poorly; NP-hard).
When network links or workloads change rapidly (no online/reactive placement shown).
Failure Modes
ILP becomes intractable as number of models or servers grows.
Quantization guided by weight difference may not reflect downstream task degradation on other datasets.

