Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If you serve LLMs from edge nodes, joint per-layer placement and quantization can cut weight storage and network load by about 87.5% while keeping task accuracy almost unchanged, lowering both serving cost and latency for latency-sensitive applications.
Summary TLDR
DILEMMA is a framework that jointly decides (a) which edge server runs each LLM layer and (b) how many bits each layer uses. The paper formulates this as an Integer Linear Program (ILP) that minimizes total inference completion time under resource limits and a weight-difference performance budget. On OPT-350m with SQuAD, the method reduces layer precision to about 12.5% of the original bit usage (≈87.5% size reduction) with a very small change in loss (0.0591 → 0.0605 in the worst tested settings). The ILP is solved centrally for the single-LLM case; the multi-LLM case is shown to be NP-hard.
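To make the 12.5% figure concrete, the arithmetic can be sketched as below. The parameter count (about 350 million, read off the model name) and the 32-bit baseline are assumptions for illustration, not values stated in this summary.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store num_params weights at the given precision."""
    return num_params * bits_per_weight / 8

NUM_PARAMS = 350_000_000   # assumption: rough size implied by the name OPT-350m
BASELINE_BITS = 32         # assumption: full-precision baseline
QUANT_RATIO = 0.125        # remaining-bit ratio reported in the summary

quant_bits = BASELINE_BITS * QUANT_RATIO               # 4.0 bits per weight
original = weight_bytes(NUM_PARAMS, BASELINE_BITS)     # 1.4e9 bytes
quantized = weight_bytes(NUM_PARAMS, quant_bits)       # 1.75e8 bytes
reduction = 1 - quantized / original                   # 0.875
```

Under these assumptions the weights shrink from roughly 1.4 GB to roughly 175 MB, which is also the amount of data that has to be shipped to edge servers.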
Problem Statement
Edge servers are resource-limited but smart‑city apps want low-latency LLM inference. How do you split an LLM across heterogeneous edge servers and choose per-layer quantization to minimize end-to-end token-by-token completion time while keeping model performance within an error budget?
Main Contribution
Formulate joint per-layer placement and per-layer quantization on heterogeneous edge servers as an ILP (called DILEMMA) that minimizes completion time under resource and performance constraints.
Introduce a practical performance constraint using weight-difference (teacher-student) instead of brute-force metric evaluation; linearize it for the ILP.
Prove the joint optimization is NP-hard for more than one LLM, and solve the single-LLM case with an off‑the‑shelf solver (PuLP).
Empirically evaluate on OPT-350m + SQuAD to show large bit reductions (down to ~12.5% quantization ratio) with minor loss/perplexity changes and study sensitivity to communication and CPU speed.
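The joint decision described above can be illustrated on a toy instance. The sketch below is not the paper's ILP and does not use PuLP: it swaps in exhaustive enumeration so it runs with the standard library only. Every number, the uniform link speed, the per-bit-width weight-difference table, and the tie-break on total weight difference are hypothetical choices made for illustration.

```python
from itertools import product

# Hypothetical toy instance: 3 layers, 2 edge servers, 3 bit-width options.
LAYER_FLOPS = [4e9, 4e9, 8e9]        # compute per layer for one token pass
LAYER_PARAMS = [10e6, 10e6, 10e6]    # number of weights per layer
LAYER_OUT_BYTES = [2e6, 2e6, 2e6]    # activation bytes sent to the next layer
SERVER_SPEED = [8e9, 2e9]            # FLOP/s: server 0 is fast, server 1 is slow
SERVER_STORAGE = [20e6, 200e6]       # bytes: the fast server is storage-limited
LINK_BYTES_PER_S = 100e6             # assumed uniform link speed between servers
BIT_OPTIONS = [4, 8, 16]
WEIGHT_DIFF = {4: 0.30, 8: 0.05, 16: 0.0}  # assumed per-layer weight-difference cost
DIFF_BUDGET = 0.5                    # performance budget on the summed difference

def completion_time(placement):
    """Seconds for one pass: per-layer compute plus transfers between servers."""
    total = 0.0
    for layer, server in enumerate(placement):
        total += LAYER_FLOPS[layer] / SERVER_SPEED[server]
        if layer + 1 < len(placement) and placement[layer + 1] != server:
            total += LAYER_OUT_BYTES[layer] / LINK_BYTES_PER_S
    return total

def feasible(placement, bits):
    """Check per-server storage limits and the weight-difference budget."""
    used = [0.0] * len(SERVER_SPEED)
    for layer, server in enumerate(placement):
        used[server] += LAYER_PARAMS[layer] * bits[layer] / 8
    fits = all(u <= cap for u, cap in zip(used, SERVER_STORAGE))
    return fits and sum(WEIGHT_DIFF[b] for b in bits) <= DIFF_BUDGET

def solve():
    """Enumerate every (placement, bits) pair; keep the fastest feasible one."""
    n_layers = len(LAYER_FLOPS)
    best = None
    for placement in product(range(len(SERVER_SPEED)), repeat=n_layers):
        for bits in product(BIT_OPTIONS, repeat=n_layers):
            if not feasible(placement, bits):
                continue
            # Tie-break equal times by the smaller total weight difference.
            key = (completion_time(placement), sum(WEIGHT_DIFF[b] for b in bits))
            if best is None or key < best[0]:
                best = (key, placement, bits)
    return best

best_key, best_placement, best_bits = solve()
```

In this instance, putting all three layers on the fast server would need 4-bit weights on at least two layers, which exceeds the difference budget. The search therefore keeps the first layer on the slow server at 16 bits and runs the other two on the fast server at 8 bits. Because the weight difference is a precomputed number per bit width, the budget is a plain sum over the chosen options, which is the kind of constraint an ILP can express linearly.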
Key Findings
DILEMMA can reduce total parameter-bit usage to about 12.5% of the original (i.e., ~87.5% reduction).
Model quality changes very little after quantization at the evaluated settings: loss moves from 0.0591 to 0.0605 in the worst tested settings.
Joint placement+quantization is NP-hard when serving multiple LLMs.
Results
Loss (original): 0.0591
Loss (quantized, aggressive / 12.5% ratio): 0.0605
Perplexity (original → quantized)
Quantization ratio (remaining bits): ~12.5%
Who Should Care
What To Try In 7 Days
Profile an LLM (per-layer FLOPs, weight sizes, and output tensor sizes) to estimate compute cost, weight storage, and per-layer transfer sizes.
Run a small ILP (PuLP) for a single-model split across local edge machines to see latency vs storage trade-offs.
Try per-layer truncation quantization (4–8 bits) and measure loss/perplexity on a representative dev set.
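For the quantization step, a minimal sketch is shown below. It assumes "truncation quantization" means snapping each weight down to the nearest lower level of a uniform grid with 2**bits levels; the paper's exact scheme may differ. The function names and the synthetic weights are hypothetical, and the mean absolute difference stands in for a weight-difference proxy.

```python
import math

def truncate_quantize(weights, bits):
    """Snap each weight down to the nearest lower level of a uniform 2**bits grid."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)          # constant layer: nothing to quantize
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + math.floor((w - lo) / step) * step for w in weights]

def mean_abs_diff(original, quantized):
    """Average absolute gap between original and quantized weights."""
    return sum(abs(a - b) for a, b in zip(original, quantized)) / len(original)

# Synthetic stand-in for one layer's weights (hypothetical data).
layer_weights = [math.sin(i) for i in range(64)]

diff_by_bits = {
    bits: mean_abs_diff(layer_weights, truncate_quantize(layer_weights, bits))
    for bits in (4, 8)
}
```

Each quantized weight sits at most one grid step below the original, so the difference is bounded by the step size for the chosen bit width. On a real model, the same comparison would be run per layer and followed by a loss or perplexity check on the dev set.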
Agent Features
Memory
- stores the past attention cache (assumed by the autoregressive latency model)
Tool Use
- Integer Linear Programming (placement solver)
- Knowledge Distillation (teacher-student weight supervision)
Frameworks
- Python PuLP
Architectures
- two-tier (edge + cloud)
Optimization Features
Token Efficiency
- models autoregressive decoding as token rounds (generating n tokens is counted as n forward passes)
Infra Optimization
- account for device-to-device link speeds
- edge CPU clock speed sensitivity
Model Optimization
- layer-wise quantization
- truncation quantization (per-layer)
- knowledge-distillation guided quantization
System Optimization
- joint placement + quantization ILP
- resource-aware (comm/compute/storage) constraints
Inference Optimization
- distributed layer placement
- minimize token-by-token completion time
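The token-by-token view and the link-speed sensitivity listed above can be sketched together. This is a simplified model under stated assumptions: a fixed placement, identical cost for every pass, and a single activation transfer per server boundary. The function name and all numbers are hypothetical.

```python
def generation_time(n_tokens, compute_s_per_pass, hops, bytes_per_hop, link_bytes_per_s):
    """Total seconds to generate n_tokens, one forward pass per token."""
    transfer_s = hops * bytes_per_hop / link_bytes_per_s
    return n_tokens * (compute_s_per_pass + transfer_s)

# Sweep the link speed for a fixed placement with one server boundary.
LINK_SPEEDS = [10e6, 100e6, 1e9]     # bytes per second (hypothetical)
times = [generation_time(10, 3.5, 1, 2e6, bw) for bw in LINK_SPEEDS]
```

With these numbers the compute term dominates, so a 100x faster link only moves the total from about 37.0 s to about 35.0 s. When transfers dominate instead, the same sweep would show whether a placement with fewer server boundaries is worth a slower server.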
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Evaluated only on OPT-350m and SQuAD; generalization to larger LLMs and tasks is untested.
- The solver approach targets single-LLM setups; the paper's theorem shows the multi-LLM case is NP-hard.
- Performance proxy is weight-difference (teacher-student) rather than end-task metrics for every layer, which may miss some behavioral changes.
- Network model assumes static link speeds and stored attention caches.
When Not To Use
- When you must serve multiple LLMs concurrently (scales poorly; NP-hard).
- When network links or workloads change rapidly (no online/reactive placement shown).
- When you need guaranteed per-metric fidelity across many evaluation metrics without fine-grained validation.
Failure Modes
- ILP becomes intractable as number of models or servers grows.
- Quantization guided by weight difference may not reflect downstream task degradation on other datasets.
- Placement decisions sensitive to link speed and CPU clock; poor estimates hurt latency.
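The intractability point in the first failure mode can be made tangible by counting raw assignments. The sketch below counts every combination of server and bit width per layer. An ILP solver does not enumerate this space directly, so the count is only a rough indicator of growth. The 24-layer figure is an assumed layer count used for illustration.

```python
def assignment_count(n_layers, n_servers, n_bit_options):
    """Number of ways to pick one server and one bit width for every layer."""
    return (n_servers * n_bit_options) ** n_layers

small = assignment_count(3, 2, 3)     # 216 combinations for a 3-layer toy model
larger = assignment_count(24, 2, 3)   # 6**24 for an assumed 24-layer model
```

Even with two servers and three bit widths, 24 layers already give more than 10**18 combinations, and each extra server or model multiplies the space further.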
Core Entities
Models
- OPT-350m
Metrics
- loss
- perplexity
- BLEU
- quantization ratio
- completion time
Datasets
- SQuAD

