Jointly place LLM layers on edge servers and quantize them to cut latency and memory while keeping accuracy.

March 3, 2025 (7 min read)

Overview

Decision Snapshot: Needs Validation

Promising single-LLM results show large storage savings and small quality loss, but evaluation is limited to one model (OPT-350m), one dataset (SQuAD), and a static network; multi-LLM scaling is unaddressed.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Minoo Hosseinzadeh, Hana Khamfroush

Why It Matters For Business

If you serve LLMs from edge nodes, joint per-layer placement and quantization can cut weight storage and network load by about 87.5% (to 12.5% of the original bit usage) while keeping task accuracy almost unchanged in the tested setting, which lowers cost and reduces latency for latency-sensitive apps.

Summary TLDR

DILEMMA is a framework that jointly decides (a) which edge server runs each LLM layer and (b) how many bits to use per layer. The paper formulates this as an Integer Linear Program that minimizes total inference completion time under resource limits and a weight-difference performance budget. On OPT-350m with SQuAD, the method can reduce layer precision to about 12.5% of the original bit usage (≈87.5% size reduction) with a very small change in loss (0.0591 → 0.0605 in the worst tested setting). The ILP is solved centrally for the single-LLM case; the paper shows that the multi-LLM case is NP-hard.

Problem Statement

Edge servers are resource-limited but smart‑city apps want low-latency LLM inference. How do you split an LLM across heterogeneous edge servers and choose per-layer quantization to minimize end-to-end token-by-token completion time while keeping model performance within an error budget?

Main Contribution

Formulate joint per-layer placement and per-layer quantization on heterogeneous edge servers as an ILP (called DILEMMA) that minimizes completion time under resource and performance constraints.

Introduce a practical performance constraint using weight-difference (teacher-student) instead of brute-force metric evaluation; linearize it for the ILP.
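The joint decision can be illustrated on a toy instance. The sketch below is not the paper's ILP: it enumerates every (server, bit-width) assignment for three layers with the standard library instead of calling a solver. All names and numbers (layer sizes, server speeds, storage caps, the per-layer error proxy, and the budget) are made-up assumptions for illustration.

```python
from itertools import product

# Hypothetical toy instance; none of these numbers come from the paper.
LAYER_PARAMS = [4_000, 6_000, 5_000]       # parameters per layer
LAYER_WORK = [8.0, 12.0, 10.0]             # compute units per layer
SERVER_SPEED = [2.0, 4.0]                  # compute units per ms
SERVER_MEM_BITS = [80_000, 120_000]        # weight storage per server, in bits
HOP_MS = 1.5                               # cost of moving activations between servers
BIT_CHOICES = [4, 8, 16]
ERROR_PROXY = {4: 0.30, 8: 0.05, 16: 0.0}  # assumed per-layer weight-difference proxy
ERROR_BUDGET = 0.45                        # stands in for the performance budget


def completion_time(servers):
    """Compute time on each assigned server plus one hop per server change."""
    compute = sum(LAYER_WORK[i] / SERVER_SPEED[s] for i, s in enumerate(servers))
    hops = sum(1 for a, b in zip(servers, servers[1:]) if a != b)
    return compute + hops * HOP_MS


def feasible(servers, bits):
    """Check per-server storage and the summed error proxy against the budget."""
    used = [0] * len(SERVER_SPEED)
    for i, s in enumerate(servers):
        used[s] += LAYER_PARAMS[i] * bits[i]
    fits = all(u <= cap for u, cap in zip(used, SERVER_MEM_BITS))
    return fits and sum(ERROR_PROXY[b] for b in bits) <= ERROR_BUDGET


def solve():
    """Return (time_ms, servers, bits) for the fastest feasible assignment, or None."""
    n = len(LAYER_PARAMS)
    best = None
    for servers in product(range(len(SERVER_SPEED)), repeat=n):
        for bits in product(BIT_CHOICES, repeat=n):
            if not feasible(servers, bits):
                continue
            # Ties on time are broken in favour of more total bits (higher precision).
            candidate = (completion_time(servers), -sum(bits), servers, bits)
            if best is None or candidate < best:
                best = candidate
    if best is None:
        return None
    time_ms, _, servers, bits = best
    return time_ms, servers, bits


if __name__ == "__main__":
    print(solve())
```

In this toy instance, 16-bit weights do not fit on the faster server but 8-bit weights do, so quantization is what makes the single-server, no-hop placement feasible. For instances too large to enumerate, an ILP solver such as PuLP would take the place of the nested loops.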

Key Findings

DILEMMA can reduce total parameter-bit usage to about 12.5% of the original (i.e., ~87.5% reduction).

Numbers: quantization ratio = 12.50% (table rows δ = 0.01, 0.1, 1.0)

Practical Use: You can sharply cut memory and network load on edge nodes; expect ~8× reduction in weight storage for OPT-350m when using aggressive layer-wise quantization.

Evidence Ref: Table 1 (quant. ratio column)
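The link between a 12.5% ratio, an 87.5% reduction, and an 8× shrink is simple arithmetic. The sketch below assumes the ratio is total quantized parameter-bits divided by total original parameter-bits; the layer sizes and the 32-bit baseline are illustrative assumptions, not values taken from the paper.

```python
def quantization_ratio(layer_params, layer_bits, baseline_bits):
    """Quantized parameter-bits divided by original parameter-bits."""
    quantized = sum(p * b for p, b in zip(layer_params, layer_bits))
    original = sum(layer_params) * baseline_bits
    return quantized / original


# Hypothetical layer sizes; every layer is cut from 32 bits to 4 bits.
layer_params = [1_000_000, 2_000_000, 1_500_000]
ratio = quantization_ratio(layer_params, [4, 4, 4], baseline_bits=32)
reduction = 1 - ratio      # fraction of weight storage saved
shrink_factor = 1 / ratio  # how many times smaller the weights become
```

With mixed per-layer bit widths the same function gives a parameter-weighted ratio, so large layers dominate the result.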

Model quality changes very little after quantization at evaluated settings.

Numbers: loss 0.0591 → 0.0605; perplexity 1.0609 → 1.0623

Practical Use: For question-answering with OPT-350m on SQuAD, expect negligible loss/perplexity hit when using the paper's quantization+placement choices.

Evidence Ref: Table 1 (loss and perplexity columns)
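The loss and perplexity figures can be cross-checked against each other. The sketch below assumes the reported loss is a mean cross-entropy in nats, which this summary does not state; under that assumption, perplexity is simply the exponential of the loss.

```python
import math


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


original = perplexity(0.0591)   # loss of the original model
quantized = perplexity(0.0605)  # loss at the 12.5% quantization ratio
relative_increase = (quantized - original) / original
```

Under this assumption both computed values agree with the reported perplexities to within about 1e-4, and the relative perplexity increase is roughly 0.14%.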

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Loss (original) | 0.0591 | n/a | n/a | SQuAD | Table 1 row: original model loss = 0.0591 | Table 1
Loss (quantized, aggressive / 12.5% ratio) | 0.0605 | original 0.0591 | +0.0014 | SQuAD | Table 1 rows for δ ≥ 0.01 show loss = 0.0605 at quant. ratio 12.50% | Table 1

What To Try In 7 Days

Profile an LLM (per-layer FLOPS and output tensor sizes) to estimate weight storage and per-layer transfer sizes.

Run a small ILP (PuLP) for a single-model split across local edge machines to see latency vs storage trade-offs.

Try per-layer truncation quantization (4–8 bits) and measure loss/perplexity on a representative dev set.
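A quick first experiment for the quantization step can run without a model at all. The sketch below is one interpretation of truncation quantization, assumed here to mean mapping weights onto a uniform grid and rounding down; the paper's exact scheme is not described in this summary. The random weights are a stand-in for one layer, and the mean absolute weight difference is only a cheap proxy, so loss and perplexity still need to be measured on a real dev set.

```python
import math
import random


def truncate_quantize(weights, bits):
    """Map weights onto 2**bits uniform levels over [min, max], rounding down."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)  # constant layer: nothing to quantize
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + math.floor((w - lo) / step) * step for w in weights]


def mean_abs_weight_difference(original, quantized):
    """Average absolute gap between original and quantized weights."""
    return sum(abs(a - b) for a, b in zip(original, quantized)) / len(original)


random.seed(0)
# Hypothetical stand-in for one layer's weights.
layer = [random.gauss(0.0, 0.02) for _ in range(1000)]

if __name__ == "__main__":
    for bits in (4, 6, 8):
        diff = mean_abs_weight_difference(layer, truncate_quantize(layer, bits))
        print(f"{bits} bits: mean |w - q(w)| = {diff:.6f}")
```

Running the same loop per layer on real weights gives a rough ranking of which layers tolerate fewer bits before committing to a full evaluation.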

Agent Features

Memory: stores past attention cache (assumed for autoregressive speed model)
Tool Use: Integer Linear Programming (placement solver); Knowledge Distillation (teacher-student weight supervision)
Frameworks: Python PuLP
Architectures: two-tier (edge + cloud)

Optimization Features

Token Efficiency: models autoregressive token rounds (n tokens → n passes considered)
Infra Optimization: account for device-to-device link speeds; edge CPU clock speed sensitivity
Model Optimization: layer-wise quantization; truncation quantization (per-layer); knowledge-distillation guided quantization
System Optimization: joint placement + quantization ILP; resource-aware (comm/compute/storage) constraints
Inference Optimization: distributed layer placement; minimize token-by-token completion time
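The token-by-token timing idea can be sketched as a small cost model. It assumes each generated token triggers one pass through all layers in order, each layer costs its FLOPs divided by its server's speed, an activation transfer is paid whenever consecutive layers sit on different servers, and the per-pass cost stays constant across tokens. The profile values and function names are hypothetical, and the hop back to the first server between tokens is ignored.

```python
# Hypothetical profile for a three-layer model on two edge servers.
PROFILE = {
    "layer_flops": [2e9, 2e9, 4e9],           # FLOPs per layer per pass
    "out_bytes": [1e6, 1e6, 1e6],             # activation size leaving each layer
    "server_flops_per_s": [1e10, 2e10],       # compute speed per server
    "link_bytes_per_s": {(0, 1): 1e8, (1, 0): 1e8},  # directed link speeds
}


def pass_time(placement, profile):
    """Seconds for one forward pass through all layers in order."""
    total = 0.0
    for i, server in enumerate(placement):
        total += profile["layer_flops"][i] / profile["server_flops_per_s"][server]
        is_last = i == len(placement) - 1
        if not is_last and placement[i + 1] != server:
            link = (server, placement[i + 1])
            total += profile["out_bytes"][i] / profile["link_bytes_per_s"][link]
    return total


def completion_time(n_tokens, placement, profile):
    """n generated tokens are modelled as n identical passes."""
    return n_tokens * pass_time(placement, profile)


split = completion_time(10, [0, 0, 1], PROFILE)     # layers split across servers
one_box = completion_time(10, [1, 1, 1], PROFILE)   # everything on the faster server
```

Comparing `split` and `one_box` shows how link speed and server speed trade off; whether the single-server option is available in practice depends on storage limits, which this timing sketch does not model.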

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Evaluated only on OPT-350m and SQuAD; generalization to larger LLMs and tasks is untested.

Solver approach targets single-LLM setups; theorem shows multi-LLM case is NP-hard.

When Not To Use

When you must serve multiple LLMs concurrently (scales poorly; NP-hard).

When network links or workloads change rapidly (no online/reactive placement shown).

Failure Modes

ILP becomes intractable as number of models or servers grows.

Quantization guided by weight difference may not reflect downstream task degradation on other datasets.

Core Entities

Models

OPT-350m

Metrics

loss, perplexity, BLEU, quantization ratio, completion time

Datasets

SQuAD