Jointly place LLM layers on edge servers and quantize them to cut latency and memory while keeping accuracy.

Overview

Decision Snapshot: Needs Validation

Promising single-LLM results show large storage savings and small quality loss, but evaluation is limited to one model (OPT-350m), one dataset (SQuAD), and a static network; multi-LLM scaling is unaddressed.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Minoo Hosseinzadeh, Hana Khamfroush


Why It Matters For Business

If you serve LLMs from edge nodes, joint per-layer placement and quantization can cut weight storage and network load by ~87.5% while keeping task accuracy almost unchanged, which lowers serving cost and helps latency-sensitive apps.

Who Should Care

CTO, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

DILEMMA is a framework that jointly decides (a) which edge server runs each LLM layer and (b) how many bits to use per layer. The paper formulates this as an Integer Linear Program that minimizes total inference completion time under resource limits and a weight-difference performance budget. On OPT-350m with SQuAD, the method reduces total bit usage to about 12.5% of the original (≈87.5% size reduction) with a very small change in loss (0.0591 → 0.0605 in the worst tested settings). The ILP is solved centrally for the single-LLM case; the paper shows that the multi-LLM case is NP-hard.

Problem Statement

Edge servers are resource-limited but smart‑city apps want low-latency LLM inference. How do you split an LLM across heterogeneous edge servers and choose per-layer quantization to minimize end-to-end token-by-token completion time while keeping model performance within an error budget?

Main Contribution

Formulate joint per-layer placement and per-layer quantization on heterogeneous edge servers as an ILP (called DILEMMA) that minimizes completion time under resource and performance constraints.

Introduce a practical performance constraint using weight-difference (teacher-student) instead of brute-force metric evaluation; linearize it for the ILP.
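To make the shape of the joint decision concrete, here is a minimal sketch of a toy instance. It is not the paper's formulation: it swaps the ILP solver for exhaustive search so that it runs with the standard library only, and every number, the `penalty` proxy for weight difference, and the hop-delay model are made-up assumptions for illustration.

```python
from itertools import product

# All values are hypothetical toy numbers, not figures from the paper.
LAYER_FLOPS = [4.0, 6.0, 5.0]        # compute per layer (GFLOP)
LAYER_PARAMS = [10.0, 20.0, 15.0]    # parameters per layer (millions)
SERVER_SPEED = [2.0, 4.0]            # compute speed per server (GFLOP/s)
SERVER_MEM_MBIT = [400.0, 300.0]     # weight storage per server (Mbit)
HOP_DELAY_S = 0.5                    # delay when consecutive layers sit on different servers
BIT_CHOICES = [4, 8, 16]
FULL_BITS = 16
ERROR_BUDGET = 20.0                  # cap on the summed quantization penalty


def penalty(bits):
    """Hypothetical stand-in for the weight-difference term: bits dropped."""
    return FULL_BITS - bits


def solve():
    """Exhaustively search placements and bit-widths; return (time, placement, bits)."""
    best = None
    n_layers, n_servers = len(LAYER_FLOPS), len(SERVER_SPEED)
    for placement in product(range(n_servers), repeat=n_layers):
        for bits in product(BIT_CHOICES, repeat=n_layers):
            # Performance constraint: total penalty must stay within the budget.
            if sum(penalty(b) for b in bits) > ERROR_BUDGET:
                continue
            # Storage constraint: millions of params x bits = Mbit per server.
            used = [0.0] * n_servers
            for layer, server in enumerate(placement):
                used[server] += LAYER_PARAMS[layer] * bits[layer]
            if any(u > cap for u, cap in zip(used, SERVER_MEM_MBIT)):
                continue
            # Objective: compute time plus a fixed delay per server change.
            compute = sum(LAYER_FLOPS[l] / SERVER_SPEED[s]
                          for l, s in enumerate(placement))
            hops = sum(a != b for a, b in zip(placement, placement[1:]))
            total = compute + HOP_DELAY_S * hops
            if best is None or total < best[0]:
                best = (total, placement, bits)
    return best


if __name__ == "__main__":
    print(solve())
```

In this toy, lower bit-widths matter because they let more layers fit on the faster server, while the error budget limits how far precision can drop. Exhaustive search only works for a handful of layers and servers; an ILP solver is the practical choice beyond that.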

Key Findings

DILEMMA can reduce total parameter-bit usage to about 12.5% of the original (i.e., ~87.5% reduction).

Numbers: quantization ratio = 12.50% (Table 1 rows δ = 0.01, 0.1, 1.0)

Practical Use: You can sharply cut memory and network load on edge nodes; expect ~8× reduction in weight storage for OPT-350m when using aggressive layer-wise quantization.

Evidence Ref: Table 1 (quantization ratio column)
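The ~8× figure follows directly from the 12.5% ratio. The sketch below works through the arithmetic under two assumptions that are not stated on this page: a nominal 350 million parameters (taken from the model name) and a 16-bit baseline precision.

```python
def weight_storage_mb(n_params, bits_per_param):
    """Weight storage in megabytes (10**6 bytes)."""
    return n_params * bits_per_param / 8 / 1e6


N_PARAMS = 350e6      # assumption: nominal size suggested by the name OPT-350m
BASELINE_BITS = 16    # assumption: the paper's baseline precision may differ
RATIO = 0.125         # quantization ratio reported in Table 1

full = weight_storage_mb(N_PARAMS, BASELINE_BITS)
quant = weight_storage_mb(N_PARAMS, BASELINE_BITS * RATIO)
print(f"full: {full:.0f} MB, quantized: {quant:.1f} MB, reduction: {full / quant:.0f}x")
# prints "full: 700 MB, quantized: 87.5 MB, reduction: 8x"
```

The reduction factor is 1 / 0.125 = 8 whatever the baseline precision; only the absolute sizes depend on the assumed parameter count and bit-width.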

Model quality changes very little after quantization at evaluated settings.

Numbers: loss 0.0591 → 0.0605; perplexity 1.0609 → 1.0623

Practical Use: For question-answering with OPT-350m on SQuAD, expect negligible loss/perplexity hit when using the paper's quantization+placement choices.

Evidence Ref: Table 1 (loss and perplexity columns)
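The reported loss and perplexity values can be cross-checked against each other. The sketch below assumes the usual relation perplexity = exp(loss), with loss as a mean cross-entropy in nats; this page does not state the paper's exact definition, so treat the check as a sanity test rather than a reproduction.

```python
import math


def perplexity(loss):
    """Perplexity under the assumed relation exp(mean cross-entropy in nats)."""
    return math.exp(loss)


orig_loss, quant_loss = 0.0591, 0.0605   # values reported from Table 1

delta = quant_loss - orig_loss
relative = delta / orig_loss
print(f"perplexity: {perplexity(orig_loss):.4f} -> {perplexity(quant_loss):.4f}")
print(f"loss delta: {delta:+.4f} ({relative:.1%} relative)")
```

Both computed perplexities agree with the reported 1.0609 and 1.0623 to within rounding of the four-decimal loss values, and the loss increase is about 2.4% in relative terms.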

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Loss (original)	0.0591	—	—	SQuAD	Table 1 row: original model loss = 0.0591	Table 1
Loss (quantized, aggressive / 12.5% ratio)	0.0605	original 0.0591	+0.0014	SQuAD	Table 1 rows for δ ≥ 0.01 show loss = 0.0605 at quant. ratio 12.50%	Table 1

What To Try In 7 Days

Profile an LLM (per-layer FLOPS and output tensor sizes) to estimate weight storage and per-layer transfer sizes.

Run a small ILP (PuLP) for a single-model split across local edge machines to see latency vs storage trade-offs.

Try per-layer truncation quantization (4–8 bits) and measure loss/perplexity on a representative dev set.
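For the quantization step, the following is a minimal sketch of one plausible reading of truncation quantization: map weights to a signed fixed-point code and zero the low-order bits. The paper may define the operation differently; the 16-bit starting width, the function names, and the sample weights are all assumptions for illustration.

```python
import math


def truncate_bits(weights, keep_bits, total_bits=16):
    """Map weights in [-1, 1) to fixed point, then zero the low-order bits."""
    scale = 1 << (total_bits - 1)
    drop = total_bits - keep_bits
    out = []
    for w in weights:
        q = math.floor(w * scale)    # signed fixed-point code
        q = (q >> drop) << drop      # clear the `drop` least significant bits
        out.append(q / scale)
    return out


def mean_abs_diff(a, b):
    """Mean absolute weight difference between two weight lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weights for one layer.
layer = [0.8125, -0.3301, 0.0417, -0.9, 0.25, 0.6003]
for bits in (4, 8, 16):
    err = mean_abs_diff(layer, truncate_bits(layer, bits))
    print(f"{bits:2d} bits: mean |w - w_q| = {err:.6f}")
```

The weight difference here is only a cheap proxy; the loss and perplexity on a representative dev set remain the measurement that matters.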

Agent Features

Memory

stores past attention cache (assumed for autoregressive speed model)

Tool Use

Integer Linear Programming (placement solver)

Knowledge Distillation (teacher-student weight supervision)

Frameworks

Python PuLP

Architectures

two-tier (edge + cloud)

Optimization Features

Token Efficiency

models autoregressive token rounds (n tokens → n passes considered)
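One way to picture the token-round model is the illustrative expression below. It is a sketch rather than the paper's objective, and every symbol is an assumption introduced here: $n$ is the number of generated tokens, $L$ the number of layers, $s(l)$ the server hosting layer $l$, $c_l$ the compute cost of layer $l$, $f_s$ the compute speed of server $s$, $d_l$ the output size of layer $l$, and $r_{s,s'}$ the link rate between servers $s$ and $s'$ (the transfer term is taken as zero when $s = s'$).

```latex
T(n) \approx n \left( \sum_{l=1}^{L} \frac{c_l}{f_{s(l)}}
      + \sum_{l=1}^{L-1} \frac{d_l}{r_{s(l),\,s(l+1)}} \right)
```

Under this reading, each generated token costs one full pass through the placed layers, so per-pass compute and transfer delays are multiplied by $n$.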

Infra Optimization

account for device-to-device link speeds

edge CPU clock speed sensitivity

Model Optimization

layer-wise quantization

truncation quantization (per-layer)

knowledge-distillation guided quantization

System Optimization

joint placement + quantization ILP

resource-aware (comm/compute/storage) constraints

Inference Optimization

distributed layer placement

minimize token-by-token completion time

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

https://rajpurkar.github.io/SQuAD-explorer/

Risks & Boundaries

Limitations

Evaluated only on OPT-350m and SQuAD; generalization to larger LLMs and tasks is untested.

Solver approach targets single-LLM setups; theorem shows multi-LLM case is NP-hard.

When Not To Use

When you must serve multiple LLMs concurrently (scales poorly; NP-hard).

When network links or workloads change rapidly (no online/reactive placement shown).

Failure Modes

ILP becomes intractable as number of models or servers grows.

Quantization guided by weight difference may not reflect downstream task degradation on other datasets.

Core Entities

Models

OPT-350m

Metrics

loss, perplexity, BLEU, quantization ratio, completion time

Datasets

SQuAD

