Jointly place LLM layers on edge servers and quantize them to cut latency and memory while keeping accuracy.

March 3, 2025 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Minoo Hosseinzadeh, Hana Khamfroush

Why It Matters For Business

If you serve LLMs from edge nodes, joint per-layer placement and quantization can cut weight storage and network load by roughly 87.5% while keeping task accuracy almost unchanged. That lowers serving cost and shortens response time for latency-sensitive applications.

Summary TLDR

DILEMMA is a framework that jointly decides (a) which edge server runs each LLM layer and (b) how many bits to use per layer. The paper formulates this as an Integer Linear Program (ILP) that minimizes total inference completion time under resource limits and a weight-difference performance budget. On OPT-350m with SQuAD, the method reduces layer precision to about 12.5% of the original bit usage (≈87.5% size reduction) with a very small change in loss (0.0591 → 0.0605 in the worst tested settings). The ILP is solved centrally for the single-LLM case; the multi-LLM case is proven NP-hard.
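The headline figures can be cross-checked with a few lines of arithmetic. The sketch below assumes perplexity is the exponential of the mean loss, which this summary does not state but which the reported pairs are consistent with up to rounding. The 16-bit and 32-bit baselines are hypothetical, used only to show what a 12.5% ratio means in bits per weight.

```python
import math

# Figures reported in the summary.
QUANT_RATIO = 0.125        # remaining fraction of the original bits
LOSS_ORIGINAL = 0.0591
LOSS_QUANTIZED = 0.0605


def size_reduction(ratio: float) -> float:
    """Fraction of parameter bits removed for a given remaining-bit ratio."""
    return 1.0 - ratio


def perplexity(mean_loss: float) -> float:
    """Perplexity, ASSUMING it is exp(mean loss in nats)."""
    return math.exp(mean_loss)


def bits_after(baseline_bits: int, ratio: float) -> float:
    """Bits per weight left after applying the ratio to a baseline width."""
    return baseline_bits * ratio


if __name__ == "__main__":
    print(f"size reduction: {size_reduction(QUANT_RATIO):.1%}")
    print(f"perplexity from loss: {perplexity(LOSS_ORIGINAL):.4f} "
          f"-> {perplexity(LOSS_QUANTIZED):.4f}")
    # Hypothetical baselines; the summary does not give the original precision.
    for baseline in (16, 32):
        print(f"{baseline}-bit baseline -> {bits_after(baseline, QUANT_RATIO):g} bits")
```

The recomputed perplexities land within rounding distance of the reported 1.0609 and 1.0623, which suggests the loss and perplexity rows describe the same measurement.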

Problem Statement

Edge servers are resource-limited but smart‑city apps want low-latency LLM inference. How do you split an LLM across heterogeneous edge servers and choose per-layer quantization to minimize end-to-end token-by-token completion time while keeping model performance within an error budget?

Main Contribution

Formulate joint per-layer placement and per-layer quantization on heterogeneous edge servers as an ILP (called DILEMMA) that minimizes completion time under resource and performance constraints.

Introduce a practical performance constraint using weight-difference (teacher-student) instead of brute-force metric evaluation; linearize it for the ILP.

Prove the joint optimization is NP-hard for more than one LLM, and solve the single-LLM case with an off‑the‑shelf solver (PuLP).

Empirically evaluate on OPT-350m + SQuAD to show large bit reductions (down to ~12.5% quantization ratio) with minor loss/perplexity changes and study sensitivity to communication and CPU speed.
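To make the joint decision concrete, here is a minimal sketch on a toy instance. It is not the paper's ILP: exhaustive search stands in for the solver, and the cost model is an assumption (compute time scales linearly with bit width, one uniform link bandwidth, and integer "error units" per bit width as a stand-in for the weight-difference budget). All names and numbers are invented for illustration.

```python
from itertools import product

FULL_BITS = 16                 # assumed baseline precision
BIT_CHOICES = (4, 8, 16)       # assumed candidate bit widths
# Hypothetical integer error units per bit width; a stand-in for the
# weight-difference budget, not values from the paper.
ERROR_UNITS = {4: 4, 8: 1, 16: 0}

# Toy instance (all numbers invented).
LAYERS = [
    {"flops": 1_000_000_000, "params": 1_000_000, "out_bytes": 2_000_000},
    {"flops": 1_000_000_000, "params": 1_000_000, "out_bytes": 1_000_000},
    {"flops": 1_000_000_000, "params": 1_000_000, "out_bytes": 1_000_000},
]
SERVERS = [
    {"speed": 10_000_000_000, "mem": 2_000_000},   # fast, small memory
    {"speed": 5_000_000_000, "mem": 10_000_000},   # slow, large memory
]
LINK_BW = 100_000_000          # bytes per second between any two servers
N_TOKENS = 10                  # n generated tokens -> n forward passes
ERROR_BUDGET = 5


def completion_time(placement, bits, layers, servers, link_bw, n_tokens):
    """Total time: per-token compute plus cross-server transfers, times n."""
    per_token = 0.0
    for i, (srv, b) in enumerate(zip(placement, bits)):
        per_token += layers[i]["flops"] * (b / FULL_BITS) / servers[srv]["speed"]
        # Pay a transfer only when the next layer lives on another server.
        if i + 1 < len(placement) and placement[i + 1] != srv:
            per_token += layers[i]["out_bytes"] / link_bw
    return n_tokens * per_token


def feasible(placement, bits, layers, servers, budget):
    """Check per-server weight memory and the total error budget."""
    used = [0.0] * len(servers)
    for i, (srv, b) in enumerate(zip(placement, bits)):
        used[srv] += layers[i]["params"] * b / 8   # bytes of weights
    if any(u > servers[s]["mem"] for s, u in enumerate(used)):
        return False
    return sum(ERROR_UNITS[b] for b in bits) <= budget


def solve(layers, servers, link_bw, n_tokens, budget):
    """Enumerate every (placement, bit width) pair; return the fastest feasible."""
    best = None
    n = len(layers)
    for placement in product(range(len(servers)), repeat=n):
        for bits in product(BIT_CHOICES, repeat=n):
            if not feasible(placement, bits, layers, servers, budget):
                continue
            t = completion_time(placement, bits, layers, servers, link_bw, n_tokens)
            if best is None or t < best[0]:
                best = (t, placement, bits)
    return best


if __name__ == "__main__":
    print(solve(LAYERS, SERVERS, LINK_BW, N_TOKENS, ERROR_BUDGET))
```

On this toy instance the search keeps the first two layers on the fast server at 8 bits and sends the third to the larger server. The search space grows as (servers × bit choices) to the power of the layer count, so a real model needs an ILP solver such as PuLP rather than enumeration.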

Key Findings

DILEMMA can reduce total parameter-bit usage to about 12.5% of the original (i.e., ~87.5% reduction).

Numbers: quantization ratio = 12.50% (table rows δ = 0.01, 0.1, 1.0)

Model quality changes very little after quantization at evaluated settings.

Numbers: loss 0.0591 → 0.0605; perplexity 1.0609 → 1.0623

Joint placement+quantization is NP-hard when serving multiple LLMs.

Numbers: theorem: NP-hard for >1 LLM (reduction from job-shop scheduling)

Results

Loss (original)

Value: 0.0591

Loss (quantized, aggressive / 12.5% ratio)

Value: 0.0605

Baseline: original 0.0591

Perplexity (original → quantized)

Value: 1.0609 → 1.0623

Baseline: original 1.0609

Quantization ratio (remaining bits)

Value: 12.50%

Baseline: 100% (original precision)

Who Should Care

What To Try In 7 Days

Profile an LLM (per-layer FLOPS and output tensor sizes) to estimate weight storage and per-layer transfer sizes.

Run a small ILP (PuLP) for a single-model split across local edge machines to see latency vs storage trade-offs.

Try per-layer truncation quantization (4–8 bits) and measure loss/perplexity on a representative dev set.
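For the quantization step, a rough starting point is sketched below. It shows one simple reading of truncation quantization (a signed fixed-point grid with codes truncated toward zero) and a mean squared weight difference as the proxy. The paper's exact scheme is not given in this summary, so both function names and the scheme itself are assumptions. Measuring loss or perplexity is left out because it needs the model and a dev set.

```python
import math

# Hypothetical weights standing in for one layer's parameters.
SAMPLE_WEIGHTS = [0.5, -0.25, 0.1, -1.0]


def truncate_quantize(weights, bits):
    """Map weights to a signed `bits`-bit grid, truncating codes toward zero.

    ASSUMPTION: symmetric per-layer scaling by the largest absolute weight.
    """
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed grid")
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1          # largest positive integer code
    quantized = []
    for w in weights:
        code = math.trunc(w / max_abs * levels)   # drop the fractional part
        quantized.append(code * max_abs / levels)
    return quantized


def mean_squared_difference(original, quantized):
    """Weight-difference proxy between the original and quantized weights."""
    return sum((o - q) ** 2 for o, q in zip(original, quantized)) / len(original)


if __name__ == "__main__":
    for bits in (4, 8):
        q = truncate_quantize(SAMPLE_WEIGHTS, bits)
        diff = mean_squared_difference(SAMPLE_WEIGHTS, q)
        print(f"{bits}-bit: weight difference = {diff:.6f}")
```

Running this per layer gives a quick ranking of which layers tolerate fewer bits; the ranking still needs to be confirmed against loss and perplexity on a representative dev set.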

Agent Features

Memory

  • stores past attention cache (assumed for autoregressive speed model)

Tool Use

  • Integer Linear Programming (placement solver)
  • Knowledge Distillation (teacher-student weight supervision)

Frameworks

  • Python PuLP

Architectures

  • two-tier (edge + cloud)

Optimization Features

Token Efficiency

  • models autoregressive token rounds (n tokens → n passes considered)

Infra Optimization

  • account for device-to-device link speeds
  • edge CPU clock speed sensitivity

Model Optimization

  • layer-wise quantization
  • truncation quantization (per-layer)
  • knowledge-distillation guided quantization

System Optimization

  • joint placement + quantization ILP
  • resource-aware (comm/compute/storage) constraints

Inference Optimization

  • distributed layer placement
  • minimize token-by-token completion time

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluated only on OPT-350m and SQuAD; generalization to larger LLMs and tasks is untested.
  • Solver approach targets single-LLM setups; theorem shows multi-LLM case is NP-hard.
  • Performance proxy is weight-difference (teacher-student) rather than end-task metrics for every layer, which may miss some behavioral changes.
  • Network model assumes static link speeds and stored attention caches.

When Not To Use

  • When you must serve multiple LLMs concurrently (scales poorly; NP-hard).
  • When network links or workloads change rapidly (no online/reactive placement shown).
  • When you need guaranteed per-metric fidelity across many evaluation metrics without fine-grained validation.

Failure Modes

  • ILP becomes intractable as number of models or servers grows.
  • Quantization guided by weight difference may not reflect downstream task degradation on other datasets.
  • Placement decisions sensitive to link speed and CPU clock; poor estimates hurt latency.
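The link-speed sensitivity in the last item can be illustrated with simple arithmetic. The sketch below compares keeping every layer on one server against splitting across two, and computes the bandwidth below which the split stops paying off. The timings and byte counts are invented, and the two-option setup is a simplification, not the paper's model.

```python
import math


def per_token_latency(compute_s, transfer_bytes, link_bw):
    """Per-token latency: compute time plus one activation transfer."""
    return compute_s + transfer_bytes / link_bw


def break_even_bandwidth(single_compute_s, split_compute_s, transfer_bytes):
    """Link speed (bytes/s) at which splitting matches the single-server time."""
    saved = single_compute_s - split_compute_s
    if saved <= 0:
        return math.inf   # splitting never pays off
    return transfer_bytes / saved


if __name__ == "__main__":
    # Hypothetical numbers: 0.30 s on one slow server, 0.20 s of compute
    # when split, and 1 MB of activations crossing the link per token.
    single_s, split_s, payload = 0.30, 0.20, 1_000_000
    print(f"break-even bandwidth: {break_even_bandwidth(single_s, split_s, payload):.0f} B/s")
    for label, bw in (("estimated", 20_000_000), ("actual", 5_000_000)):
        latency = per_token_latency(split_s, payload, bw)
        print(f"{label} link {bw} B/s -> split latency {latency:.2f} s")
```

With the estimated link speed the split looks faster than the single-server option, but at the lower actual speed it is slower, which is the failure mode described above.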

Core Entities

Models

  • OPT-350m

Metrics

  • loss
  • perplexity
  • BLEU
  • quantization ratio
  • completion time

Datasets

  • SQuAD