LLM-driven multi-agent system cuts multimodal edge inference latency >80% and boosts fairness to 0.90

February 6, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

Implemented on a working 7-VM testbed with measurable gains, but it relies on external LLM APIs and was evaluated only at modest scale; results are strong for the practical edge setups tested.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Haiyuan Li, Hari Madhukumar, Shuangyi Yan, Yulei Wu, Dimitra Simeonidou

Why It Matters For Business

You can run multimodal GenAI at the network edge with much lower user latency and fairer access across services, reducing cloud costs and improving user experience without retraining models.

Summary TLDR

This paper presents a two-tier LLM-powered agent system that schedules prompts and manages model deployment across edge servers to optimize latency and fairness for multimodal large-model inference. Tested on a 7-VM city-scale testbed with 4 multimodal models, the system (MA) cuts global mean end-to-end latency to 74 s (over 80% reduction vs naive baselines) and raises normalized Jain fairness to 0.90 (from 0.51). The planner uses episodic memory and natural-language prompts, so it adapts quickly without fine-tuning; planning runs on GPT-5, and short-term agents use GPT-4 APIs.

Problem Statement

Centralized GenAI inference incurs high latency and bandwidth costs, raises privacy concerns, and limits customization. Running diverse multimodal large models at the mobile edge can mitigate these problems but introduces scheduling and resource challenges: heterogeneous model resource needs, bursty and varied prompts, limited memory/GPU on nodes, and delayed feedback that breaks conventional controllers. The goal is low end-to-end latency plus fair service across model types under real-world constraints.

Main Contribution

Formulate a joint optimization that balances end-to-end latency and inter-model fairness for multimodal LM inference at the edge.

Design a two-tier LLM-based agent framework: a long-horizon global planning agent and short-horizon prompt-scheduling plus on-node deployment agents.
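The two-tier split can be pictured as nested loops: one planner call per epoch and one scheduler call per slot. The sketch below is illustrative only. `plan_epoch`, `schedule_slot`, the node names, the model-type labels, and the epoch/slot counts are all hypothetical stand-ins for the LLM-backed agents, not the authors' implementation.

```python
from typing import Dict, List, Tuple

Plan = Dict[str, List[str]]              # node -> model types it should host (assumed shape)
Routing = Dict[str, Dict[str, float]]    # model type -> {node: routing probability}


def plan_epoch(load: Dict[str, float]) -> Plan:
    """Stand-in for the long-horizon planner (an LLM call in the paper).

    Assumption for illustration: the least-loaded node hosts the heavy
    text-to-image model, every other node hosts a light text model.
    """
    idle = min(load, key=load.get)
    return {node: (["text-to-image"] if node == idle else ["text"]) for node in load}


def schedule_slot(plan: Plan) -> Routing:
    """Stand-in for the short-horizon scheduler: split each model type's
    traffic evenly over the nodes the plan assigns to it."""
    hosts: Dict[str, List[str]] = {}
    for node, models in plan.items():
        for model in models:
            hosts.setdefault(model, []).append(node)
    return {m: {n: 1.0 / len(nodes) for n in nodes} for m, nodes in hosts.items()}


def run(load: Dict[str, float], epochs: int = 2, slots_per_epoch: int = 3) -> List[Tuple[int, int, Routing]]:
    """Outer loop = epochs (planner), inner loop = slots (scheduler)."""
    log = []
    for epoch in range(epochs):
        plan = plan_epoch(load)
        for slot in range(slots_per_epoch):
            log.append((epoch, slot, schedule_slot(plan)))
    return log


if __name__ == "__main__":
    for entry in run({"n1": 0.9, "n2": 0.2, "n3": 0.5}, epochs=1, slots_per_epoch=2):
        print(entry)
```

Keeping the planner outside the slot loop is what lets the expensive, slower model be called rarely while routing still reacts every slot.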

Key Findings

Agentic multi-agent system (MA) reduces global mean end-to-end latency to 74 s.

Numbers: 74 s global mean; >80% reduction vs RF baseline (>406 s)

Practical Use: Use coordinated LLM agents for routing and node control to cut user-perceived latency on edge GenAI services.

Evidence Ref: Sec VI, Fig. 5

Service fairness across model types improved from 0.51 to 0.90 (normalized Jain index).

Numbers: Jain 0.51 → 0.90

Practical Use: Adopt fairness-aware routing to avoid persistent under-service of resource-heavy models (e.g., text-to-image).

Evidence Ref: Abstract; Sec VI, Fig. 5
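For readers unfamiliar with the metric, the standard Jain index over per-model service levels x_1..x_n is (sum x_i)^2 / (n * sum x_i^2), which ranges from 1/n (one model gets everything) to 1 (perfectly even). The sketch below also rescales it to [0, 1]. That rescaling, and the choice of what x_i measures, are assumptions here; this summary does not state the paper's exact normalization.

```python
from typing import Sequence


def jain_index(x: Sequence[float]) -> float:
    """Standard Jain fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(x)
    squares = sum(v * v for v in x)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero value")
    total = sum(x)
    return total * total / (n * squares)


def normalized_jain(x: Sequence[float]) -> float:
    """Rescale the index from [1/n, 1] to [0, 1] (assumed normalization)."""
    j = jain_index(x)
    n = len(x)
    if n == 1:
        return 1.0  # a single service is trivially fair
    return (j - 1.0 / n) / (1.0 - 1.0 / n)


if __name__ == "__main__":
    # Hypothetical per-model service ratios for four model types.
    print(normalized_jain([1.0, 1.0, 1.0, 1.0]))  # even service
    print(normalized_jain([1.0, 0.0, 0.0, 0.0]))  # one model starves the rest
```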

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Global mean end-to-end latency | 74 s (MA) | >406 s (RF) | >80% reduction | overall | Fig. 5 and Sec VI | Sec VI, Fig. 5
Normalized Jain fairness index | 0.90 (MA) | 0.51 (RF) | +0.39 | overall | Abstract and Sec VI | Abstract; Sec VI, Fig. 5
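The deltas can be checked from the reported values. Because the RF baseline is given only as ">406 s", the latency figure computed below is a lower bound on the reduction; the function name is illustrative.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


# Reported figures: MA = 74 s, RF baseline > 406 s (so 406 gives a lower bound).
latency_cut = relative_reduction(406.0, 74.0)

# Reported figures: normalized Jain index 0.51 (RF) -> 0.90 (MA).
fairness_gain = 0.90 - 0.51

print(f"latency reduction >= {latency_cut:.1%}")   # 81.8%
print(f"fairness gain = {fairness_gain:+.2f}")     # +0.39
```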

What To Try In 7 Days

Instrument an existing edge cluster with Prometheus and per-node telemetry and store it in MongoDB.

Containerize one or two multimodal models and prebuild images to cut cold-start time.

Prototype a simple LLM-based scheduler that outputs JSON routing and test with a low control cadence (e.g., 30s slots).
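As a starting point for the scheduler prototype, the sketch below validates a JSON routing reply and samples a node from it. The JSON shape (`{"model": {"node": weight}}`), the function names, and the node and model labels are assumptions made for illustration; the paper's actual schema is not given in this summary.

```python
import json
import random
from typing import Dict, Set

Routing = Dict[str, Dict[str, float]]


def parse_routing(raw: str, nodes: Set[str]) -> Routing:
    """Parse and sanity-check an LLM reply of the assumed form
    {"<model>": {"<node>": <non-negative weight>, ...}, ...}.
    Weights are renormalized so each model's probabilities sum to 1."""
    reply = json.loads(raw)
    clean: Routing = {}
    for model, weights in reply.items():
        unknown = set(weights) - nodes
        if unknown:
            raise ValueError(f"{model}: unknown nodes {sorted(unknown)}")
        if any(w < 0 for w in weights.values()):
            raise ValueError(f"{model}: negative weight")
        total = sum(weights.values())
        if total <= 0:
            raise ValueError(f"{model}: weights sum to zero")
        clean[model] = {n: w / total for n, w in weights.items()}
    return clean


def pick_node(routing: Routing, model: str, rng: random.Random) -> str:
    """Probabilistic routing: sample one node according to the model's weights."""
    probs = routing[model]
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


if __name__ == "__main__":
    reply = '{"blip": {"n1": 3, "n2": 1}}'  # hypothetical scheduler output
    routing = parse_routing(reply, {"n1", "n2"})
    print(routing, pick_node(routing, "blip", random.Random(0)))
```

Rejecting malformed replies before they reach the data path matters more than the sampling itself: an LLM can return a node that does not exist, and a slot-level loop should fall back to the previous routing rather than crash.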

Agent Features

Memory: episodic case memory for in-context learning; historical telemetry summaries

Planning: long-horizon global planning agent (epoch-level); short-term prompt scheduling agent (slot-level)

Tool Use: LLM APIs for reasoning (GPT-5 planner, GPT-4 controllers); Kubernetes API for pod lifecycle; FastAPI for control plane

Frameworks: LLM-based decision prompts; Lyapunov DPP baseline for comparison

Is Agentic: Yes

Architectures: two-tier hierarchical agents; central planner + distributed controllers

Collaboration: probabilistic routing coordination across MEC nodes; node-role intents to avoid incompatible deployments
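Episodic case memory for in-context learning can be approximated by retrieving the past cases closest to the current telemetry and pasting them into the prompt. The sketch below assumes a simple case format (telemetry snapshot, decision label, observed latency) and Euclidean distance; both are illustrative choices, not details taken from the paper.

```python
import math
from typing import Dict, List, Tuple

# (telemetry snapshot, decision taken, observed mean latency in seconds): assumed format
Case = Tuple[Dict[str, float], str, float]


def distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Euclidean distance over the union of telemetry keys (missing = 0)."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def retrieve(memory: List[Case], current: Dict[str, float], k: int = 2) -> List[Case]:
    """Return the k stored cases whose telemetry is closest to `current`."""
    return sorted(memory, key=lambda case: distance(case[0], current))[:k]


def build_prompt(memory: List[Case], current: Dict[str, float], k: int = 2) -> str:
    """Assemble a natural-language prompt with the retrieved cases as examples."""
    lines = ["Past cases most similar to the current state:"]
    for state, decision, latency in retrieve(memory, current, k):
        lines.append(f"- state={state} decision={decision} latency={latency:.0f}s")
    lines.append(f"Current state: {current}")
    lines.append("Reply with a JSON routing decision.")
    return "\n".join(lines)


if __name__ == "__main__":
    memory = [
        ({"queue": 10.0}, "A", 80.0),
        ({"queue": 1.0}, "B", 60.0),
        ({"queue": 9.0}, "C", 75.0),
    ]
    print(build_prompt(memory, {"queue": 9.2}))
```

Because adaptation happens through the prompt contents, new cases take effect on the next call with no fine-tuning step.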

Optimization Features

Infra Optimization: OpenStack + Kubernetes orchestration; Prometheus-based monitoring and local MongoDB caching

Model Optimization: prebuilt container images to reduce initialization time; low_cpu_mem_usage flag (Hugging Face) to reduce memory

System Optimization: joint latency–fairness objective; node-role intents to specialize servers; churn penalties to limit frequent reconfiguration

Inference Optimization: probabilistic prompt routing to balance load; on-demand model activation/deactivation; time-sliced GPU vGPUs and device_map parallelism
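The two Hugging Face options named above are keyword arguments to `from_pretrained` in the Transformers library. The sketch below shows one plausible way to pass them; the helper names and the `"gpt2"` default are illustrative, and the exact settings used in the paper are not given in this summary. Both options require the `accelerate` package to be installed.

```python
from typing import Any, Dict


def loading_kwargs(shard_across_devices: bool = True) -> Dict[str, Any]:
    """Keyword arguments for `from_pretrained` (illustrative helper).

    low_cpu_mem_usage=True avoids materializing a second full copy of the
    weights in CPU RAM while loading; device_map="auto" lets Accelerate
    place layers across the available devices.
    """
    kwargs: Dict[str, Any] = {"low_cpu_mem_usage": True}
    if shard_across_devices:
        kwargs["device_map"] = "auto"
    return kwargs


def load_model(name: str = "gpt2", **overrides: Any):
    """Load a causal LM with the options above (needs transformers + accelerate)."""
    # Imported lazily so loading_kwargs() can be used without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(name, **{**loading_kwargs(), **overrides})


if __name__ == "__main__":
    print(loading_kwargs())
```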

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Relies on external LLM APIs (GPT-4, GPT-5), which add cost and potential latency.

Pod activation/termination is slow and can interrupt in-flight requests.

When Not To Use

When you require fully air-gapped, on-device orchestration without external LLM access.

For ultra-low-latency control at microsecond timescales, where the 30 s slot cadence is far too coarse.

Failure Modes

Planner and local controllers misalign, producing inconsistent deployments and degraded fairness.

Pod shutdowns leave pods stuck in a pending state and trigger cascading resource contention.
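Planner/controller misalignment is easier to catch if each slot compares what is actually running against the planner's node-role intents. The check below is a minimal sketch; the data shapes, function name, and node and model labels are hypothetical.

```python
from typing import Dict, List, Set


def find_misaligned(intents: Dict[str, Set[str]], running: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Per node, list models that are running but not allowed by the planner's intent.

    `intents` maps node -> model types the planner permits there;
    `running` maps node -> model types the local controller has deployed.
    """
    issues: Dict[str, List[str]] = {}
    for node, models in running.items():
        extra = models - intents.get(node, set())
        if extra:
            issues[node] = sorted(extra)
    return issues


if __name__ == "__main__":
    intents = {"n1": {"text"}, "n2": {"text-to-image"}}
    running = {"n1": {"text", "text-to-image"}, "n2": {"text-to-image"}}
    print(find_misaligned(intents, running))
```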

Core Entities

Models

GPT2 (137M), GPT2-large (812M), BLIP (470M), Stable Diffusion (890M), GPT-4 (API), GPT-5 (API)

Metrics

end-to-end latency, normalized Jain fairness index, success (service) ratio, composite latency-fairness objective
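One way to picture a composite latency-fairness objective is a weighted sum of normalized latency, unfairness, and a churn penalty. The form below, its weights, and the reconfiguration count in the example are assumptions for illustration; this summary does not give the paper's actual formulation.

```python
def composite_cost(
    mean_latency_s: float,
    fairness: float,
    reconfigurations: int,
    *,
    latency_scale_s: float,
    fairness_weight: float = 1.0,
    churn_weight: float = 0.1,
) -> float:
    """Lower is better. Hypothetical form:
    latency / scale + w_f * (1 - fairness) + w_c * reconfigurations."""
    if latency_scale_s <= 0:
        raise ValueError("latency_scale_s must be positive")
    if not 0.0 <= fairness <= 1.0:
        raise ValueError("fairness must lie in [0, 1]")
    return (
        mean_latency_s / latency_scale_s
        + fairness_weight * (1.0 - fairness)
        + churn_weight * reconfigurations
    )


if __name__ == "__main__":
    # Reported latency and fairness values; the reconfiguration counts are made up.
    ma = composite_cost(74.0, 0.90, 2, latency_scale_s=406.0)
    rf = composite_cost(406.0, 0.51, 0, latency_scale_s=406.0)
    print(f"MA={ma:.3f} RF={rf:.3f}")
```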

Context Entities

Models

time-sliced GPU vGPUs, Hugging Face device_map model parallelism

Metrics

queue backlogs, GPU/CPU headroom, pod initialization time