LLM-driven multi-agent system cuts multimodal edge inference latency >80% and boosts fairness to 0.90

Overview

Decision Snapshot: Needs Validation

Implemented on a working 7-VM testbed with measurable gains, but it relies on external LLM APIs and a modest-scale deployment; results are strong for the practical edge setups tested.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Haiyuan Li, Hari Madhukumar, Shuangyi Yan, Yulei Wu, Dimitra Simeonidou

Why It Matters For Business

You can run multimodal GenAI at the network edge with much lower user latency and fairer access across services, reducing cloud costs and improving user experience without retraining models.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper presents a two-tier LLM-powered agent system that schedules prompts and manages model deployment across edge servers to optimize latency and fairness for multimodal large-model inference. Tested on a 7-VM city-scale testbed with 4 multimodal models, the system (MA) cuts global mean end-to-end latency to 74 s (over 80% reduction vs naive baselines) and raises normalized Jain fairness to 0.90 (from 0.51). The planner uses episodic memory and natural-language prompts, so it adapts quickly without fine-tuning; planning runs on GPT-5, and short-term agents use GPT-4 APIs.

Problem Statement

Centralized GenAI inference brings high latency, high bandwidth costs, privacy risks, and limited customization. Running diverse multimodal large models at the mobile edge can address these problems but introduces scheduling and resource challenges: heterogeneous model resource needs, bursty and varied prompts, limited memory/GPU on nodes, and delayed feedback that breaks conventional controllers. The goal is low end-to-end latency plus fair service across model types under real-world constraints.

Main Contribution

Formulate a joint optimization that balances end-to-end latency and inter-model fairness for multimodal LM inference at the edge.

Design a two-tier LLM-based agent framework: a long-horizon global planning agent and short-horizon prompt-scheduling plus on-node deployment agents.

Key Findings

Agentic multi-agent system (MA) reduces global mean end-to-end latency to 74 s.

Numbers: 74 s global mean; >80% reduction vs RF baseline (>406 s)

Practical Use: Use coordinated LLM agents for routing and node control to cut user-perceived latency on edge GenAI services.

Evidence Ref: Sec VI, Fig.5

Service fairness across model types improved from 0.51 to 0.90 (normalized Jain index).

Numbers: Jain 0.51 → 0.90

Practical Use: Adopt fairness-aware routing to avoid persistent under-service of resource-heavy models (e.g., text-to-image).

Evidence Ref: Abstract; Sec VI, Fig.5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Global mean end-to-end latency	74 s (MA)	RF >406 s	>80% reduction	overall	Fig.5 and Sec VI	Sec VI, Fig.5
Normalized Jain fairness index	0.90 (MA)	0.51 (RF)	↑0.39	overall	Abstract and Sec VI	Abstract; Sec VI, Fig.5
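
The two headline numbers can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the classic Jain index and one common rescaling to [0, 1], which is an assumption, since this summary does not state the paper's exact normalization. All function names are hypothetical.

```python
from typing import Sequence


def jain_index(x: Sequence[float]) -> float:
    """Classic Jain index: (sum x)^2 / (n * sum x^2), which lies in [1/n, 1]."""
    n = len(x)
    if n == 0:
        raise ValueError("need at least one value")
    total = sum(x)
    squares = sum(v * v for v in x)
    if squares == 0:
        return 1.0  # an all-zero allocation is treated as perfectly equal
    return (total * total) / (n * squares)


def normalized_jain(x: Sequence[float]) -> float:
    """Rescale the Jain index from [1/n, 1] to [0, 1] (assumed normalization)."""
    n = len(x)
    if n == 1:
        return 1.0
    return (jain_index(x) - 1.0 / n) / (1.0 - 1.0 / n)


def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline
```

Treating 406 s as a lower bound on the RF baseline, relative_reduction(406, 74) is about 0.818, which is consistent with the reported >80% reduction.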

What To Try In 7 Days

Instrument an existing edge cluster with Prometheus and per-node telemetry and store it in MongoDB.

Containerize one or two multimodal models and prebuild images to cut cold-start time.

Prototype a simple LLM-based scheduler that outputs JSON routing and test with a low control cadence (e.g., 30s slots).
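
For the scheduler prototype, most of the early work is pinning down a strict contract for the LLM's JSON reply. The sketch below assumes a hypothetical schema, {"routing": {model: {node: weight}}}, that is not taken from the paper. It validates a reply, normalizes the weights into per-model probabilities, and samples a node for each prompt. The LLM call itself is left out, and the function names are illustrative.

```python
import json
import random
from typing import Dict, Sequence

RoutingTable = Dict[str, Dict[str, float]]


def parse_routing(raw: str, nodes: Sequence[str], models: Sequence[str]) -> RoutingTable:
    """Turn an LLM reply (assumed schema) into per-model routing probabilities."""
    routing = json.loads(raw)["routing"]
    table: RoutingTable = {}
    for model in models:
        weights = routing.get(model, {})
        # Keep only known nodes with strictly positive weights.
        clean = {n: float(w) for n, w in weights.items()
                 if n in nodes and float(w) > 0}
        if not clean:
            # Fallback when the reply omits a model: route uniformly.
            clean = {n: 1.0 for n in sorted(nodes)}
        total = sum(clean.values())
        table[model] = {n: w / total for n, w in clean.items()}
    return table


def pick_node(table: RoutingTable, model: str, rng: random.Random) -> str:
    """Sample a target node for one prompt according to the routing table."""
    candidates = list(table[model])
    weights = [table[model][n] for n in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

If the reply is not valid JSON or lacks the "routing" key, parse_routing raises; a caller could then keep the previous slot's table rather than act on a malformed plan.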

Agent Features

Memory

episodic case memory for in-context learning

historical telemetry summaries

Planning

long-horizon global planning agent (epoch-level)

short-term prompt scheduling agent (slot-level)

Tool Use

LLM APIs for reasoning (GPT-5 planner, GPT-4 controllers)

Kubernetes API for pod lifecycle

FastAPI for control plane

Frameworks

LLM-based decision prompts

Lyapunov DPP baseline for comparison

Is Agentic

Yes

Architectures

two-tier hierarchical agents

central planner + distributed controllers

Collaboration

probabilistic routing coordination across MEC nodes

node-role intents to avoid incompatible deployments

Optimization Features

Infra Optimization

OpenStack + Kubernetes orchestration

Prometheus-based monitoring and local MongoDB caching

Model Optimization

prebuilt container images to reduce initialization time

low_cpu_mem_usage flag (Hugging Face) to reduce memory
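
As a rough illustration of the memory flag, the sketch below passes low_cpu_mem_usage=True to the Hugging Face Transformers from_pretrained call. The model id "gpt2-large", the helper names, and the optional device_map argument are assumptions for illustration; the paper's actual loading code is not shown in this summary.

```python
from typing import Any, Dict, Optional


def load_kwargs(device_map: Optional[str] = None) -> Dict[str, Any]:
    """Build from_pretrained keyword arguments that reduce peak CPU memory."""
    kwargs: Dict[str, Any] = {"low_cpu_mem_usage": True}
    if device_map is not None:
        # e.g. "auto" to let Accelerate place layers across available devices
        kwargs["device_map"] = device_map
    return kwargs


def load_causal_lm(model_id: str = "gpt2-large", device_map: Optional[str] = None):
    """Load a causal LM; needs the `transformers` and `accelerate` packages."""
    # Imported lazily so the helper above works without the heavy dependency.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs(device_map))
```

Calling load_causal_lm() downloads the weights on first use, so in a containerized setup it would usually run at image build time rather than at pod start.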

System Optimization

joint latency–fairness objective

node-role intents to specialize servers

churn penalties to limit frequent reconfiguration
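
One way to picture how a joint objective and a churn penalty interact is a single scalar cost for a candidate deployment plan. The sketch below is an assumption, not the paper's formulation: the additive form, the weights, the latency scale, and the plan representation (node name mapped to a set of deployed models) are all hypothetical.

```python
from typing import Dict, Set

Plan = Dict[str, Set[str]]  # node name -> models deployed on that node


def churn(prev: Plan, new: Plan) -> int:
    """Count model activations plus deactivations needed to go from prev to new."""
    nodes = set(prev) | set(new)
    return sum(len(prev.get(n, set()) ^ new.get(n, set())) for n in nodes)


def plan_cost(mean_latency_s: float, fairness: float, prev: Plan, new: Plan,
              latency_scale_s: float = 100.0, fairness_weight: float = 1.0,
              churn_weight: float = 0.05) -> float:
    """Lower is better: scaled latency + unfairness + reconfiguration penalty."""
    return (mean_latency_s / latency_scale_s
            + fairness_weight * (1.0 - fairness)
            + churn_weight * churn(prev, new))
```

With these illustrative weights, a plan that keeps the current deployment at 74 s and fairness 0.90 costs 0.84, while reaching the same metrics by moving one model between nodes costs 0.94, so the planner would only reconfigure when the expected gain outweighs the penalty.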

Inference Optimization

probabilistic prompt routing to balance load

on-demand model activation/deactivation

time-sliced GPU vGPUs and device_map parallelism

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Relies on external LLM APIs (GPT-4, GPT-5) which add cost and potential latency.

Pod activation/termination is slow and can interrupt in-flight requests.

When Not To Use

When you require fully air-gapped, on-device orchestration without external LLM access.

For ultra-low-latency, microsecond-scale control, where a 30 s slot cadence is too coarse.

Failure Modes

Planner and local controllers misalign, producing inconsistent deployments and degraded fairness.

Pod shutdowns cause stuck pending pods and cascading resource contention.

Core Entities

Models

GPT2 (137M)

GPT2-large (812M)

BLIP (470M)

Stable Diffusion (890M)

GPT-4 (API)

GPT-5 (API)

Metrics

end-to-end latency

normalized Jain fairness index

success (service) ratio

composite latency-fairness objective

