LLM-driven multi-agent system cuts multimodal edge inference latency >80% and boosts fairness to 0.90

February 6, 2026 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Haiyuan Li, Hari Madhukumar, Shuangyi Yan, Yulei Wu, Dimitra Simeonidou

Why It Matters For Business

You can run multimodal GenAI at the network edge with much lower user latency and fairer access across services, reducing cloud costs and improving user experience without retraining models.

Summary TLDR

This paper presents a two-tier LLM-powered agent system that schedules prompts and manages model deployment across edge servers to optimize latency and fairness for multimodal large-model inference. Tested on a 7-VM city-scale testbed with 4 multimodal models, the system (MA) cuts global mean end-to-end latency to 74 s (over 80% reduction vs naive baselines) and raises normalized Jain fairness to 0.90 (from 0.51). The planner uses episodic memory and natural-language prompts, so it adapts quickly without fine-tuning; planning runs on GPT-5, and short-term agents use GPT-4 APIs.

Problem Statement

Centralized GenAI inference incurs high latency and bandwidth costs, raises privacy concerns, and limits customization. Running diverse multimodal large models at the mobile edge can address these issues but introduces scheduling and resource challenges: heterogeneous model resource needs, bursty and varied prompts, limited memory and GPU capacity on nodes, and delayed feedback that breaks conventional controllers. The goal is low end-to-end latency plus fair service across model types under real-world constraints.

Main Contribution

Formulate a joint optimization that balances end-to-end latency and inter-model fairness for multimodal LM inference at the edge.

Design a two-tier LLM-based agent framework: a long-horizon global planning agent and short-horizon prompt-scheduling plus on-node deployment agents.

Build a real-world 7-VM OpenStack/Kubernetes testbed that deploys four multimodal LMs and measures real latencies, queuing, and deployment costs.

Show large improvements in practice: >80% latency reduction and normalized Jain index raised to 0.90, with fast adaptation without model fine-tuning.

Key Findings

The multi-agent system (MA) reduces global mean end-to-end latency to 74 s.

Numbers: 74 s global mean; >80% reduction vs RF baseline (>406 s)

Service fairness across model types improved from 0.51 to 0.90 (normalized Jain index).

Numbers: Jain 0.51 → 0.90

MA adapts and stabilizes far faster than a hierarchical DRL baseline.

Numbers: MA 10–30 epochs (~4.17–12.5 h) vs DRL ~166.7 h (estimated)

The system operates without fine-tuning by using episodic memory and natural-language prompts.

Numbers: Planner uses episodic cases; Tier-1 runs once every 50 slots with λ=0.5
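The normalized Jain fairness index quoted in these findings can be illustrated with a short sketch. It assumes the common definition of Jain's index, rescaled from its natural range [1/n, 1] to [0, 1], and it assumes the inputs are per-model service ratios. The summary does not state the paper's exact normalization or inputs, so both are assumptions here.

```python
from typing import Sequence


def jain_index(values: Sequence[float]) -> float:
    """Classic Jain index: (sum x)^2 / (n * sum x^2), which lies in [1/n, 1]."""
    n = len(values)
    total = sum(values)
    squares = sum(v * v for v in values)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero value")
    return total * total / (n * squares)


def normalized_jain(values: Sequence[float]) -> float:
    """Rescale the Jain index to [0, 1] (assumed normalization, not confirmed by the source)."""
    n = len(values)
    if n < 2:
        return 1.0  # a single service is trivially "fair"
    lower_bound = 1.0 / n
    return (jain_index(values) - lower_bound) / (1.0 - lower_bound)


# Hypothetical per-model service ratios for four models.
balanced = [0.9, 0.9, 0.9, 0.9]
starved = [1.0, 0.0, 0.0, 0.0]
print(normalized_jain(balanced), normalized_jain(starved))
```

Under this definition, a value of 1 means every model type receives the same level of service, and 0 means one model type receives all of it.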

Results

Global mean end-to-end latency

Value: 74 s (MA)

Baseline: RF >406 s

Normalized Jain fairness index

Value: 0.90 (MA)

Baseline: 0.51 (RF)

Adaptation / convergence time

Value: 10–30 epochs (~4.17–12.5 hours) (MA)

Baseline: Hierarchical DRL, estimated ~166.7 hours
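The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes one epoch equals 50 slots (the Tier-1 cadence quoted in this summary) and that a slot lasts 30 s (the slot cadence mentioned elsewhere in this summary). Neither mapping is confirmed as the paper's definition of an epoch.

```python
RF_LATENCY_S = 406.0   # reported as a lower bound (">406 s")
MA_LATENCY_S = 74.0

SLOTS_PER_EPOCH = 50   # assumption: one epoch = one Tier-1 planning period
SLOT_SECONDS = 30      # assumption: 30 s control slots


def latency_reduction(baseline_s: float, new_s: float) -> float:
    """Fractional reduction in latency relative to the baseline."""
    return (baseline_s - new_s) / baseline_s


def epochs_to_hours(epochs: float) -> float:
    """Convert a number of epochs to wall-clock hours under the assumptions above."""
    return epochs * SLOTS_PER_EPOCH * SLOT_SECONDS / 3600.0


print(f"latency reduction: {latency_reduction(RF_LATENCY_S, MA_LATENCY_S):.1%}")
print(f"10 epochs: {epochs_to_hours(10):.2f} h, 30 epochs: {epochs_to_hours(30):.2f} h")
```

Because the baseline is reported as greater than 406 s, the computed reduction is a lower bound, which is consistent with the ">80%" claim. Under the same assumptions, 166.7 hours would correspond to roughly 400 epochs.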

Who Should Care

What To Try In 7 Days

Instrument an existing edge cluster with Prometheus and per-node telemetry and store it in MongoDB.

Containerize one or two multimodal models and prebuild images to cut cold-start time.

Prototype a simple LLM-based scheduler that outputs routing decisions as JSON, and test it at a low control cadence (e.g., 30 s slots).
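For the scheduler prototype, the fragile part is usually parsing the LLM reply. Below is a minimal sketch of a validator that turns a JSON reply into routing probabilities and falls back to a safe default when the reply is malformed. The JSON shape (`{"routing": {node: weight}}`), the function name, and the uniform fallback are all assumptions for illustration. The summary only says that malformed JSON leads to a fallback policy.

```python
import json
import math


def parse_routing(raw: str, nodes: list[str]) -> dict[str, float]:
    """Parse an LLM reply into routing probabilities over `nodes`.

    Falls back to uniform routing if the reply is not valid JSON, is missing
    a node, or contains negative / non-finite / all-zero weights.
    """
    if not nodes:
        raise ValueError("nodes must be non-empty")
    uniform = {n: 1.0 / len(nodes) for n in nodes}
    try:
        routing = json.loads(raw)["routing"]
        weights = {n: float(routing[n]) for n in nodes}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, OverflowError):
        return uniform
    total = sum(weights.values())
    valid = total > 0 and all(math.isfinite(w) and w >= 0 for w in weights.values())
    if not valid:
        return uniform
    # Normalize so the weights sum to 1 even if the LLM's numbers do not.
    return {n: w / total for n, w in weights.items()}


nodes = ["edge-a", "edge-b"]  # hypothetical node names
print(parse_routing('{"routing": {"edge-a": 3, "edge-b": 1}}', nodes))
print(parse_routing("sorry, I cannot answer that", nodes))
```

Normalizing the weights instead of rejecting replies that do not sum to 1 keeps the control loop running when the LLM's arithmetic is slightly off.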

Agent Features

Memory

  • episodic case memory for in-context learning
  • historical telemetry summaries
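One way to picture episodic case memory for in-context learning is as a nearest-neighbour lookup whose results are pasted into the planner prompt. The sketch below is illustrative only: the `Case` fields, the Euclidean distance over telemetry values, and the prompt wording are assumptions, not the paper's design.

```python
from dataclasses import dataclass


@dataclass
class Case:
    state: dict[str, float]  # telemetry summary, e.g. queue length, free GPU share
    decision: str            # what the planner did in that situation
    objective: float         # observed outcome (assumed: lower is better)


def distance(a: dict[str, float], b: dict[str, float]) -> float:
    """Euclidean distance over the union of keys; missing keys count as 0."""
    keys = a.keys() | b.keys()
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys) ** 0.5


def retrieve(memory: list[Case], state: dict[str, float], k: int = 3) -> list[Case]:
    """Return the k stored cases whose state is closest to the current one."""
    return sorted(memory, key=lambda c: distance(c.state, state))[:k]


def build_prompt(state: dict[str, float], cases: list[Case]) -> str:
    """Format retrieved cases as in-context examples for the planner LLM."""
    lines = ["Past cases (state -> decision, objective):"]
    for c in cases:
        lines.append(f"- {c.state} -> {c.decision} (objective={c.objective:.2f})")
    lines.append(f"Current state: {state}")
    lines.append("Reply with JSON only.")
    return "\n".join(lines)


memory = [
    Case({"queue": 10.0, "gpu_free": 0.2}, "shift load to edge-b", 0.40),
    Case({"queue": 1.0, "gpu_free": 0.9}, "keep current routing", 0.20),
]
current = {"queue": 9.0, "gpu_free": 0.3}
print(build_prompt(current, retrieve(memory, current, k=1)))
```

Because the cases travel in the prompt, the planner can change behaviour as new cases are stored, with no weight updates to the LLM.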

Planning

  • long-horizon global planning agent (epoch-level)
  • short-term prompt scheduling agent (slot-level)

Tool Use

  • LLM APIs for reasoning (GPT-5 planner, GPT-4 controllers)
  • Kubernetes API for pod lifecycle
  • FastAPI for control plane

Frameworks

  • LLM-based decision prompts
  • Lyapunov DPP baseline for comparison

Is Agentic

true

Architectures

  • two-tier hierarchical agents
  • central planner + distributed controllers

Collaboration

  • probabilistic routing coordination across MEC nodes
  • node-role intents to avoid incompatible deployments

Optimization Features

Infra Optimization

  • OpenStack + Kubernetes orchestration
  • Prometheus-based monitoring and local MongoDB caching

Model Optimization

  • prebuilt container images to reduce initialization time
  • low_cpu_mem_usage flag (Hugging Face) to reduce memory
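As a rough illustration of the memory-related loading options, the sketch below passes `low_cpu_mem_usage=True` and `device_map="auto"` to `from_pretrained` in Hugging Face Transformers. Both options rely on the `accelerate` package being installed. The model id and the use of `AutoModelForCausalLM` are assumptions for the example, and the paper's actual loading code is not shown in this summary.

```python
# Keyword arguments shared by every model load in this sketch.
LOAD_KWARGS = {
    "low_cpu_mem_usage": True,  # avoid materializing a second full copy of the weights in RAM
    "device_map": "auto",       # let accelerate place layers across available devices
}


def load_model(model_id: str = "gpt2-large"):
    """Load a causal LM with reduced peak CPU memory (downloads weights on first use)."""
    # Imported lazily so the module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_id, **LOAD_KWARGS)
```

Calling `load_model()` inside the container at startup, with the image already built, keeps image pull and weight loading as the main cold-start costs.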

System Optimization

  • joint latency–fairness objective
  • node-role intents to specialize servers
  • churn penalties to limit frequent reconfiguration
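To make the idea of a joint latency–fairness objective with a churn penalty concrete, here is one possible scoring function. This summary does not give the paper's formula, so everything here is an assumption: the weighted-sum form, the use of λ as the trade-off weight, the latency reference used for normalization, and the churn weight.

```python
def composite_objective(
    mean_latency_s: float,
    fairness: float,
    n_reconfigurations: int,
    lam: float = 0.5,             # assumed role of λ: latency vs fairness weight
    latency_ref_s: float = 406.0, # assumed normalizer (the RF baseline latency)
    churn_weight: float = 0.01,   # hypothetical cost per model activation/deactivation
) -> float:
    """Score a control period; lower is better under the assumptions above."""
    latency_term = mean_latency_s / latency_ref_s
    unfairness_term = 1.0 - fairness
    churn_term = churn_weight * n_reconfigurations
    return lam * latency_term + (1.0 - lam) * unfairness_term + churn_term


ma_score = composite_objective(74.0, 0.90, n_reconfigurations=0)
rf_score = composite_objective(406.0, 0.51, n_reconfigurations=0)
print(f"MA: {ma_score:.3f}  RF: {rf_score:.3f}")
```

The churn term means that a plan that only marginally improves latency or fairness but requires many pod restarts can score worse than leaving the deployment alone.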

Inference Optimization

  • probabilistic prompt routing to balance load
  • on-demand model activation/deactivation
  • time-sliced vGPUs and device_map model parallelism
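Probabilistic prompt routing can be sketched as weighted random sampling over nodes: each incoming prompt is sent to a node drawn according to the current routing probabilities. The node names and probabilities below are hypothetical, and how the paper's agents produce the probabilities is not covered here.

```python
import random
from collections import Counter


def route_prompt(probs: dict[str, float], rng: random.Random) -> str:
    """Pick a target node for one prompt, sampling in proportion to `probs`."""
    nodes = list(probs)
    weights = [probs[n] for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


rng = random.Random(0)                       # seeded for repeatable experiments
probs = {"edge-a": 0.7, "edge-b": 0.3}       # hypothetical routing probabilities
counts = Counter(route_prompt(probs, rng) for _ in range(10_000))
print(counts)
```

Sampling instead of always choosing the best-looking node spreads bursts across servers, which helps when telemetry is delayed and several prompts arrive within the same slot.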

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on external LLM APIs (GPT-4, GPT-5), which add cost and potential latency.
  • Pod activation/termination is slow and can interrupt in-flight requests.
  • Testbed is modest scale (7 VMs); behavior at very large scale is not shown.

When Not To Use

  • When you require fully air-gapped, on-device orchestration without external LLM access.
  • For control loops that need microsecond-scale reaction times, where a 30 s slot cadence is insufficient.
  • When energy minimization is the primary objective (energy not jointly optimized here).

Failure Modes

  • Planner and local controllers misalign, producing inconsistent deployments and degraded fairness.
  • Pod shutdowns leave pods stuck in a pending state and cause cascading resource contention.
  • LLM API failures or malformed JSON force a fallback to random policies.

Core Entities

Models

  • GPT2 (137M)
  • GPT2-large (812M)
  • BLIP (470M)
  • Stable Diffusion (890M)
  • GPT-4 (API)
  • GPT-5 (API)

Metrics

  • end-to-end latency
  • normalized Jain fairness index
  • success (service) ratio
  • composite latency-fairness objective

Context Entities

Models

  • time-sliced vGPUs
  • Hugging Face device_map model parallelism

Metrics

  • queue backlogs
  • GPU/CPU headroom
  • pod initialization time