Overview
The system is implemented on a working 7-VM testbed and shows measurable gains, but it relies on external LLM APIs and was evaluated only at modest scale; the results are strong for the practical edge setups tested.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
You can run multimodal GenAI at the network edge with much lower user latency and fairer access across services, reducing cloud costs and improving user experience without retraining models.
Who Should Care
Summary TLDR
This paper presents a two-tier LLM-powered agent system that schedules prompts and manages model deployment across edge servers to optimize latency and fairness for multimodal large-model inference. Tested on a 7-VM city-scale testbed with 4 multimodal models, the system (MA) cuts global mean end-to-end latency to 74 s (over 80% reduction vs naive baselines) and raises normalized Jain fairness to 0.90 (from 0.51). The planner uses episodic memory and natural-language prompts, so it adapts quickly without fine-tuning; planning runs on GPT-5, and short-term agents use GPT-4 APIs.
Problem Statement
Centralized GenAI inference suffers from high latency, high bandwidth costs, privacy risks, and limited customization. Running diverse multimodal large models at the mobile edge can address these problems but introduces scheduling and resource challenges: heterogeneous model resource needs, bursty and varied prompts, limited memory and GPU capacity on nodes, and delayed feedback that breaks conventional controllers. The goal is low end-to-end latency plus fair service across model types under real-world constraints.
Main Contribution
Formulate a joint optimization that balances end-to-end latency and inter-model fairness for multimodal LM inference at the edge.
Design a two-tier LLM-based agent framework: a long-horizon global planning agent and short-horizon prompt-scheduling plus on-node deployment agents.
Key Findings
Agentic multi-agent system (MA) reduces global mean end-to-end latency to 74 s.
Service fairness across model types improved from 0.51 to 0.90 (normalized Jain index).
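For readers unfamiliar with the fairness metric, here is a minimal sketch of Jain's index and one common way to normalize it to the range [0, 1]. The standard index is (sum x)^2 / (n * sum x^2). The normalization shown, and the choice of which per-model quantity to feed in, are assumptions; this summary does not state how the paper defines either.

```python
from typing import Sequence


def jain_index(values: Sequence[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in [1/n, 1]."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    total = sum(values)
    squares = sum(v * v for v in values)
    if squares == 0:
        return 1.0  # all-zero allocation: treated as perfectly even
    return (total * total) / (n * squares)


def normalized_jain_index(values: Sequence[float]) -> float:
    """Rescale Jain's index from [1/n, 1] to [0, 1].

    Assumed normalization; the paper's exact definition may differ.
    """
    n = len(values)
    if n == 1:
        return 1.0
    return (jain_index(values) - 1.0 / n) / (1.0 - 1.0 / n)
```

With four model types, an allocation that serves only one of them scores 0 after normalization, and a perfectly even allocation scores 1.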
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Global mean end-to-end latency | 74 s (MA) | RF >406 s | >80% reduction | overall | Fig.5 and Sec VI | Sec VI, Fig.5 |
| Normalized Jain fairness index | 0.90 (MA) | 0.51 (RF) | ↑0.39 | overall | Abstract and Sec VI | Abstract; Sec VI, Fig.5 |
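The deltas in the table can be checked with a few lines of arithmetic. Because the RF baseline is reported as a lower bound (more than 406 s), the computed latency reduction is also a lower bound. The variable names are illustrative.

```python
ma_latency_s = 74.0    # MA, from the table
rf_latency_s = 406.0   # RF lower bound ("> 406 s"), from the table

# RF is at least 406 s, so this is a lower bound on the reduction.
min_latency_reduction = 1.0 - ma_latency_s / rf_latency_s

ma_fairness = 0.90
rf_fairness = 0.51
fairness_gain = ma_fairness - rf_fairness

print(f"latency reduction >= {min_latency_reduction:.1%}")  # 81.8%
print(f"fairness gain = {fairness_gain:+.2f}")              # +0.39
```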
What To Try In 7 Days
Instrument an existing edge cluster with Prometheus and per-node telemetry and store it in MongoDB.
Containerize one or two multimodal models and prebuild images to cut cold-start time.
Prototype a simple LLM-based scheduler that outputs JSON routing and test with a low control cadence (e.g., 30s slots).
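The scheduler prototype in the last item could start from something like the sketch below. It is not the paper's implementation: the prompt wording, the JSON schema (`{"routes": {request_id: node}}`), and all function names are hypothetical, and the LLM call is passed in as a plain callable (`ask_llm`) so that no specific API is assumed. The key idea is to validate the model's JSON and fall back to a simple rule when the reply is malformed or routes a request to a node that does not host its model.

```python
import json
from typing import Callable, Dict, List

SLOT_SECONDS = 30  # control cadence suggested in the text


def build_prompt(pending: List[dict], nodes: Dict[str, List[str]]) -> str:
    """Describe the slot state and ask for a JSON routing decision."""
    return (
        "Assign each request to one node that hosts its model.\n"
        f"Nodes and hosted models: {json.dumps(nodes)}\n"
        f"Pending requests: {json.dumps(pending)}\n"
        'Reply with JSON only, e.g. {"routes": {"<request_id>": "<node>"}}'
    )


def parse_routing(reply: str, pending: List[dict],
                  nodes: Dict[str, List[str]]) -> Dict[str, str]:
    """Keep only routes naming a known request and a node hosting its model."""
    try:
        routes = json.loads(reply).get("routes", {})
    except (json.JSONDecodeError, AttributeError):
        return {}
    if not isinstance(routes, dict):
        return {}
    model_of = {r["id"]: r["model"] for r in pending}
    valid = {}
    for req_id, node in routes.items():
        if (req_id in model_of and isinstance(node, str)
                and model_of[req_id] in nodes.get(node, [])):
            valid[req_id] = node
    return valid


def schedule_slot(pending: List[dict], nodes: Dict[str, List[str]],
                  ask_llm: Callable[[str], str]) -> Dict[str, str]:
    """Route one slot's requests; fall back to the first node hosting the model."""
    routes = parse_routing(ask_llm(build_prompt(pending, nodes)), pending, nodes)
    for req in pending:
        if req["id"] not in routes:
            for node, models in nodes.items():
                if req["model"] in models:
                    routes[req["id"]] = node
                    break
    return routes
```

In a test loop, `schedule_slot` would be called once every `SLOT_SECONDS` with the requests that arrived during the previous slot, and `ask_llm` would wrap whichever LLM client is in use.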
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Infra Optimization
Model Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Relies on external LLM APIs (GPT-4, GPT-5) which add cost and potential latency.
Pod activation/termination is slow and can interrupt in-flight requests.
When Not To Use
When you require fully air-gapped, on-device orchestration without external LLM access.
When you need ultra-low-latency, microsecond-scale control, for which a 30 s slot cadence is too coarse.
Failure Modes
Planner and local controllers misalign, producing inconsistent deployments and degraded fairness.
Pod shutdowns leave pods stuck in a pending state and trigger cascading resource contention.

