Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
You can run multimodal GenAI at the network edge with much lower user latency and fairer access across services, reducing cloud costs and improving user experience without retraining models.
Summary TLDR
This paper presents a two-tier LLM-powered agent system that schedules prompts and manages model deployment across edge servers to optimize latency and fairness for multimodal large-model inference. Tested on a 7-VM city-scale testbed with 4 multimodal models, the system (MA) cuts global mean end-to-end latency to 74 s (over 80% reduction vs naive baselines) and raises normalized Jain fairness to 0.90 (from 0.51). The planner uses episodic memory and natural-language prompts, so it adapts quickly without fine-tuning; planning runs on GPT-5, and short-term agents use GPT-4 APIs.
Problem Statement
Centralized GenAI inference brings high latency, bandwidth costs, privacy risks, and limited customization. Running diverse multimodal large models at the mobile edge can address these problems but introduces scheduling and resource challenges: heterogeneous model resource needs, bursty and varied prompts, limited memory and GPU capacity on nodes, and delayed feedback that breaks conventional controllers. The goal is low end-to-end latency plus fair service across model types under real-world constraints.
Main Contribution
Formulate a joint optimization that balances end-to-end latency and inter-model fairness for multimodal LM inference at the edge.
Design a two-tier LLM-based agent framework: a long-horizon global planning agent and short-horizon prompt-scheduling plus on-node deployment agents.
Build a real-world 7-VM OpenStack/Kubernetes testbed that deploys four multimodal LMs and measures real latencies, queuing, and deployment costs.
Show large improvements in practice: >80% latency reduction and normalized Jain index raised to 0.90, with fast adaptation without model fine-tuning.
Key Findings
The multi-agent system (MA) reduces global mean end-to-end latency to 74 s.
Service fairness across model types improved from 0.51 to 0.90 (normalized Jain index).
MA adapts and stabilizes far faster than a hierarchical DRL baseline.
The system operates without fine-tuning by using episodic memory and natural-language prompts.
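The fairness figures above are reported as a normalized Jain index. A minimal sketch of that metric follows, assuming the standard Jain formula and the common rescaling from [1/n, 1] to [0, 1]; the summary does not state which normalization the paper uses, and the input here (one service level per model type) is a hypothetical choice.

```python
def jain_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), ranging over [1/n, 1]."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    total = sum(values)
    squares = sum(v * v for v in values)
    if squares == 0:
        return 1.0  # an all-zero allocation is treated as perfectly even
    return (total * total) / (n * squares)


def normalized_jain(values):
    """Rescale Jain's index to [0, 1] via (J - 1/n) / (1 - 1/n).

    Assumption: this is one common normalization, not necessarily the paper's.
    """
    n = len(values)
    if n < 2:
        return 1.0  # a single service is trivially fair
    return (jain_index(values) - 1.0 / n) / (1.0 - 1.0 / n)
```

With four model types, an even allocation gives a normalized value of 1 and an allocation that serves only one model gives 0.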
Results
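Global mean end-to-end latency: 74 s (over 80% reduction vs naive baselines)
Normalized Jain fairness index: 0.90 (up from 0.51)
Adaptation / convergence time: far faster than the hierarchical DRL baseline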
Who Should Care
What To Try In 7 Days
Instrument an existing edge cluster with Prometheus and per-node telemetry and store it in MongoDB.
Containerize one or two multimodal models and prebuild images to cut cold-start time.
Prototype a simple LLM-based scheduler that outputs JSON routing and test it with a low control cadence (e.g., 30 s slots).
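For the scheduler prototype, the LLM reply should be validated before it is acted on. The sketch below assumes a hypothetical reply schema of the form {"routing": {node: weight}} and falls back to uniform random routing when the reply is malformed; the schema, the function name parse_routing, and the node names are illustrative and not taken from the paper.

```python
import json
import math


def parse_routing(raw, nodes):
    """Return (probabilities, used_fallback) for the given node list.

    Assumed schema: {"routing": {"<node>": <non-negative weight>, ...}}.
    Any parse or validation error yields uniform routing over all nodes.
    """
    try:
        routing = json.loads(raw)["routing"]
        weights = {n: float(routing[n]) for n in nodes}
        total = sum(weights.values())
        if not all(math.isfinite(w) and w >= 0 for w in weights.values()):
            raise ValueError("weights must be finite and non-negative")
        if total <= 0:
            raise ValueError("weights must not all be zero")
        return {n: w / total for n, w in weights.items()}, False
    except (ValueError, KeyError, TypeError):
        # Fallback policy: route uniformly at random across nodes.
        uniform = 1.0 / len(nodes)
        return {n: uniform for n in nodes}, True
```

Returning the fallback flag alongside the probabilities makes it easy to count how often the LLM output had to be discarded during a test run.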
Agent Features
Memory
- episodic case memory for in-context learning
- historical telemetry summaries
Planning
- long-horizon global planning agent (epoch-level)
- short-term prompt scheduling agent (slot-level)
Tool Use
- LLM APIs for reasoning (GPT-5 planner, GPT-4 controllers)
- Kubernetes API for pod lifecycle
- FastAPI for control plane
Frameworks
- LLM-based decision prompts
- Lyapunov DPP baseline for comparison
Is Agentic
true
Architectures
- two-tier hierarchical agents
- central planner + distributed controllers
Collaboration
- probabilistic routing coordination across MEC nodes
- node-role intents to avoid incompatible deployments
Optimization Features
Infra Optimization
- OpenStack + Kubernetes orchestration
- Prometheus-based monitoring and local MongoDB caching
Model Optimization
- prebuilt container images to reduce initialization time
- low_cpu_mem_usage flag (Hugging Face) to reduce memory use while loading models
System Optimization
- joint latency–fairness objective
- node-role intents to specialize servers
- churn penalties to limit frequent reconfiguration
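To make the interaction between the joint objective and the churn penalty concrete, here is a purely illustrative scoring function. The weighted-sum form, the reference latency, and all weights are assumptions made for this sketch; the summary does not give the paper's actual objective.

```python
def composite_cost(mean_latency_s, fairness, reconfigurations, *,
                   latency_ref_s=100.0, fairness_weight=0.5, churn_weight=0.05):
    """Hypothetical cost (lower is better) for comparing candidate plans.

    mean_latency_s   : mean end-to-end latency in seconds
    fairness         : normalized Jain index in [0, 1]
    reconfigurations : number of model activations/deactivations in the plan
    """
    latency_term = mean_latency_s / latency_ref_s   # scale latency to roughly O(1)
    unfairness_term = 1.0 - fairness                # 0 when perfectly fair
    churn_term = churn_weight * reconfigurations    # discourages frequent redeployment
    return ((1.0 - fairness_weight) * latency_term
            + fairness_weight * unfairness_term
            + churn_term)
```

Under these assumed weights, a plan that matches another on latency and fairness but needs more pod restarts always scores worse, which is the intended effect of a churn penalty.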
Inference Optimization
- probabilistic prompt routing to balance load
- on-demand model activation/deactivation
- time-sliced vGPUs and device_map model parallelism
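Probabilistic prompt routing can be sketched as weighted sampling: each prompt in a slot is sent to a node drawn from the routing distribution. The function name route_prompts and the node and prompt identifiers below are hypothetical, and the sketch ignores queue state and model placement, which a real router would also need to check.

```python
import random


def route_prompts(prompt_ids, probs, seed=None):
    """Assign each prompt to a node sampled according to probs.

    probs maps node name -> routing weight (need not sum to 1).
    A seed can be passed for reproducible experiments.
    """
    rng = random.Random(seed)
    nodes = list(probs)
    weights = [probs[n] for n in nodes]
    targets = rng.choices(nodes, weights=weights, k=len(prompt_ids))
    return dict(zip(prompt_ids, targets))


# Example: spread ten prompts across three edge nodes for one slot.
assignment = route_prompts(
    [f"prompt-{i}" for i in range(10)],
    {"edge-1": 0.5, "edge-2": 0.3, "edge-3": 0.2},
    seed=0,
)
```

Sampling rather than always picking the highest-weight node spreads bursts across servers, which is the load-balancing effect the routing bullet describes.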
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Relies on external LLM APIs (GPT-4, GPT-5), which add cost and potential latency.
- Pod activation/termination is slow and can interrupt in-flight requests.
- Testbed is modest scale (7 VMs); behavior at very large scale is not shown.
When Not To Use
- When you require fully air-gapped, on-device orchestration without external LLM access.
- For microsecond-scale control loops, where a 30 s slot cadence is far too coarse.
- When energy minimization is the primary objective (energy is not jointly optimized here).
Failure Modes
- Planner and local controllers misalign, producing inconsistent deployments and degraded fairness.
- Pod shutdowns leave pods stuck in a pending state and trigger cascading resource contention.
- LLM API failures or malformed JSON force a fallback to random policies.
Core Entities
Models
- GPT2 (137M)
- GPT2-large (812M)
- BLIP (470M)
- Stable Diffusion (890M)
- GPT-4 (API)
- GPT-5 (API)
Metrics
- end-to-end latency
- normalized Jain fairness index
- success (service) ratio
- composite latency-fairness objective
Context Entities
Models
- time-sliced vGPUs
- Hugging Face device_map model parallelism
Metrics
- queue backlogs
- GPU/CPU headroom
- pod initialization time

