Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Multi-agent systems amplify security and cost risks (data leaks, tool abuse, resource exhaustion), and current frameworks leave blind spots. Companies must combine frameworks and add technical controls to avoid regulatory, financial, and operational losses.
Summary TLDR
This paper builds a technical knowledge base of production multi-agent AI systems, uses generative-AI-assisted threat modeling plus expert review to derive a taxonomy of 193 distinct agentic threats across nine categories, and scores 16 security/governance frameworks against every threat. No framework covers most multi-agent gaps: OWASP ASI leads at 65.3% coverage, CDAO GenAI covers development/ops well, and non-determinism and data-leakage channels are the worst-covered domains. The work is a practical guide to picking frameworks and prioritizing defenses for real multi-agent deployments.
Problem Statement
Multi-agent AI systems (agents that share memory, delegate tools, and coordinate) create new, behavioral attack surfaces not covered well by existing AI or infrastructure frameworks. Practitioners lack a systematic threat taxonomy and cross-framework coverage data to guide secure architecture and tool choice.
Main Contribution
A 193-item taxonomy of security threats unique to production multi-agent AI systems across nine categories.
A reproducible four-phase method: deep system knowledge base, generative-AI-assisted threat modeling, per-threat survey planning, and cross-framework scoring.
A quantitative cross-framework comparison of 16 security and governance frameworks, showing coverage by threat and development lifecycle phase.
Actionable selection guidance: OWASP ASI is the best overall (65.3% coverage), CDAO GenAI is the best for development/ops, and ATFAA-SHIELD offers high architectural specificity.
Key Findings
Multi-agent threat taxonomy contains 193 distinct, agent-specific threats.
Survey evaluated 16 security frameworks against every threat item.
No single framework achieves majority coverage of any single threat category.
Non-Determinism and Data Leakage are the weakest-covered domains.
Five threat items received no coverage from any reviewed framework.
Results
Threat taxonomy size
193
Frameworks surveyed
16
Top framework coverage
65.3% (OWASP ASI)
Weakest category mean score
Under-addressed category mean
Uncovered threat items
5
Who Should Care
What To Try In 7 Days
Inventory agent surfaces: list agents, shared memories, tool registries, and vector stores.
Run a quick gap matrix: map your controls against the paper's nine categories and flag non-determinism and data-leakage gaps.
Add short-term mitigations: per-agent cryptographic identity, signed tool manifests, and per-call least-privilege enforcement.
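The short-term mitigations above (signed tool manifests plus a per-call least-privilege check) can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the manifest layout, `REGISTRY_KEY`, and the function names are hypothetical, and HMAC-SHA256 stands in for a real signature scheme. A production setup would more likely use asymmetric signatures and per-agent keys issued by a key-management service.

```python
import hashlib
import hmac
import json

# Hypothetical signing key held by whoever publishes tool manifests.
# Assumption: a symmetric key is used here only to keep the sketch stdlib-only.
REGISTRY_KEY = b"demo-registry-key"


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Return an HMAC-SHA256 tag over a canonical JSON encoding of the manifest."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, tag: str, key: bytes) -> bool:
    """Check the tag in constant time so tampered manifests are rejected."""
    return hmac.compare_digest(sign_manifest(manifest, key), tag)


def authorize_call(agent_id: str, tool: str, manifest: dict, tag: str) -> bool:
    """Per-call check: valid signature, matching agent, and tool explicitly allowed."""
    if not verify_manifest(manifest, tag, REGISTRY_KEY):
        return False
    if manifest.get("agent_id") != agent_id:
        return False
    return tool in manifest.get("allowed_tools", [])


# Hypothetical manifest granting one agent a single read-only tool.
manifest = {"agent_id": "billing-agent", "allowed_tools": ["read_invoice"]}
tag = sign_manifest(manifest, REGISTRY_KEY)
```

Running the check on every tool invocation, rather than once at agent start-up, is what makes it a per-call control: a manifest edited after signing, or presented by a different agent, fails at the next call.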
Agent Features
Memory
- episodic memory (per-session history)
- semantic memory (knowledge bases, RAG)
- working memory / scratchpad
- KV cache and attention caches
Planning
- hierarchical planning (HTN)
- MCTS (Monte Carlo Tree Search)
- self-consistency and majority voting
- reflection/critic loops
Tool Use
- function-calling APIs
- web agents and scrapers
- database tools (SQL, vector DBs)
- cloud/SaaS APIs
- plugin ecosystems
Frameworks
- LangChain
- AutoGen
- LangGraph
- CrewAI
- Semantic Kernel
- NVIDIA NeMo
Is Agentic
true
Architectures
- hierarchical agents
- plan-and-execute
- ReAct
- swarm/multi-agent ensembles
- supervisor-worker orchestration
Collaboration
- inter-agent messaging (GroupChat, AgentCards)
- orchestrator-based delegation
- peer trust and reputation
- shared vector stores
Optimization Features
Token Efficiency
- context window tuning
- compression/summarization
- cache re-use
Infra Optimization
- autoscaling policies
- load-balancer routing
- edge model caching
Model Optimization
- quantization (INT8/FP16)
- TensorRT engine build
- dynamic batching
System Optimization
- MIG GPU partitions
- multi-instance GPU sharing
- container-level caching
Training Optimization
- LoRA
- few-shot and fine-tuning pipelines
- curriculum and replay buffers
Inference Optimization
- dynamic batching (Triton)
- speculative decoding
- replica routing
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Rapidly evolving field: coverage reflects state at publication and needs frequent updates.
- Framework scoring mixes governance and technical controls; operational applicability varies by org.
- No public code or dataset; reproducing the full survey requires the authors' knowledge base.
When Not To Use
- For simple single-agent chatbots without tool access or persistent memory, where this level of analysis is overkill.
- If you need low-latency, single-model microservices where traditional infra controls suffice.
- When immediate, product-level remediation is required without time for framework layering.
Failure Modes
- Applying a single framework and assuming full coverage leaves blind spots (non-determinism, planning).
- Relying only on governance checklists without runtime controls causes detection gaps during streaming and stochastic execution.
- Combining multiple frameworks without resolving overlaps can create inconsistent controls and audit blind spots.
Core Entities
Models
- GPT-3
- GPT-4
- NVIDIA NeMo
Metrics
- coverage fraction (framework vs taxonomy)
- mean per-category framework score
- OWASP ASI coverage %
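As a rough illustration of how the metrics listed above could be computed from a framework-by-threat score matrix, consider the sketch below. The scoring scale, the rule that any score above zero counts as "covered", and all framework names, threat IDs, and values are assumptions made up for the example. The paper's actual rubric and data are not reproduced here.

```python
from statistics import mean

# Hypothetical data: scores[framework][threat_id], where 0 means "not addressed".
scores = {
    "framework_a": {"T1": 2, "T2": 0, "T3": 1},
    "framework_b": {"T1": 0, "T2": 0, "T3": 3},
}
# Hypothetical mapping from threat ID to taxonomy category.
categories = {"T1": "Data Leakage", "T2": "Non-Determinism", "T3": "Data Leakage"}


def coverage_fraction(framework_scores: dict) -> float:
    """Share of threats that a single framework addresses at all (score > 0)."""
    covered = sum(1 for s in framework_scores.values() if s > 0)
    return covered / len(framework_scores)


def mean_category_score(all_scores: dict, threat_categories: dict) -> dict:
    """Mean score per category, pooled over every framework and threat."""
    by_category = {}
    for framework_scores in all_scores.values():
        for threat, score in framework_scores.items():
            by_category.setdefault(threat_categories[threat], []).append(score)
    return {cat: mean(vals) for cat, vals in by_category.items()}


def uncovered_threats(all_scores: dict) -> list:
    """Threat IDs that no framework addresses (score 0 everywhere)."""
    threats = next(iter(all_scores.values())).keys()
    return sorted(
        t for t in threats if all(fw[t] == 0 for fw in all_scores.values())
    )
```

With a matrix like this, the lowest values from `mean_category_score` point at the weakest-covered categories, and `uncovered_threats` lists items that need controls beyond any of the surveyed frameworks.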
Benchmarks
- ST-WebAgentBench
- AgentBench
Context Entities
Models
- ReAct-style LLMs
- RAG retrievers
- vision-language models (NeVA)
- Whisper
Metrics
- coverage (%)
- mean framework score per category
- number of threat items unaddressed
Datasets
- enterprise RAG corpora
- evaluation benchmark datasets referenced in frameworks
Benchmarks
- Pass@K-style reliability metrics
- CuP (Completion Under Policies) in ST-WebAgentBench

