A systematic, practitioner-focused map of 193 multi-agent security threats and how 16 frameworks cover them

March 9, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Tam Nguyen, Moses Ndebugre, Dheeraj Arremsetty

Why It Matters For Business

Multi-agent systems amplify security and cost risks (data leaks, tool abuse, resource exhaustion), and current frameworks leave blind spots. Companies must combine frameworks and add technical controls to avoid regulatory, financial, and operational losses.

Summary TLDR

This paper builds a technical knowledge base of production multi-agent AI systems, uses generative-AI-assisted threat modeling plus expert review to derive a taxonomy of 193 distinct agentic threats across nine categories, and scores 16 security/governance frameworks against every threat. No framework covers most multi-agent gaps: OWASP ASI leads at 65.3% coverage, CDAO GenAI covers development/ops well, and non-determinism and data-leakage channels are the worst‑covered domains. The work is a practical guide to picking frameworks and prioritizing defenses for real multi-agent deployments.

Problem Statement

Multi-agent AI systems (agents that share memory, delegate tools, and coordinate) create new, behavioral attack surfaces not covered well by existing AI or infrastructure frameworks. Practitioners lack a systematic threat taxonomy and cross-framework coverage data to guide secure architecture and tool choice.

Main Contribution

A 193-item taxonomy of security threats unique to production multi-agent AI systems across nine categories.

A reproducible four-phase method: deep system knowledge base, generative-AI-assisted threat modeling, per-threat survey planning, and cross-framework scoring.

A quantitative cross-framework comparison of 16 security and governance frameworks, showing coverage by threat and development lifecycle phase.

Actionable selection guidance: OWASP ASI best overall (65.3% coverage); CDAO GenAI best for development/ops; ATFAA-SHIELD high architectural specificity.
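Because the guidance is to layer frameworks rather than rely on one, a small sketch of how a combination could be chosen may help. The code below is not the paper's method: it is a standard greedy set-cover heuristic, and the threat IDs (`T1` to `T6`) and framework names (`framework_a` and so on) are hypothetical placeholders, not data from the survey.

```python
from typing import Dict, List, Set, Tuple


def coverage_fraction(covered: Set[str], taxonomy: Set[str]) -> float:
    """Fraction of taxonomy items addressed by a framework or combination."""
    return len(covered & taxonomy) / len(taxonomy)


def greedy_combination(
    frameworks: Dict[str, Set[str]], taxonomy: Set[str], max_frameworks: int
) -> Tuple[List[str], Set[str]]:
    """Greedy set cover: repeatedly add the framework that covers the most
    still-uncovered threats. A heuristic, not guaranteed to be optimal."""
    chosen: List[str] = []
    covered: Set[str] = set()
    remaining = dict(frameworks)
    while remaining and len(chosen) < max_frameworks:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(
            sorted(remaining),
            key=lambda name: len((remaining[name] & taxonomy) - covered),
        )
        gain = (remaining[best] & taxonomy) - covered
        if not gain:
            break  # no remaining framework adds coverage
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Toy data: hypothetical threat IDs and framework names (assumptions).
TAXONOMY = {"T1", "T2", "T3", "T4", "T5", "T6"}
FRAMEWORKS = {
    "framework_a": {"T1", "T2", "T3", "T4"},
    "framework_b": {"T3", "T4", "T5"},
    "framework_c": {"T1", "T5"},
}

chosen, covered = greedy_combination(FRAMEWORKS, TAXONOMY, max_frameworks=2)
uncovered = sorted(TAXONOMY - covered)
print(chosen, f"{coverage_fraction(covered, TAXONOMY):.1%}", uncovered)
```

In this toy data one threat stays uncovered even after layering two frameworks, which is the kind of residual gap that would need a dedicated technical control.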

Key Findings

Multi-agent threat taxonomy contains 193 distinct, agent-specific threats.

Numbers: 193 threat items across 9 categories

Survey evaluated 16 security frameworks against every threat item.

Numbers: 16 frameworks scored on 193 items

No single framework achieves majority coverage of any single threat category.

Numbers: no framework reaches majority coverage in any category

Non-Determinism and Data Leakage are the weakest-covered domains.

Numbers: Non-Determinism mean score 1.231; Data Leakage 1.340

Five threat items received no coverage from any reviewed framework.

Numbers: 5 threat items unaddressed
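To make the last two findings concrete, here is a minimal sketch of how per-category mean scores and fully uncovered items could be derived from a threat-by-framework score table. The 0 to 3 scale, the threat IDs, the framework names, and the "Tool Use" category label are illustrative assumptions; the summary does not state the paper's actual scoring rubric.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (threat_id, category, {framework: score}).
# Scale (0 = not addressed ... 3 = fully addressed) is an assumption.
SCORES = [
    ("T1", "Non-Determinism", {"fw_a": 0, "fw_b": 1}),
    ("T2", "Non-Determinism", {"fw_a": 0, "fw_b": 0}),
    ("T3", "Data Leakage", {"fw_a": 2, "fw_b": 1}),
    ("T4", "Tool Use", {"fw_a": 3, "fw_b": 2}),
]


def category_means(rows):
    """Mean score per category, pooled over all threats and frameworks."""
    by_category = defaultdict(list)
    for _threat_id, category, framework_scores in rows:
        by_category[category].extend(framework_scores.values())
    return {category: mean(values) for category, values in by_category.items()}


def uncovered_items(rows):
    """Threat IDs that every framework scored as not addressed (0)."""
    return [
        threat_id
        for threat_id, _category, framework_scores in rows
        if all(score == 0 for score in framework_scores.values())
    ]


means = category_means(SCORES)
weakest = min(means, key=means.get)
print(means, weakest, uncovered_items(SCORES))
```

The same two functions applied to a real scoring sheet would surface the weakest domains and the items no framework touches, which are the natural candidates for custom controls.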

Results

Threat taxonomy size

Value: 193 items

Frameworks surveyed

Value: 16 frameworks

Top framework coverage

Value: 65.3% (OWASP ASI)

Weakest category mean score

Value: Non-Determinism mean 1.231 (across frameworks)

Under-addressed category mean

Value: Data Leakage mean 1.340 (across frameworks)

Uncovered threat items

Value: 5 items with no framework coverage

What To Try In 7 Days

Inventory agent surfaces: list agents, shared memories, tool registries, and vector stores.

Run a quick gap matrix: map your controls vs the paper's nine categories and flag non-determinism and data‑leakage gaps.

Add short-term mitigations: per-agent cryptographic identity, signed tool manifests, and per-call least-privilege enforcement.
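As one illustration of the signed tool manifest idea, the sketch below signs and verifies a manifest with HMAC-SHA256 from the Python standard library. This is a stand-in: a real per-agent cryptographic identity would more likely use asymmetric signatures (for example Ed25519) so that verifiers never hold the signing key. The manifest fields (`tool`, `version`, `scopes`) and the demo key are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so the same
    manifest always produces the same bytes."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the canonical manifest."""
    return hmac.new(key, canonical_bytes(manifest), hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, signature: str, key: bytes) -> bool:
    """Constant-time comparison of the expected and presented tags."""
    expected = sign_manifest(manifest, key)
    return hmac.compare_digest(expected, signature)


# Demo values only (assumptions); load real keys from a secret store.
KEY = b"demo-key-do-not-use-in-production"
manifest = {"tool": "sql_query", "version": "1.0.0", "scopes": ["read:orders"]}
signature = sign_manifest(manifest, KEY)

# A manifest whose scopes were widened after signing must fail verification.
tampered = dict(manifest, scopes=["read:orders", "write:orders"])
print(verify_manifest(manifest, signature, KEY), verify_manifest(tampered, signature, KEY))
```

An orchestrator could run `verify_manifest` before registering any tool, so that a manifest edited in a shared registry is rejected rather than silently granting broader scopes.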

Agent Features

Memory

  • episodic memory (per-session history)
  • semantic memory (knowledge bases, RAG)
  • working memory / scratchpad
  • KV cache and attention caches

Planning

  • hierarchical planning (HTN)
  • MCTS (Monte Carlo Tree Search)
  • self-consistency and majority voting
  • reflection/critic loops

Tool Use

  • function-calling APIs
  • web agents and scrapers
  • database tools (SQL, vector DBs)
  • cloud/SaaS APIs
  • plugin ecosystems

Frameworks

  • LangChain
  • AutoGen
  • LangGraph
  • CrewAI
  • Semantic Kernel
  • NVIDIA NeMo

Is Agentic

true

Architectures

  • hierarchical agents
  • plan-and-execute
  • ReAct
  • swarm/multi-agent ensembles
  • supervisor-worker orchestration

Collaboration

  • inter-agent messaging (GroupChat, AgentCards)
  • orchestrator-based delegation
  • peer trust and reputation
  • shared vector stores

Optimization Features

Token Efficiency

  • context window tuning
  • compression/summarization
  • cache re-use

Infra Optimization

  • autoscaling policies
  • load-balancer routing
  • edge model caching

Model Optimization

  • quantization (INT8/FP16)
  • TensorRT engine build
  • dynamic batching

System Optimization

  • MIG GPU partitions
  • multi-instance GPU sharing
  • container-level caching

Training Optimization

  • LoRA
  • few-shot and fine-tuning pipelines
  • curriculum and replay buffers

Inference Optimization

  • dynamic batching (Triton)
  • speculative decoding
  • replica routing

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Rapidly evolving field: coverage reflects state at publication and needs frequent updates.
  • Framework scoring mixes governance and technical controls; operational applicability varies by org.
  • No public code or dataset; reproducing the full survey requires the authors' knowledge base.

When Not To Use

  • For simple single-agent chatbots without tool access or persistent memory, where a 193-item taxonomy is overkill.
  • If you need low-latency, single-model microservices where traditional infra controls suffice.
  • When immediate, product-level remediation is required without time for framework layering.

Failure Modes

  • Applying a single framework and assuming full coverage leaves blind spots (non-determinism, planning).
  • Relying only on governance checklists without runtime controls causes detection gaps during streaming and stochastic execution.
  • Combining multiple frameworks without resolving overlaps can create inconsistent controls and audit blind spots.
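The second failure mode above comes down to missing runtime controls. A minimal sketch of one such control, a per-call least-privilege gate in front of tool execution, is shown below. The `ToolGateway` class, the agent ID, and the tool names are hypothetical and not taken from the paper or from any specific agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet


class ToolCallDenied(PermissionError):
    """Raised when an agent calls a tool it has not been granted."""


@dataclass
class ToolGateway:
    # Explicit allowlist: agent ID -> names of tools it may call.
    grants: Dict[str, FrozenSet[str]]
    tools: Dict[str, Callable[..., object]] = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def call(self, agent_id: str, tool_name: str, **kwargs):
        """Check the grant on every call, log the decision, then dispatch."""
        allowed = tool_name in self.grants.get(agent_id, frozenset())
        self.audit_log.append((agent_id, tool_name, "allow" if allowed else "deny"))
        if not allowed:
            raise ToolCallDenied(f"{agent_id} may not call {tool_name}")
        return self.tools[tool_name](**kwargs)


# Hypothetical agent and tools for illustration.
gateway = ToolGateway(
    grants={"researcher": frozenset({"search"})},
    tools={
        "search": lambda query: f"results for {query}",
        "delete_records": lambda table: f"deleted {table}",
    },
)

print(gateway.call("researcher", "search", query="mcp"))
try:
    gateway.call("researcher", "delete_records", table="orders")
except ToolCallDenied as err:
    print("denied:", err)
```

Denying by default and logging every decision means the gate still works when a governance checklist was followed on paper but an agent's behavior drifts at run time.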

Core Entities

Models

  • GPT-3
  • GPT-4
  • NVIDIA NeMo

Metrics

  • coverage fraction (framework vs taxonomy)
  • mean per-category framework score
  • OWASP ASI coverage %

Benchmarks

  • ST-WebAgentBench
  • AgentBench

Context Entities

Models

  • ReAct-style LLMs
  • RAG retrievers
  • vision-language models (NeVA)
  • Whisper

Metrics

  • coverage (%)
  • mean framework score per category
  • number of threat items unaddressed

Datasets

  • enterprise RAG corpora
  • evaluation benchmark datasets referenced in frameworks

Benchmarks

  • Pass@K-style reliability metrics
  • CuP (Completion Under Policies) in ST-WebAgentBench