Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Agentic systems that chain LLMs and tools can be hijacked by hidden instructions in text or images. Adding per-message sanitization, provenance tracking, and output validation reduces attack surface without harming legitimate task accuracy—important for customer-facing automation, finance, and security-sensitive tools.
Summary TLDR
The paper proposes a practical defense that wraps multi-agent pipelines (LangChain/GraphChain) with per-message text and image sanitizers, a provenance ledger, pre-LLM trust masks, and an output validator. In a prototype, detection of multimodal prompt injections rose to 94%, cross-agent trust leakage fell 70%, and benign-task accuracy stayed high (96%). The design is modular and intended to be added to existing agent stacks with moderate runtime cost.
Problem Statement
Agent-based AI pipelines that route text and images through multiple LLM/VLM calls are vulnerable to hidden or paraphrased instructions (prompt injection). Existing defenses (keyword filters, fine-tuning, output filters) miss visual or cross-agent attacks and do not track provenance across agent hops.
Main Contribution
Cross-Agent Multimodal Provenance-Aware Defense: a layered pipeline that sanitizes every incoming text/image, reapplies sanitization before LLM calls, and validates outputs before passing them on.
Four cooperating agents: Text Sanitizer (span-level detection and rewrite), Visual Sanitizer (OCR, metadata, CLIP patch checks), Main Task Model (ML/VLM inference with trust masks), and Output Validator (policy checks and influence attribution).
A provenance ledger that records modality, source, span/patch index, and trust scores per interaction; ledger entries are used to build trust-aware attention masks.
Prototype implementation integrated with LangChain/GraphChain using RoBERTa for text detection, PaddleOCR + CLIP for images, Redis for ledger storage, and compatibility with GPT-4o-mini or open VLMs.
Empirical evaluation comparing the framework to keyword filtering, safety fine-tuning, post-hoc output filtering, and a single-VLM baseline on detection, trust leakage, and benign-task retention.
Key Findings
Multimodal prompt-injection detection rate improved to 94%.
Cross-agent trust leakage was reduced from 0.24 to 0.07 (about 70% reduction).
Benign task performance was preserved at 96% accuracy.
Results
Accuracy
Cross-modal trust leakage
Accuracy
Who Should Care
What To Try In 7 Days
Add an input interceptor that routes all external text/images through a sanitizer before agents see them.
Record lightweight provenance metadata (source, modality, trust tag) alongside messages in a simple key-value store.
Wrap any LLM call with a pre-LLM filter that masks low-trust spans and an output validator that checks for policy violations before tool execution.
Agent Features
Memory
- provenance ledger stores token/patch-level trust and provenance ids
- memory modules store only sanitized content with trust metadata
Planning
- trust-aware routing and masking before LLM inference
Tool Use
- block or permit tool calls based on validator output and provenance
- validator can request regeneration with tighter masks
Frameworks
- LangChain
- GraphChain
Is Agentic
true
Architectures
- multi-agent pipeline (LangChain/GraphChain integration)
- modular agent services (Text Sanitizer, Visual Sanitizer, Validator, Main Model)
Collaboration
- provenance propagation across agent hops
- cross-agent enforcement of trust boundaries
Optimization Features
Token Efficiency
- selective masking removes or attenuates only low-trust spans
Infra Optimization
- in-memory Redis ledger for low-latency provenance lookups
System Optimization
- modular Python services for agents to allow incremental integration
Inference Optimization
- trust-aware attention masking to reduce influence of low-trust tokens
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- No public dataset or full evaluation details are shared; exact test scenarios and dataset construction are unspecified.
- Prototype relies on heuristic detectors (pattern rules, RoBERTa detector, stego detector) which may miss novel attacks.
- Provenance ledger itself could be a target (poisoning or tampering) and requires secure deployment.
- Performance overhead described as moderate but no latency numbers are reported.
When Not To Use
- Ultra-low-latency deployments where any extra SAN/validation roundtrip is unacceptable.
- Systems where all inputs are fully trusted and controlled (internal-only closed pipelines).
Failure Modes
- Adversary crafts new visual steganography or paraphrases that bypass the sanitizer detectors.
- Provenance ledger poisoning or incorrect trust assignment leading to false approval.
- False positives redact needed content and degrade downstream task utility.
- Over-reliance on wrapper rules without model-level hardening causes blind spots.
Core Entities
Models
- RoBERTa (pattern detector)
- CLIP (image patch embeddings)
- PaddleOCR (OCR)
- GPT-4o-mini (OpenAI) as exemplar LLM
- LLaVA / BLIP-2 (open VLM options)
Metrics
- Accuracy
- trust leakage (unit reported as ×0.01)
Benchmarks
- keyword filtering baseline
- safety fine-tuning baseline
- post-hoc output filtering baseline
- single-VLM baseline
Context Entities
Datasets
- MM-SafetyBench (cited but not explicitly used)

