Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
MultiFuzz finds modest but consistent extra code paths and protocol states in stateful services by using indexed protocol docs and cooperating LLM agents, which can reveal hard-to-reach bugs in production network stacks.
Summary TLDR
MultiFuzz is a system that combines dense retrieval of protocol docs with multiple specialized LLM agents to guide network-protocol fuzzing. It turns RFC text into small 'agentic' chunks, indexes them in a vector DB, and uses crew-style agents (grammar extraction, seed enrichment, plateau-surpassing) to generate protocol-aware packets. On Live555 RTSP, MultiFuzz produced modest but consistent gains in branch coverage and deeper state exploration versus AFLNet, NSFuzz, and ChatAFL in 24-hour runs.
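The retrieval step described above can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: it swaps the dense embeddings and Chroma vector store for a bag-of-words cosine similarity so that it runs with the standard library alone. The propositions, the function names (`tokenize`, `vectorize`, `cosine`, `retrieve`), and the queries are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical propositions written for illustration only;
# they are NOT taken from the 445 propositions reported in the paper.
PROPOSITIONS = [
    "A SETUP request specifies the transport mechanism for a stream.",
    "A PLAY request tells the server to start sending data.",
    "A TEARDOWN request stops stream delivery and frees resources.",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(text):
    """Bag-of-words term counts (a stand-in for a dense embedding)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, propositions, k=1):
    """Return the k propositions most similar to the query."""
    q = vectorize(query)
    ranked = sorted(propositions, key=lambda p: cosine(q, vectorize(p)), reverse=True)
    return ranked[:k]


top = retrieve("which request frees resources", PROPOSITIONS)
```

In a real pipeline the `vectorize` and `cosine` steps would be replaced by an embedding model and a vector-store query; the ranking logic stays the same in spirit.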
Problem Statement
Traditional fuzzers struggle with deep, stateful protocols because they lack semantic protocol knowledge and rely on rigid mutations. Single-LLM fuzzers help but suffer from hallucinations, unreliable outputs, and limited use of context. The paper seeks a more reliable, context-aware fuzzing pipeline that uses protocol specs to guide test generation.
Main Contribution
MultiFuzz: a multi-agent, retrieval-augmented fuzzing framework built on ChatAFL and AFLNet.
Agentic chunking + propositional transformation: convert RFC text into small semantically precise units for embedding and retrieval.
Dense retrieval integration: use a Chroma vector store of RFC chunks to provide protocol-aware context to agents.
Three specialized crews (Grammar Extraction, Seed Enrichment, Coverage Plateau Surpassing) that collaborate via chain-of-thought prompts and tools.
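The Coverage Plateau Surpassing crew implies some trigger that decides when coverage has stalled. The summary does not say how MultiFuzz detects a plateau, so the following is only a sketch under an assumed rule: a plateau is declared after `patience` consecutive measurements with no new branches. The class name, the `patience` parameter, and the sample coverage values are hypothetical.

```python
class PlateauDetector:
    """Flags a plateau after `patience` consecutive non-improving updates.

    Assumed rule for illustration; not the detection logic from the paper.
    """

    def __init__(self, patience):
        self.patience = patience
        self.best = 0    # highest branch count seen so far
        self.stale = 0   # consecutive updates without improvement

    def update(self, branches_covered):
        """Record a coverage measurement and report whether a plateau is reached."""
        if branches_covered > self.best:
            self.best = branches_covered
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


detector = PlateauDetector(patience=3)
# Hypothetical coverage readings, one per fuzzing interval.
flags = [detector.update(c) for c in [100, 120, 120, 120, 120, 130]]
```

When `update` returns `True`, a fuzzer built this way would hand control to the plateau-surpassing agents; the counter resets as soon as new branches appear.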
Key Findings
MultiFuzz reached an average of 2940 covered branches on Live555 RTSP.
Branch coverage improved by 0.9% vs ChatAFL, 2.8% vs AFLNet, and 4.7% vs NSFuzz on the evaluated runs.
MultiFuzz triggered an average of 163.33 valid state transitions, outperforming the baselines by margins ranging from 2.3% (ChatAFL) to 94.4% (AFLNet).
MultiFuzz explored an average of 14.67 FSM states, vs 14.33 for ChatAFL, 10.0 for AFLNet, and 11.7 for NSFuzz.
Propositional transformation produced 445 unique propositions from RFC-2326, which serve as the retrieval substrate.
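The percentage gains quoted in these findings are relative improvements over a baseline. As a worked example, the short sketch below applies that formula to the FSM state averages listed above. The function name `relative_gain` is hypothetical, and because the inputs are rounded averages, the results are approximate.

```python
def relative_gain(value, baseline):
    """Relative improvement of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Average FSM states explored, as quoted in the findings above.
multifuzz_states = 14.67
baseline_states = {"ChatAFL": 14.33, "AFLNet": 10.0, "NSFuzz": 11.7}

gains = {
    name: round(relative_gain(multifuzz_states, states), 1)
    for name, states in baseline_states.items()
}
```

With these inputs the gains come out to roughly 2.4% over ChatAFL, 46.7% over AFLNet, and 25.4% over NSFuzz.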
Results
Branch coverage (average, Live555 RTSP, 24h, 3 runs)
State transitions (average, Live555 RTSP, 24h, 3 runs)
FSM states explored (average, Live555 RTSP, 24h, 3 runs)
Who Should Care
What To Try In 7 Days
Index one protocol's RFCs into a vector DB and run simple retrieval queries to validate recall.
Prototype a small 'seed enrichment' agent that inserts protocol-compliant packets into existing seeds.
Run a 24-hour comparison against your current fuzzer on a test target and compare branches and state transitions.
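For the seed enrichment prototype, one possible starting point is a function that adds a missing RTSP request type to an existing seed. This is a minimal sketch under several assumptions: the seed is a sequence of body-less requests separated by blank lines, the new request is placed before any TEARDOWN, and CSeq values are renumbered afterwards. The names `split_messages`, `method_of`, and `enrich_seed`, along with the example URL, are hypothetical, and the paper's agents choose what to insert with an LLM rather than a fixed rule.

```python
import re


def split_messages(seed: bytes):
    """Split a seed into requests; each body-less request ends at a blank line."""
    return [m + b"\r\n\r\n" for m in seed.split(b"\r\n\r\n") if m]


def method_of(message: bytes) -> bytes:
    """Return the method token from the request line."""
    return message.split(b" ", 1)[0]


def enrich_seed(seed: bytes, method: bytes, url: bytes) -> bytes:
    """Insert a request for `method` if the seed lacks one, then renumber CSeq."""
    messages = split_messages(seed)
    if any(method_of(m) == method for m in messages):
        return seed  # method already exercised; leave the seed unchanged
    new_message = method + b" " + url + b" RTSP/1.0\r\nCSeq: 0\r\n\r\n"
    # Place the new request before TEARDOWN so the session is still open.
    index = next(
        (i for i, m in enumerate(messages) if method_of(m) == b"TEARDOWN"),
        len(messages),
    )
    messages.insert(index, new_message)
    # Renumber CSeq so sequence numbers increase monotonically.
    renumbered = [
        re.sub(rb"CSeq: \d+", b"CSeq: %d" % n, m, count=1)
        for n, m in enumerate(messages, start=1)
    ]
    return b"".join(renumbered)


# Hypothetical seed and URL, for illustration only.
URL = b"rtsp://127.0.0.1:8554/test"
SEED = (
    b"OPTIONS " + URL + b" RTSP/1.0\r\nCSeq: 1\r\n\r\n"
    b"TEARDOWN " + URL + b" RTSP/1.0\r\nCSeq: 2\r\nSession: 1\r\n\r\n"
)
enriched = enrich_seed(SEED, b"DESCRIBE", URL)
```

The sketch adds only a request line and a CSeq header. Methods that require further headers, such as a Session header on PLAY, would need those added before the enriched seed is likely to reach deeper states.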
Agent Features
Memory
- retrieval memory via vector embeddings
Planning
- chain-of-thought style reasoning
- prompt-driven task decomposition
Tool Use
- dense vector DB (Chroma)
- CVE retrieval tool (NVD API)
- Packet/seed parsing tools
- Grammar formatting tool
Frameworks
- LangChain
- CrewAI
Is Agentic
true
Architectures
- multi-agent (crew-based)
- retrieval-augmented (RAG)
Collaboration
- specialized crews (Grammar, Seed, Coverage)
- shared context via dense retrieval
Optimization Features
System Optimization
- assignment of sub-tasks to different LLMs to optimize effectiveness
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Evaluation limited to a single protocol implementation (Live555 RTSP).
- Relies on external LLMs and multiple large models, which adds cost and variability.
- No public code release referenced, limiting immediate reproducibility.
- Dense retrieval effectiveness depends on quality of RFC chunking and embeddings.
When Not To Use
- When you must avoid external LLM calls for privacy or compliance reasons.
- On tiny targets where the added complexity and cost outweigh marginal coverage gains.
- If no formal protocol spec (RFC) or reliable documentation exists to index.
Failure Modes
- LLM hallucinations producing invalid or harmful packets.
- Irrelevant retrieval results leading agents astray.
- Model variability causing inconsistent fuzzing performance across runs.
- Increased operational complexity that breaks lightweight CI fuzzing pipelines.
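One way to contain the hallucination failure mode is to check LLM-generated packets for basic structure before they enter the seed corpus. The sketch below is an illustration, not something the paper describes: it accepts a message only if the request line names one of the methods defined in RFC 2326, uses the `RTSP/1.0` version string, carries a `CSeq` header, and ends with a blank line. The function name `is_plausible_request` is hypothetical, and the check is deliberately shallow.

```python
import re

# Methods defined in RFC 2326.
RTSP_METHODS = {
    "DESCRIBE", "ANNOUNCE", "GET_PARAMETER", "OPTIONS", "PAUSE", "PLAY",
    "RECORD", "REDIRECT", "SETUP", "SET_PARAMETER", "TEARDOWN",
}

REQUEST_LINE = re.compile(r"^([A-Z_]+) (\S+) RTSP/1\.0$")


def is_plausible_request(message: str) -> bool:
    """Shallow structural check for an LLM-generated RTSP request."""
    head, separator, _body = message.partition("\r\n\r\n")
    if not separator:
        return False  # header section is not terminated by a blank line
    lines = head.split("\r\n")
    match = REQUEST_LINE.match(lines[0])
    if not match or match.group(1) not in RTSP_METHODS:
        return False  # unknown method or malformed request line
    headers = lines[1:]
    if any(":" not in header for header in headers):
        return False  # every header needs a "Name: value" shape
    return any(header.lower().startswith("cseq:") for header in headers)


ok = is_plausible_request("PLAY rtsp://127.0.0.1:8554/test RTSP/1.0\r\nCSeq: 4\r\n\r\n")
bad = is_plausible_request("PLAYY rtsp://127.0.0.1:8554/test RTSP/1.0\r\nCSeq: 4\r\n\r\n")
```

A filter like this suits LLM-produced templates and enriched seeds; the fuzzer's own mutations should stay unfiltered, since malformed inputs are part of what fuzzing is meant to exercise.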
Core Entities
Models
- llama3.3-70b-versatile
- deepseek-r1-distill-llama-70b
- llama3-70b-8192
- llama-4-scout-17b-16e-instruct
- llama-3.1-8b-instant
Metrics
- branch coverage
- number of states
- number of state transitions
- unique crashes
- total paths explored
Datasets
- RFC-2326 (RTSP specification)
- Live555 media streaming server (target implementation)
Benchmarks
- ProFuzzBench

