MultiFuzz: dense-retrieval + multi-agent LLMs to push RTSP fuzzing deeper

August 19, 2025 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Youssef Maklad, Fares Wael, Ali Hamdi, Wael Elsersy, Khaled Shaban

Why It Matters For Business

MultiFuzz finds modest but consistent extra code paths and protocol states in stateful services by using indexed protocol docs and cooperating LLM agents, which can reveal hard-to-reach bugs in production network stacks.

Summary TLDR

MultiFuzz is a system that combines dense retrieval of protocol docs with multiple specialized LLM agents to guide network-protocol fuzzing. It turns RFC text into small 'agentic' chunks, indexes them in a vector DB, and uses crew-style agents (grammar extraction, seed enrichment, plateau-surpassing) to generate protocol-aware packets. On Live555 RTSP, MultiFuzz produced modest but consistent gains in branch coverage and deeper state exploration versus AFLNet, NSFuzz, and ChatAFL in 24-hour runs.

Problem Statement

Traditional fuzzers struggle with deep, stateful protocols because they lack semantic protocol knowledge and rely on rigid mutations. Single-LLM fuzzers help but suffer from hallucinations, unreliable outputs, and limited use of context. The paper seeks a more reliable, context-aware fuzzing pipeline that uses protocol specifications to guide test generation.

Main Contribution

MultiFuzz: a multi-agent, retrieval-augmented fuzzing framework built on ChatAFL and AFLNet.

Agentic chunking + propositional transformation: convert RFC text into small semantically precise units for embedding and retrieval.

Dense retrieval integration: use a Chroma vector store of RFC chunks to provide protocol-aware context to agents.

Three specialized crews (Grammar Extraction, Seed Enrichment, Coverage Plateau Surpassing) that collaborate via chain-of-thought prompts and tools.
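
The retrieval step can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: MultiFuzz uses a Chroma vector store with learned embeddings, whereas this toy substitutes bag-of-words vectors and cosine similarity. The proposition strings and the `embed`, `cosine`, and `retrieve` helpers are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical propositions written for this example; they are NOT taken
# from the 445 propositions the paper extracted from RFC 2326.
PROPOSITIONS = [
    "The SETUP request specifies the transport mechanism for a stream.",
    "The PLAY request tells the server to start sending data.",
    "The PAUSE request temporarily halts stream delivery.",
    "The TEARDOWN request stops delivery and frees resources.",
]

def embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a dense model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, k=2):
    """Return the k propositions most similar to the query."""
    q = embed(query)
    ranked = sorted(PROPOSITIONS, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

top = retrieve("which request halts the stream temporarily", k=1)
print(top[0])
```

In a real pipeline the retrieved propositions would be passed to an agent as context. The point of chunking into small propositions is that each retrieved unit carries one precise fact rather than a long RFC passage.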

Key Findings

MultiFuzz reached an average branch coverage of 2940 branches on Live555 RTSP.

Numbers: avg branches=2940 (Table I)

Branch coverage improved +0.9% vs ChatAFL, +2.8% vs AFLNet, +4.7% vs NSFuzz on evaluated runs.

Numbers: Δ vs ChatAFL=+0.9% | vs AFLNet=+2.8% | vs NSFuzz=+4.7%

MultiFuzz triggered an average of 163.33 valid state transitions, outperforming the baselines by between 2.3% (ChatAFL) and 94.4% (AFLNet).

Numbers: avg transitions=163.33; Δ vs AFLNet=+94.4%

MultiFuzz explored on average 14.67 FSM states vs ChatAFL 14.33, AFLNet 10.0, NSFuzz 11.7.

Numbers: avg states=14.67; Δ vs AFLNet=+46.7%

Propositional transformation produced 445 unique propositions from RFC 2326, which serve as the retrieval substrate.

Numbers: 445 propositions extracted from RFC 2326

Results

Branch coverage (average, Live555 RTSP, 24h, 3 runs)

Value: 2940 branches

Baseline: ChatAFL avg 2912.67; AFLNet 2860.0; NSFuzz 2807.0

State transitions (average, Live555 RTSP, 24h, 3 runs)

Value: 163.33 transitions

Baseline: ChatAFL 159.67; AFLNet 84.0; NSFuzz 90.33

FSM states explored (average, Live555 RTSP, 24h, 3 runs)

Value: 14.67 states

Baseline: ChatAFL 14.33; AFLNet 10.0; NSFuzz 11.7
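
The percentage deltas quoted in Key Findings follow from these averages as relative gains, (MultiFuzz - baseline) / baseline. The short sketch below recomputes them from the reported values; the dictionary layout and function names are illustrative choices, not anything from the paper.

```python
# Averages as reported in this summary (Live555 RTSP, 24h, 3 runs).
RESULTS = {
    "branches": {"MultiFuzz": 2940.0, "ChatAFL": 2912.67, "AFLNet": 2860.0, "NSFuzz": 2807.0},
    "transitions": {"MultiFuzz": 163.33, "ChatAFL": 159.67, "AFLNet": 84.0, "NSFuzz": 90.33},
    "states": {"MultiFuzz": 14.67, "ChatAFL": 14.33, "AFLNet": 10.0, "NSFuzz": 11.7},
}

def relative_gain(value, baseline):
    """Percentage improvement of value over baseline."""
    return 100.0 * (value - baseline) / baseline

def gains_vs_baselines(row, system="MultiFuzz"):
    """Relative gain of `system` over every other entry in the row, rounded to 0.1%."""
    return {
        name: round(relative_gain(row[system], baseline), 1)
        for name, baseline in row.items()
        if name != system
    }

GAINS = {metric: gains_vs_baselines(row) for metric, row in RESULTS.items()}

for metric, row in GAINS.items():
    print(metric, row)
```

The branch-coverage gains and the transition gains over ChatAFL and AFLNet reproduce the reported figures (+0.9%, +2.8%, +4.7%, +2.3%, +94.4%), as does the +46.7% state gain over AFLNet. The remaining deltas printed by the script are derived here from the averages and are not quoted in the source.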

Who Should Care

What To Try In 7 Days

Index one protocol's RFCs into a vector DB and run simple retrieval queries to validate recall.

Prototype a small 'seed enrichment' agent that inserts protocol-compliant packets into existing seeds.

Run a 24-hour comparison against your current fuzzer on a test target and compare branches and state transitions.
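
As a starting point for the seed-enrichment prototype, the sketch below inserts a missing RTSP request type into an existing seed. It is an assumption-laden illustration, not the paper's agent: the target URL, session ID, and helper names are hypothetical, and a fixed PAUSE template stands in for what an LLM agent with retrieved RFC context would generate.

```python
def split_requests(seed: bytes):
    """Split a seed into header-only RTSP requests, each ending in CRLF CRLF."""
    return [part + b"\r\n\r\n" for part in seed.split(b"\r\n\r\n") if part.strip()]

def method_of(request: bytes) -> str:
    """Return the method token that starts the request line."""
    return request.split(b" ", 1)[0].decode("ascii", "replace")

def enrich(seed: bytes, new_request: bytes, after_method: str) -> bytes:
    """Insert new_request after the first request that uses after_method.

    The seed is returned unchanged if it already contains the new method
    or if the anchor method is absent.
    """
    requests = split_requests(seed)
    methods = [method_of(r) for r in requests]
    if method_of(new_request) in methods or after_method not in methods:
        return seed
    idx = methods.index(after_method) + 1
    return b"".join(requests[:idx] + [new_request] + requests[idx:])

URL = "rtsp://127.0.0.1:8554/stream"  # hypothetical target URL
SEED = (
    f"SETUP {URL}/track1 RTSP/1.0\r\nCSeq: 1\r\n"
    "Transport: RTP/AVP;unicast;client_port=5000-5001\r\n\r\n"
    f"PLAY {URL} RTSP/1.0\r\nCSeq: 2\r\nSession: 12345678\r\n\r\n"
    f"TEARDOWN {URL} RTSP/1.0\r\nCSeq: 3\r\nSession: 12345678\r\n\r\n"
).encode("ascii")

# Stand-in for an agent-generated request; CSeq values are not renumbered here.
PAUSE = f"PAUSE {URL} RTSP/1.0\r\nCSeq: 3\r\nSession: 12345678\r\n\r\n".encode("ascii")

enriched = enrich(SEED, PAUSE, after_method="PLAY")
print([method_of(r) for r in split_requests(enriched)])
```

The enriched seed now exercises the PAUSE handler in sequence with SETUP and PLAY, which is the kind of state a purely mutation-based fuzzer may take a long time to reach on its own.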

Agent Features

Memory

  • retrieval memory via vector embeddings

Planning

  • chain-of-thought style reasoning
  • prompt-driven task decomposition

Tool Use

  • dense vector DB (Chroma)
  • CVE retrieval tool (NVD API)
  • Packet/Seeds parsing tools
  • Grammar formatting tool

Frameworks

  • LangChain
  • CrewAI

Is Agentic

true

Architectures

  • multi-agent (crew-based)
  • retrieval-augmented (RAG)

Collaboration

  • specialized crews (Grammar, Seed, Coverage)
  • shared context via dense retrieval

Optimization Features

System Optimization

  • assignment of sub-tasks to different LLMs to optimize effectiveness

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluation limited to a single protocol implementation (Live555 RTSP).
  • Relies on external LLMs and multiple large models, which adds cost and variability.
  • No public code release referenced, limiting immediate reproducibility.
  • Dense retrieval effectiveness depends on quality of RFC chunking and embeddings.

When Not To Use

  • When you must avoid external LLM calls for privacy or compliance reasons.
  • On tiny targets where the added complexity and cost outweigh marginal coverage gains.
  • If no formal protocol spec (RFC) or reliable documentation exists to index.

Failure Modes

  • LLM hallucinations producing invalid or harmful packets.
  • Irrelevant retrieval results leading agents astray.
  • Model variability causing inconsistent fuzzing performance across runs.
  • Increased operational complexity that breaks lightweight CI fuzzing pipelines.
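
One way to contain the hallucination risk is to screen LLM-generated seeds with a cheap syntactic check before they enter the corpus. The sketch below is a hypothetical filter, not part of MultiFuzz as described: it checks only the request line, the method name against the methods defined in RFC 2326, and the presence of a numeric CSeq header. It is meant for generated seeds and templates, not for mutated inputs, which are expected to be malformed.

```python
import re

# Request methods defined in RFC 2326.
RTSP_METHODS = {
    "DESCRIBE", "ANNOUNCE", "GET_PARAMETER", "OPTIONS", "PAUSE", "PLAY",
    "RECORD", "REDIRECT", "SETUP", "SET_PARAMETER", "TEARDOWN",
}

# Method SP Request-URI SP RTSP-Version
REQUEST_LINE = re.compile(r"^([A-Z_]+) (\S+) RTSP/1\.0$")

def is_plausible_rtsp_request(raw: bytes) -> bool:
    """Return True if raw looks like a syntactically sane RTSP/1.0 request."""
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        return False
    head, sep, _body = text.partition("\r\n\r\n")
    if not sep:
        return False  # header block is not terminated by an empty line
    lines = head.split("\r\n")
    match = REQUEST_LINE.match(lines[0])
    if not match or match.group(1) not in RTSP_METHODS:
        return False
    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if not colon or not name.strip():
            return False  # not a "Name: value" header line
        headers[name.strip().lower()] = value.strip()
    return headers.get("cseq", "").isdigit()

good = b"OPTIONS rtsp://127.0.0.1:8554/stream RTSP/1.0\r\nCSeq: 1\r\n\r\n"
bad = b"START rtsp://127.0.0.1:8554/stream RTSP/1.0\r\nCSeq: 1\r\n\r\n"
print(is_plausible_rtsp_request(good), is_plausible_rtsp_request(bad))
```

A filter like this rejects obviously invented methods or malformed headers, but it cannot tell whether a request is semantically valid in the current session state. That still has to be judged from the server's responses.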

Core Entities

Models

  • llama3.3-70b-versatile
  • deepseek-r1-distill-llama-70b
  • llama3-70b-8192
  • llama-4-scout-17b-16e-instruct
  • llama-3.1-8b-instant

Metrics

  • branch coverage
  • number of states
  • number of state transitions
  • unique crashes
  • total paths explored

Datasets

  • RFC-2326 (RTSP specification)
  • Live555 media streaming server (target implementation)

Benchmarks

  • ProFuzzBench