Use multi-agent LLM teams to automatically probe and measure prompt leakage

February 18, 2025 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Tvrtko Sternak, Davor Runje, Dorian Granoša, Chi Wang

Links

Abstract / PDF

Why It Matters For Business

Prompt leakage can expose business rules and secrets. Measuring leakage with an 'advantage' score helps prioritize defenses and assess whether prompt hardening or guard LLMs are needed.

Summary TLDR

This paper defines a formal metric ('advantage') for prompt leakage and implements an agent-based probing system using AG2/AutoGen to automate attacks. The authors run 40-trial experiments with a judge, an initial analysis agent, and a tested agent (all using ChatGPT-4o-mini) on an automotive prompt. Measured advantage values: low security 0.65, medium 0.225, high (with filter guard) 0.1. Code is open-source on GitHub. The work is a proof-of-concept showing advantage quantifies leakage and that simple guard LLMs reduce but do not eliminate leakage.

Problem Statement

System prompts can hide sensitive rules or business secrets. Current testing is mostly manual or ad-hoc. We need an automated, measurable way to find when an LLM leaks parts of its system prompt and to compare defenses.

Main Contribution

A formal definition of prompt-leakage security and an 'advantage' metric to quantify how well an attacker distinguishes original vs sanitized prompts.

A practical, agentic probing framework implemented with AG2/AutoGen using specialized roles (Judge, InitialAnalyser, Tested Agent).

Empirical baseline results showing advantage values for three security setups (low/medium/high) on a realistic prompt.

Open-source implementation and instructions published on GitHub for reproducing the probing pipeline.
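The advantage metric can be illustrated with a short calculation. The sketch below assumes the standard distinguishing-advantage form: the absolute difference between how often the judge guesses "original" when the tested agent holds the original prompt and how often it guesses "original" when the agent holds the sanitized prompt. The paper's exact definition may differ, and the function name and trial verdicts are hypothetical.

```python
def advantage(original_flags, sanitized_flags):
    """Estimate distinguishing advantage from judge verdicts.

    Each argument is a list of booleans, one per trial; True means the
    judge guessed "original prompt" for that trial.
    ASSUMPTION: advantage = |P(guess original | original prompt)
                            - P(guess original | sanitized prompt)|.
    """
    if not original_flags or not sanitized_flags:
        raise ValueError("need at least one trial per condition")
    p_orig = sum(original_flags) / len(original_flags)
    p_san = sum(sanitized_flags) / len(sanitized_flags)
    return abs(p_orig - p_san)


# Hypothetical verdicts, 40 trials per condition (not the paper's data).
original_trials = [True] * 30 + [False] * 10
sanitized_trials = [True] * 10 + [False] * 30
score = advantage(original_trials, sanitized_trials)
print(f"advantage = {score:.3f}")
```

Under this assumed form, a score near 0 means the judge cannot tell the two prompts apart, while a score near 1 means the original prompt is almost always recognizable from the agent's answers.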

Key Findings

Low-security setups leak prompts often.

Numbers: Advantage = 0.65 (Section V)

Basic prompt hardening reduces leakage but still fails often.

Numbers: Advantage = 0.225; ~30% of attacks still revealed data (Section V.A.2)

A filter/guard LLM lowers the attacker's advantage further, but does not eliminate leakage.

Numbers: Advantage = 0.10 (Section V)
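To make the guard idea concrete, here is a minimal sketch of an output filter placed between the tested agent and the user. It swaps the guard LLM for a plain case-insensitive substring check, which is a much weaker stand-in, and the function name, secrets, and refusal text are all hypothetical.

```python
import re


def _normalize(text):
    # Collapse whitespace and lowercase so trivial reformatting does not evade the check.
    return re.sub(r"\s+", " ", text).strip().lower()


def guard_filter(response, secrets, refusal="I can't share that."):
    """Return the response unless it contains one of the secret strings.

    ASSUMPTION: a substring check stands in for the guard LLM; a real guard
    would also need to catch paraphrases and partial leaks.
    """
    normalized = _normalize(response)
    for secret in secrets:
        needle = _normalize(secret)
        if needle and needle in normalized:
            return refusal
    return response


# Hypothetical secrets taken from an imagined automotive system prompt.
secrets = ["max discount 15%", "dealer code ZX-42"]
safe = guard_filter("Our cars come with a 3-year warranty.", secrets)
blocked = guard_filter("Sure, the Max  Discount 15% applies.", secrets)
print(safe)
print(blocked)
```

A string filter like this only blocks verbatim leaks, so it gives a lower bound on what a guard can catch.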

Results

advantage (low security): 0.65

advantage (medium security): 0.225

advantage (high security, with filter guard): 0.10

Who Should Care

What To Try In 7 Days

Run the authors' GitHub probe on a non-sensitive copy of your prompts to get an advantage baseline.

Create a sanitized prompt (replace secrets with plausible substitutes) and measure distinguishability.

Prototype a lightweight guard LLM or output filter and measure advantage reduction.
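The sanitized-prompt step can be sketched as a simple substitution pass. This is an illustration only: the helper name, the example prompt, and the decoy values are hypothetical, and the source notes that automated substitution is left for future work, so the mapping here is written by hand.

```python
def sanitize_prompt(prompt, substitutions):
    """Replace each secret in the prompt with a plausible decoy.

    `substitutions` maps secret strings to hand-picked decoys.
    Longer secrets are replaced first so that a secret contained in
    another secret does not break the longer match.
    """
    sanitized = prompt
    for secret in sorted(substitutions, key=len, reverse=True):
        sanitized = sanitized.replace(secret, substitutions[secret])
    return sanitized


# Hypothetical system prompt and decoys (not taken from the paper).
system_prompt = (
    "You sell cars. Never offer more than 15% discount. "
    "Internal dealer code: ZX-42."
)
decoys = {"15%": "10%", "ZX-42": "QK-17"}
sanitized_prompt = sanitize_prompt(system_prompt, decoys)
print(sanitized_prompt)
```

Decoys should have the same type and plausibility as the secrets they replace; otherwise the two prompts may be distinguishable for reasons unrelated to leakage.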

Agent Features

Planning

  • judge-driven adaptive questioning
  • iterative probe generation

Tool Use

  • function calls to prompt_agent
  • GitHub implementation scripts

Frameworks

  • AG2 (AutoGen)

Is Agentic

true

Architectures

  • multi-agent GroupChat

Collaboration

  • specialized cooperative roles (Judge, InitialAnalyser, Tested Agent)

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Proof-of-concept limited to one domain (automotive prompt) and a single prompt design.
  • All agents used ChatGPT-4o-mini; results may differ on other model families.
  • Experiment size is modest (40 trials per agent), limiting statistical strength.
  • Sanitized-prompt generation and automated substitution are noted but left for future work.
  • Advantage thresholds are user-defined and not universally validated.
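To see why 40 trials per agent limits statistical strength, it helps to estimate the uncertainty of a difference between two proportions. The sketch below uses the textbook normal-approximation standard error and assumes advantage is computed as such a difference; the rates are hypothetical and are not the paper's data.

```python
import math


def advantage_std_error(p_orig, p_san, n_orig, n_san):
    """Standard error of the difference between two independent proportions.

    ASSUMPTION: advantage is estimated as p_orig - p_san, with trials
    independent within and across the two conditions.
    """
    var_orig = p_orig * (1 - p_orig) / n_orig
    var_san = p_san * (1 - p_san) / n_san
    return math.sqrt(var_orig + var_san)


# Hypothetical judge hit rates with 40 trials per condition.
se = advantage_std_error(0.75, 0.25, 40, 40)
margin = 1.96 * se  # approximate 95% margin under the normal approximation
print(f"standard error = {se:.3f}, 95% margin = +/-{margin:.3f}")
```

With these illustrative rates the margin is close to 0.19, so two defenses whose measured advantages differ by less than that would be hard to rank with 40 trials each.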

When Not To Use

  • When you need formal cryptographic guarantees rather than empirical measures.
  • If you cannot run or afford multi-agent calls to an external LLM.
  • When your threat model requires retraining the base model rather than probing.

Failure Modes

  • Judge bias: judge agent may overfit to detectable markers and misestimate advantage.
  • False negatives if sanitized prompts unintentionally preserve distinctive phrasing.
  • Limited generalization across domains and different LLM architectures.
  • Over-reliance on a single defense (prompt hardening) can give false confidence.

Core Entities

Models

  • ChatGPT-4o-mini

Metrics

  • advantage (distinguishability between original and sanitized prompts)