Use multi-agent LLM teams to automatically probe and measure prompt leakage

February 18, 20256 min

Overview

Decision SnapshotNeeds Validation

The system is a working proof-of-concept with concrete numbers, but it uses a single prompt domain, one agent LLM type, and small trial counts, so more validation is needed before production.

Citations0

Evidence Strength0.70

Confidence0.78

Risk Signals12

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Tvrtko Sternak, Davor Runje, Dorian Granoša, Chi Wang

Links

Abstract / PDF / Code

Why It Matters For Business

Prompt leakage can expose business rules and secrets. Measuring leakage with an 'advantage' score helps prioritize defenses and assess whether prompt hardening or guard LLMs are needed.

Who Should Care

Summary TLDR

This paper defines a formal metric ('advantage') for prompt leakage and implements an agent-based probing system using AG2/AutoGen to automate attacks. The authors run 40-trial experiments with a judge, an initial analysis agent, and a tested agent (all using ChatGPT-4o-mini) on an automotive prompt. Measured advantage values: low security 0.65, medium 0.225, high (with filter guard) 0.1. Code is open-source on GitHub. The work is a proof-of-concept showing advantage quantifies leakage and that simple guard LLMs reduce but do not eliminate leakage.

Problem Statement

System prompts can hide sensitive rules or business secrets. Current testing is mostly manual or ad-hoc. We need an automated, measurable way to find when an LLM leaks parts of its system prompt and to compare defenses.

Main Contribution

A formal definition of prompt-leakage security and an 'advantage' metric to quantify how well an attacker distinguishes original vs sanitized prompts.

A practical, agentic probing framework implemented with AG2/AutoGen using specialized roles (Judge, InitialAnalyser, Tested Agent).

Key Findings

Low-security models leak prompts often.

NumbersAdvantage = 0.65 (Section V)

Practical UseIf you deploy models without hardening, attackers using automated agents can likely extract sensitive prompt details; add defenses before production.

Evidence RefSection V

Basic prompt hardening reduces leakage but still fails often.

NumbersAdvantage = 0.225; ~30% of attacks still revealed data (Section V.A.2)

Practical UsePrompt engineering alone is not sufficient; treat it as partial protection and combine with other controls.

Evidence RefSection V.A.2

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
advantage0.65simulated automotive enterprise promptLow-security baseline advantage reported in Section VSection V
advantage0.225simulated automotive enterprise prompt with basic hardeningMedium-security advantage reported and ~30% leakage cases in Section V.A.2Section V.A.2

What To Try In 7 Days

Run the authors' GitHub probe on a non-sensitive copy of your prompts to get an advantage baseline.

Create a sanitized prompt (replace secrets with plausible substitutes) and measure distinguishability.

Prototype a lightweight guard LLM or output filter and measure advantage reduction.

Agent Features

Planning
judge-driven adaptive questioningiterative probe generation
Tool Use
function calls to prompt_agentGitHub implementation scripts
Frameworks
AG2 (AutoGen)
Is Agentic

Yes

Architectures
multi-agent GroupChat
Collaboration
specialized cooperative roles (Judge, InitialAnalyser, Tested Agent)

Reproducibility

Code AvailableYes
Data AvailableNo
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Proof-of-concept limited to one domain (automotive prompt) and a single prompt design.

All agents used ChatGPT-4o-mini; results may differ on other model families.

When Not To Use

When you need formal cryptographic guarantees rather than empirical measures.

If you cannot run or afford multi-agent calls to an external LLM.

Failure Modes

Judge bias: judge agent may overfit to detectable markers and misestimate advantage.

False negatives if sanitized prompts unintentionally preserve distinctive phrasing.

Core Entities

Models

ChatGPT-4o-mini

Metrics

advantage (distinguishability between original and sanitized prompts)