Use multi-agent LLM teams to automatically probe and measure prompt leakage

Overview

Decision SnapshotNeeds Validation

The system is a working proof-of-concept with concrete numbers, but it uses a single prompt domain, one agent LLM type, and small trial counts, so more validation is needed before production.

Citations0

Evidence Strength0.70

Confidence0.78

Risk Signals12

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Tvrtko Sternak, Davor Runje, Dorian Granoša, Chi Wang

Links

Abstract / PDF / Code

Why It Matters For Business

Prompt leakage can expose business rules and secrets. Measuring leakage with an 'advantage' score helps prioritize defenses and assess whether prompt hardening or guard LLMs are needed.

Who Should Care

CTO ML Engineer Engineering Lead Product Manager Data Scientist

Summary TLDR

This paper defines a formal metric ('advantage') for prompt leakage and implements an agent-based probing system using AG2/AutoGen to automate attacks. The authors run 40-trial experiments with a judge, an initial analysis agent, and a tested agent (all using ChatGPT-4o-mini) on an automotive prompt. Measured advantage values: low security 0.65, medium 0.225, high (with filter guard) 0.1. Code is open-source on GitHub. The work is a proof-of-concept showing advantage quantifies leakage and that simple guard LLMs reduce but do not eliminate leakage.

Problem Statement

System prompts can hide sensitive rules or business secrets. Current testing is mostly manual or ad-hoc. We need an automated, measurable way to find when an LLM leaks parts of its system prompt and to compare defenses.

Main Contribution

A formal definition of prompt-leakage security and an 'advantage' metric to quantify how well an attacker distinguishes original vs sanitized prompts.

A practical, agentic probing framework implemented with AG2/AutoGen using specialized roles (Judge, InitialAnalyser, Tested Agent).

Key Findings

Low-security models leak prompts often.

NumbersAdvantage = 0.65 (Section V)

Practical UseIf you deploy models without hardening, attackers using automated agents can likely extract sensitive prompt details; add defenses before production.

Evidence RefSection V

Basic prompt hardening reduces leakage but still fails often.

NumbersAdvantage = 0.225; ~30% of attacks still revealed data (Section V.A.2)

Practical UsePrompt engineering alone is not sufficient; treat it as partial protection and combine with other controls.

Evidence RefSection V.A.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
advantage	0.65	—	—	simulated automotive enterprise prompt	Low-security baseline advantage reported in Section V	Section V
advantage	0.225	—	—	simulated automotive enterprise prompt with basic hardening	Medium-security advantage reported and ~30% leakage cases in Section V.A.2	Section V.A.2

What To Try In 7 Days

Run the authors' GitHub probe on a non-sensitive copy of your prompts to get an advantage baseline.

Create a sanitized prompt (replace secrets with plausible substitutes) and measure distinguishability.

Prototype a lightweight guard LLM or output filter and measure advantage reduction.

Agent Features

Planning

judge-driven adaptive questioningiterative probe generation

Tool Use

function calls to prompt_agentGitHub implementation scripts

Frameworks

AG2 (AutoGen)

Is Agentic

Yes

Architectures

multi-agent GroupChat

Collaboration

specialized cooperative roles (Judge, InitialAnalyser, Tested Agent)

Reproducibility

Code AvailableYes

Data AvailableNo

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/sternakt/prompt-leakage-probing

Risks & Boundaries

Limitations

Proof-of-concept limited to one domain (automotive prompt) and a single prompt design.

All agents used ChatGPT-4o-mini; results may differ on other model families.

When Not To Use

When you need formal cryptographic guarantees rather than empirical measures.

If you cannot run or afford multi-agent calls to an external LLM.

Failure Modes

Judge bias: judge agent may overfit to detectable markers and misestimate advantage.

False negatives if sanitized prompts unintentionally preserve distinctive phrasing.

Use multi-agent LLM teams to automatically probe and measure prompt leakage

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Low-security models leak prompts often.

Basic prompt hardening reduces leakage but still fails often.

Results

What To Try In 7 Days

Agent Features

Reproducibility

Code URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Low-security models leak prompts often.

Basic prompt hardening reduces leakage but still fails often.

Results

What To Try In 7 Days

Agent Features

Reproducibility

Code URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

You May Also Want to Read

AdversaRiskQA: adversarial factuality benchmark for health, finance, and law

Key finding

Short, natural-looking token sequences can flip LLM judges to say 'Yes' on wrong answers; discovery and a small LoRA defense

Key finding

FACT-BENCH: a 20K-question benchmark that reveals when LLMs forget facts and how exemplars can make them lie

Key finding

RWKU: a stress test for forgetting real-world facts in LLMs using 200 real-person targets and adversarial probes

Key finding

Short adversarial suffixes can flip LLM-as-a-Judge decisions; CUA >30% success

Key finding