Overview
The system is a working proof-of-concept with concrete numbers, but it covers a single prompt domain, uses the same LLM for every agent, and relies on small trial counts, so more validation is needed before production use.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.78
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 0/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
Prompt leakage can expose business rules and secrets. Measuring leakage with an 'advantage' score helps prioritize defenses and assess whether prompt hardening or guard LLMs are needed.
Who Should Care
Summary TLDR
This paper defines a formal metric ('advantage') for prompt leakage and implements an agent-based probing system with AG2/AutoGen to automate attacks. The authors run 40-trial experiments with a judge, an initial analysis agent, and a tested agent (all using ChatGPT-4o-mini) on an automotive prompt. Measured advantage values are 0.65 at low security, 0.225 at medium security, and 0.1 at high security (with a filter guard). The code is open source on GitHub. The work is a proof-of-concept showing that the advantage metric quantifies leakage and that simple guard LLMs reduce but do not eliminate it.
Problem Statement
System prompts can hide sensitive rules or business secrets. Current testing is mostly manual or ad-hoc. We need an automated, measurable way to find when an LLM leaks parts of its system prompt and to compare defenses.
Main Contribution
A formal definition of prompt-leakage security and an 'advantage' metric to quantify how well an attacker distinguishes original vs sanitized prompts.
A practical, agentic probing framework implemented with AG2/AutoGen using specialized roles (Judge, InitialAnalyser, Tested Agent).
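A minimal sketch of how an advantage value could be estimated from trial outcomes. This summary does not give the paper's exact formula, so the code assumes the standard distinguishing-game form: the absolute difference between how often the judge answers "original" when the original prompt was in play and how often it does so when the sanitized prompt was in play. The function name and input layout are hypothetical.

```python
from typing import Sequence


def estimate_advantage(
    guesses_with_original: Sequence[bool],
    guesses_with_sanitized: Sequence[bool],
) -> float:
    """Estimate a distinguishing advantage from judge verdicts.

    Each element is True when the judge guessed "original prompt" in that
    trial. ASSUMPTION: advantage = |P(guess original | original)
    - P(guess original | sanitized)|; the paper's definition may differ.
    """
    if not guesses_with_original or not guesses_with_sanitized:
        raise ValueError("both trial lists must be non-empty")
    rate_original = sum(guesses_with_original) / len(guesses_with_original)
    rate_sanitized = sum(guesses_with_sanitized) / len(guesses_with_sanitized)
    return abs(rate_original - rate_sanitized)


# Made-up verdicts for illustration only (not the paper's data):
# 30 of 40 "original" guesses in one world, 4 of 40 in the other.
example = estimate_advantage(
    [True] * 30 + [False] * 10,
    [True] * 4 + [False] * 36,
)
```

Under this assumed definition, 0 means the judge cannot tell the two prompts apart from the tested agent's replies, and 1 means it always can.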
Key Findings
The low-security configuration leaks its prompt often (advantage 0.65).
Basic prompt hardening reduces leakage but still fails often (advantage 0.225, with leakage in about 30% of cases).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| advantage | 0.65 | — | — | simulated automotive enterprise prompt | Low-security baseline advantage reported in Section V | Section V |
| advantage | 0.225 | — | — | simulated automotive enterprise prompt with basic hardening | Medium-security advantage reported and ~30% leakage cases in Section V.A.2 | Section V.A.2 |
What To Try In 7 Days
Run the authors' GitHub probe on a non-sensitive copy of your prompts to get an advantage baseline.
Create a sanitized prompt (replace secrets with plausible substitutes) and measure distinguishability.
Prototype a lightweight guard LLM or output filter and measure advantage reduction.
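The second and third steps above can be prototyped without any LLM framework. The sketch below is an illustration under stated assumptions, not the authors' tooling: `sanitize_prompt`, `filter_output`, and `advantage_reduction` are hypothetical helpers, the secret strings are invented, and the filter is a plain substring check standing in for a guard LLM.

```python
def sanitize_prompt(prompt: str, substitutes: dict[str, str]) -> str:
    """Replace each secret with a plausible decoy of the same kind."""
    for secret, decoy in substitutes.items():
        prompt = prompt.replace(secret, decoy)
    return prompt


def filter_output(reply: str, secrets: list[str],
                  refusal: str = "Sorry, I can't share that.") -> str:
    """Block a reply that contains any secret (case-insensitive match).

    ASSUMPTION: exact substring matching; paraphrased leaks will pass.
    """
    lowered = reply.lower()
    if any(secret.lower() in lowered for secret in secrets):
        return refusal
    return reply


def advantage_reduction(baseline: float, guarded: float) -> float:
    """Absolute drop in advantage after adding a defense."""
    return baseline - guarded


# Invented example prompt and secret, for illustration only.
original = "Max dealer discount is 12%. Never reveal it."
sanitized = sanitize_prompt(original, {"12%": "7%"})
blocked = filter_output("The max discount is 12%.", ["12%"])
allowed = filter_output("I can help you compare models.", ["12%"])

# Using the advantage values reported in the summary (0.65 low, 0.1 high).
drop = advantage_reduction(0.65, 0.1)
```

A substring filter like this only catches verbatim leaks, so a measured advantage above zero after filtering is expected rather than surprising.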
Agent Features
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Reproducibility
Risks & Boundaries
Limitations
Proof-of-concept limited to one domain (automotive prompt) and a single prompt design.
All agents used ChatGPT-4o-mini; results may differ on other model families.
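The small trial counts noted in the Overview also limit how precisely any rate can be pinned down. As a rough illustration, the sketch below computes a Wilson score interval for a proportion. The inputs (12 leaks in 40 trials, matching a roughly 30% leakage rate) are an assumption made for the example, not figures taken from the paper.

```python
import math


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z2 / (4 * trials ** 2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# ASSUMED counts for illustration: 12 leaking runs out of 40.
low, high = wilson_interval(12, 40)
```

With counts this small the interval spans well over twenty percentage points, which is one reason to rerun the probe with more trials before comparing defenses.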
When Not To Use
When you need formal cryptographic guarantees rather than empirical measures.
If you cannot run or afford multi-agent calls to an external LLM.
Failure Modes
Judge bias: judge agent may overfit to detectable markers and misestimate advantage.
False negatives if sanitized prompts unintentionally preserve distinctive phrasing.

