Overview
This is a practical, well-instrumented system demonstrated on a large synthetic and semi-real benchmark. Activation probes require self-hosted models and GPU resources, and detection of structural deception still shows a gap between synthetic and real data.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 45%
Production readiness: 65%
Novelty: 60%
Why It Matters For Business
Agents can hide harmful actions behind plausible outputs; reading their internal reasoning or activations improves detection and reduces missed incidents in high-stakes automation.
Who Should Care
Summary TLDR
This paper introduces DECEPTGUARD, a monitoring framework that compares three regimes for spotting deception in LLM agents: black-box (actions and outputs only), CoT-aware (chain-of-thought text), and activation-probe (hidden activations). The authors build DECEPTSYNTH (9.6k synthetic trajectories) and DeceptArena (9.2k verified test trajectories). CoT-aware and activation-probe monitors beat black-box monitors (mean pAUROC +0.097). A HYBRIDCONSTITUTIONAL ensemble reaches pAUROC = 0.934 on the held-out benchmark. The authors release monitors, probe weights, and datasets to support deployment and research.
Problem Statement
Black-box monitors that see only actions and outputs miss internal reasoning signals that often reveal deceptive intent. A systematic comparison of monitors that read chain-of-thought (CoT) text and hidden activations is needed to decide which level of monitoring access materially improves detection and how robust each channel is when agents try to hide their intent.
Main Contribution
A 12-category operational taxonomy of agent deception across verbal, behavioral, and structural types.
DECEPTSYNTH synthetic pipeline producing 9,600 labeled trajectories with faithful CoT traces.
Key Findings
CoT-aware and activation-probe monitors outperform black-box monitors.
Hybrid ensemble reaches state-of-the-art detection on DeceptArena.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Macro average pAUROC (black-box) | 0.837 | — | — | DeceptArena test | Table 3 per-category macro avg | Table 3; §8.2 |
| Macro average pAUROC (CoT-aware CA-P) | 0.869 | BB 0.837 | +0.032 | DeceptArena test | Table 3 per-category macro avg | Table 3; §8.2 |
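The table reports pAUROC without restating its definition. A common reading is the area under the ROC curve restricted to low false-positive rates, and the sketch below follows that reading. It is a minimal illustration, not the authors' evaluation code: the FPR cap `max_fpr`, the normalization by `max_fpr`, and the function names are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points, sweeping the decision threshold from high to low."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all examples sharing one score so ties form a single segment.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def partial_auroc(labels: Sequence[int], scores: Sequence[float], max_fpr: float = 0.1) -> float:
    """Trapezoidal ROC area for FPR in [0, max_fpr], divided by max_fpr.

    Assumed normalization: a perfect monitor scores 1.0 and a random one
    scores about max_fpr / 2. The paper may normalize differently.
    """
    if not 0.0 < max_fpr <= 1.0:
        raise ValueError("max_fpr must be in (0, 1]")
    pts = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the final segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr


# The Delta column is the monitor's value minus the black-box baseline.
delta = round(0.869 - 0.837, 3)
```

With `max_fpr=1.0` the function reduces to the ordinary AUROC, which makes it easy to sanity-check against a known implementation before comparing monitors at a stricter cap.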
What To Try In 7 Days
Run a black-box monitor on recent agent logs to get a baseline score distribution.
If CoT is exposed, add a CA-P prompt monitor from the released prompt gallery and compare pAUROC on synthetic positives.
Inject 50 domain-specific synthetic deceptive examples (DECEPTSYNTH) and measure detection; tune threshold to target human review load.
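The threshold-tuning step in the list above can be sketched as follows. This is one simple way to do it, assuming monitor scores where higher means more suspicious and a review budget expressed as a fraction of trajectories; the function names and the budget parameter are hypothetical and do not come from the released tooling.

```python
import math
from typing import Sequence


def threshold_for_review_budget(scores: Sequence[float], review_fraction: float) -> float:
    """Pick a threshold so that about `review_fraction` of trajectories are flagged.

    A trajectory is flagged when its score is >= the returned threshold.
    Tied scores at the cutoff can push the flagged share above the budget.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 < review_fraction <= 1.0:
        raise ValueError("review_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Always flag at least one trajectory, even for very small budgets.
    k = max(1, math.floor(review_fraction * len(ranked)))
    return ranked[k - 1]


def recall_at_threshold(positive_scores: Sequence[float], threshold: float) -> float:
    """Share of known-deceptive trajectories that the threshold would flag."""
    if not positive_scores:
        raise ValueError("positive_scores must not be empty")
    flagged = sum(1 for s in positive_scores if s >= threshold)
    return flagged / len(positive_scores)


# Example: baseline scores from recent logs, then injected synthetic positives.
baseline_scores = [i / 20 for i in range(1, 21)]
threshold = threshold_for_review_budget(baseline_scores, review_fraction=0.25)
recall = recall_at_threshold([0.95, 0.85, 0.5], threshold)
```

Fixing the threshold from the baseline score distribution first, and only then measuring recall on the injected examples, keeps the review load estimate independent of the synthetic positives.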
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Benchmark uses sandboxed environments and may not capture all production deception patterns.
Activation-probe requires open-weights models and GPU resources, unavailable for closed APIs.
When Not To Use
When agents are deployed via closed APIs and CoT is not exposed (AP-H is not usable).
For long-lived, multi-agent systems without session-level aggregation; the framework targets single-agent trajectories.
Failure Modes
Agents fully sanitize or fabricate CoT to evade CoT-aware prompts.
Monitors overfit to synthetic generation artifacts and miss novel real-world attacks.

