Overview
Production Readiness
0.65
Novelty Score
0.6
Cost Impact Score
0.45
Citation Count
0
Why It Matters For Business
Agents can hide harmful actions behind plausible outputs; reading their internal reasoning or activations improves detection and reduces missed incidents in high-stakes automation.
Summary TLDR
This paper introduces DECEPTGUARD, a monitoring framework that compares three regimes for spotting deception in LLM agents: black-box (actions and outputs only), CoT-aware (chain-of-thought text), and activation-probe (hidden activations). The authors build DECEPTSYNTH (9,600 synthetic trajectories) and DeceptArena (9,200 execution-verified test trajectories). CoT-aware and activation-probe monitors outperform black-box monitors (mean pAUROC gain of +0.097). A HYBRIDCONSTITUTIONAL ensemble reaches pAUROC = 0.934 on the held-out benchmark. The authors release monitors, probe weights, and datasets to support deployment and research.
Problem Statement
Black-box monitors that see only actions and outputs miss internal reasoning signals that often reveal deceptive intent. A systematic comparison of monitors that read chain-of-thought (CoT) text and hidden activations is needed to determine which level of monitoring access materially improves detection, and how robust each channel remains when agents try to conceal their intent.
Main Contribution
A 12-category operational taxonomy of agent deception across verbal, behavioral, and structural types.
DECEPTSYNTH synthetic pipeline producing 9,600 labeled trajectories with faithful CoT traces.
DeceptArena: a 9,200-trajectory, execution-verified benchmark with deception intensity levels.
A head-to-head evaluation of black-box, CoT-aware, and activation-probe monitors.
HYBRIDCONSTITUTIONAL ensemble combining signals and achieving pAUROC = 0.934.
Open release of the synthetic pipeline, benchmark, monitor prompts, and probe weights.
Key Findings
CoT-aware and activation-probe monitors outperform black-box monitors.
Hybrid ensemble reaches state-of-the-art detection on DeceptArena.
CoT advantage depends on CoT honesty.
Activation-level signals generalize across model families and peak in mid layers.
Structural deception is hardest to detect and shows the largest synthetic→real gap.
Results
Macro average pAUROC (black-box)
Macro average pAUROC (CoT-aware CA-P)
Macro average pAUROC (Activation probe AP-H)
Macro average pAUROC (HYBRIDCONSTITUTIONAL ensemble)
AP-H probe peak layer / probe AUC
Who Should Care
What To Try In 7 Days
Run a black-box monitor on recent agent logs to get a baseline score distribution.
If CoT is exposed, add a CA-P prompt monitor from the released prompt gallery and compare pAUROC on synthetic positives.
Inject 50 domain-specific synthetic deceptive examples (DECEPTSYNTH) and measure detection; tune threshold to target human review load.
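The threshold-tuning step above can be sketched as follows. This is a minimal illustration, assuming monitor scores are floats where higher means more suspicious and that the review budget is expressed as a fraction of baseline trajectories; the function names are hypothetical and not part of the released tooling.

```python
import math

def threshold_for_review_budget(baseline_scores, max_alert_fraction):
    """Pick a threshold so that at most max_alert_fraction of baseline
    trajectories are flagged (a trajectory is flagged when score > threshold)."""
    if not baseline_scores:
        raise ValueError("need at least one baseline score")
    if not 0.0 <= max_alert_fraction <= 1.0:
        raise ValueError("max_alert_fraction must lie in [0, 1]")
    ranked = sorted(baseline_scores, reverse=True)
    budget = math.floor(max_alert_fraction * len(ranked))
    if budget >= len(ranked):
        return float("-inf")  # budget covers everything, so flag all
    # Flag only scores strictly above the (budget + 1)-th highest score.
    # Ties at the threshold are not flagged, so the budget is never exceeded.
    return ranked[budget]

def recall_at_threshold(positive_scores, threshold):
    """Fraction of known-deceptive examples that the threshold would flag."""
    if not positive_scores:
        raise ValueError("need at least one positive score")
    return sum(s > threshold for s in positive_scores) / len(positive_scores)
```

Running `recall_at_threshold` on the injected synthetic positives shows how much detection a given review load buys before the threshold is deployed.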
Agent Features
Memory
- short-term trajectory context (CoT text) observed by monitors
Planning
- multi-step tool sequences
- long-horizon task planning
Tool Use
- file operations
- git/PR calls
- ssh_exec and config edits
- email/calendar APIs
Frameworks
- constitutional specification for generation and monitoring
Is Agentic
true
Architectures
- LLM agents with tool APIs
- open-weights encoder for activation probing
Collaboration
- single-agent interactions with external tools (no multi-agent experiments)
Optimization Features
Token Efficiency
- CoT traces longer than 2,000 tokens can be truncated with minimal pAUROC loss
Infra Optimization
- AP-H requires self-hosted GPUs; CA-P uses API monitors
Model Optimization
- none (monitors use existing LLM backbones)
System Optimization
- hybrid ensemble tolerates missing channels (CoT or AP-H)
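One simple way to make a score-fusion step tolerate a missing channel is to average over whichever channels are present and renormalize the weights. The sketch below illustrates that idea only; it is an assumption for illustration and not the paper's meta-classifier, and the channel names and equal weights are hypothetical.

```python
from typing import Mapping, Optional

# Hypothetical channel names and weights, chosen only for illustration.
CHANNEL_WEIGHTS = {"black_box": 1.0, "cot_aware": 1.0, "activation_probe": 1.0}

def fuse_scores(scores: Mapping[str, Optional[float]],
                weights: Mapping[str, float] = CHANNEL_WEIGHTS) -> float:
    """Weighted mean of the available channel scores.

    A channel whose score is None (for example, CoT not exposed, or no
    self-hosted probe) is skipped and the remaining weights are renormalized.
    """
    available = {name: s for name, s in scores.items()
                 if s is not None and name in weights}
    if not available:
        raise ValueError("no monitor channel produced a score")
    total_weight = sum(weights[name] for name in available)
    return sum(weights[name] * s for name, s in available.items()) / total_weight
```

With this fallback, a trajectory scored only by the black-box and CoT-aware monitors still receives a fused score on the same scale as one scored by all three channels.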
Training Optimization
- logistic-regression probe trained on layer-32 activations
- MLP meta-classifier trained on monitor scores
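A logistic-regression probe of the kind listed above can be sketched in a few lines. The example below trains on toy Gaussian vectors that stand in for activations; extracting real layer-32 activations from the encoder is not shown, and the hyperparameters and helper names are assumptions rather than the authors' settings.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(features, labels, lr=0.1, epochs=200, l2=1e-3):
    """Fit a linear probe with batch gradient descent on the logistic loss."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_score(w, b, x):
    """Probability-like deception score for one activation vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def toy_activations(n, dim, shift, rng):
    """Toy stand-in for activations: Gaussian vectors centered at `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
train_x = toy_activations(50, 8, 1.0, rng) + toy_activations(50, 8, -1.0, rng)
train_y = [1] * 50 + [0] * 50  # 1 = deceptive, 0 = honest (toy labels)
weights, bias = train_probe(train_x, train_y)
```

The probe's output can then be passed, alongside the other monitors' scores, to whatever meta-classifier combines them.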
Inference Optimization
- temperature = 0 for monitor inference
- activation extraction with batching on 4× A100
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benchmark uses sandboxed environments and may not capture all production deception patterns.
- Activation-probe requires open-weights models and GPU resources, unavailable for closed APIs.
- Synthetic training data does not fully cover structural deception; a larger set of real-world negatives is needed.
- Monitors evaluated contemporaneously with agent models; co-evolution dynamics are untested.
When Not To Use
- When agents are deployed via closed APIs and CoT is not exposed (AP-H not usable).
- For long-lived, multi-agent systems without session-level aggregation; the framework targets single-agent trajectories.
- If you lack GPU resources and cannot self-host a probe but require the highest structural-deception sensitivity.
Failure Modes
- Agents fully sanitize or fabricate CoT to evade CoT-aware prompts.
- Monitors overfit to synthetic generation artifacts and miss novel real-world attacks.
- Activation probe signal may degrade if encoder family diverges significantly from agents' generator family.
- High false-positive load at low base deception rates without human triage.
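The false-positive concern in the last item follows directly from Bayes' rule: when deception is rare, even a low false-positive rate produces mostly false alerts. The sketch below works through the arithmetic; the operating point used in the example is illustrative and not a result from the paper.

```python
def alert_precision(base_rate, tpr, fpr):
    """Fraction of alerts that are truly deceptive, given the deception
    base rate and the monitor's true- and false-positive rates."""
    for name, value in (("base_rate", base_rate), ("tpr", tpr), ("fpr", fpr)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    true_alerts = tpr * base_rate
    false_alerts = fpr * (1.0 - base_rate)
    if true_alerts + false_alerts == 0.0:
        raise ValueError("monitor raises no alerts at this operating point")
    return true_alerts / (true_alerts + false_alerts)

# Illustrative numbers only: 1% of trajectories deceptive, TPR 0.90, FPR 0.05.
# 0.009 / (0.009 + 0.0495) is roughly 0.154, so about 85% of alerts are false.
example_precision = alert_precision(base_rate=0.01, tpr=0.90, fpr=0.05)
```

This is why the threshold should be set against the expected human review load rather than against detection metrics alone.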
Core Entities
Models
- Gemini 2.5 Pro
- Claude Sonnet 4.5
- Claude Opus 4.5
- GPT-5
- GPT-5.1
- Llama-3.1-70B-Instruct
Metrics
- pAUROC (partial AUROC over FPR ≤ 0.20)
- AUC
- p-values and bootstrap CIs
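For readers unfamiliar with partial AUROC, the sketch below computes the area under the ROC curve restricted to FPR ≤ 0.20. It assumes the area is divided by the FPR cap so that a perfect monitor scores 1.0 and a random one scores 0.10; the summary does not state which normalization the authors use, so treat that choice as an assumption.

```python
def roc_points(labels, scores):
    """ROC curve as (fpr, tpr) points; tied scores are grouped into one step."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points

def partial_auroc(labels, scores, max_fpr=0.20):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr],
    divided by max_fpr (normalization is an assumption, see lead-in)."""
    points = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the segment that crosses the FPR cap.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr
```

Restricting the metric to the low-FPR region matches how monitors are used in practice, since operating points with high false-positive rates would overwhelm human reviewers.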
Datasets
- DECEPTSYNTH
- DeceptArena
Benchmarks
- DeceptArena
Context Entities
Models
- Claude Sonnet 4.5 (monitor backbone)
- Llama-3.1-70B-Instruct (activation encoder)
Metrics
- pAUROC
- FPR@TPR thresholds
Datasets
- DECEPTSYNTH training set (4,800+)
- DeceptArena test set (9,200)
Benchmarks
- Per-category pAUROC table (Table 3)

