DeceptGuard: detect agent deception by reading CoT text and activation probes

March 14, 2026 · 8 min

Overview

Production Readiness

0.65

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

0

Authors

Snehasis Mukhopadhyay

Links

Abstract / PDF

Why It Matters For Business

Agents can hide harmful actions behind plausible outputs; reading their internal reasoning or activations improves detection and reduces missed incidents in high-stakes automation.

Summary TLDR

This paper introduces DECEPTGUARD, a monitoring framework that compares three regimes for spotting deception in LLM agents: black-box (actions and outputs only), CoT-aware (chain-of-thought text), and activation-probe (hidden activations). The authors build DECEPTSYNTH (9.6k synthetic trajectories) and DeceptArena (9.2k execution-verified test trajectories). CoT-aware and activation-probe monitors both beat the black-box baseline (macro pAUROC 0.869 and 0.901 vs. 0.837), and a HYBRIDCONSTITUTIONAL ensemble reaches pAUROC = 0.934 on the held-out benchmark, a +0.097 gain over black-box. The authors release monitors, probe weights, and datasets to support deployment and research.

Problem Statement

Black-box monitors that see only actions and outputs miss internal reasoning signals that often reveal deceptive intent. A systematic comparison of monitors that read chain-of-thought (CoT) text and hidden activations is needed to determine which level of monitoring access materially improves detection, and how robust each channel is when agents try to hide their intent.

Main Contribution

A 12-category operational taxonomy of agent deception across verbal, behavioral, and structural types.

DECEPTSYNTH synthetic pipeline producing 9,600 labeled trajectories with faithful CoT traces.

DeceptArena: a 9,200-trajectory, execution-verified benchmark with deception intensity levels.

A head-to-head evaluation of black-box, CoT-aware, and activation-probe monitors.

HYBRIDCONSTITUTIONAL ensemble combining signals and achieving pAUROC = 0.934.

Open release of the synthetic pipeline, benchmark, monitor prompts, and probe weights.

Key Findings

CoT-aware and activation-probe monitors outperform black-box monitors.

Numbers: mean pAUROC +0.097 (BB→Hybrid)

Hybrid ensemble reaches state-of-the-art detection on DeceptArena.

Numbers: Hybrid pAUROC = 0.934 on held-out test

CoT advantage depends on CoT honesty.

Numbers: ∆ pAUROC: Honest +0.162, Sanitized +0.071, Deceptive +0.031

Activation-level signals generalize across model families and peak in mid layers.

Numbers: probe logistic AUC ≈ 0.924; optimal layer ℓ* = 32

Structural deception is hardest to detect and shows the largest synthetic→real gap.

Numbers: Train→Test gap up to −0.074 pAUROC (S4)

Results

Macro average pAUROC (black-box)

Value: 0.837

Macro average pAUROC (CoT-aware CA-P)

Value: 0.869

Baseline: BB 0.837

Macro average pAUROC (Activation probe AP-H)

Value: 0.901

Baseline: BB 0.837

Macro average pAUROC (HYBRIDCONSTITUTIONAL ensemble)

Value: 0.934

Baseline: BB 0.837

AP-H probe peak layer / probe AUC

Value: ℓ* = 32; probe AUC ≈ 0.924 (logistic)
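The per-monitor gains over the black-box baseline follow directly from the macro averages above. A small sketch of that arithmetic; the dictionary keys are labels chosen here for illustration, and the values are the ones reported in this section:

```python
# Macro-average pAUROC values as reported in the Results above.
MACRO_PAUROC = {
    "black-box": 0.837,
    "CA-P": 0.869,
    "AP-H": 0.901,
    "hybrid": 0.934,
}


def gain_over_black_box(results):
    """Return each monitor's pAUROC gain over the black-box baseline."""
    baseline = results["black-box"]
    return {
        name: round(value - baseline, 3)
        for name, value in results.items()
        if name != "black-box"
    }


gains = gain_over_black_box(MACRO_PAUROC)
for name, delta in gains.items():
    print(f"{name}: +{delta:.3f}")
```

The +0.097 figure quoted in Key Findings is the black-box to hybrid gap; CA-P and AP-H on their own account for +0.032 and +0.064.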

Who Should Care

What To Try In 7 Days

Run a black-box monitor on recent agent logs to get a baseline score distribution.

If CoT is exposed, add a CA-P prompt monitor from the released prompt gallery and compare pAUROC on synthetic positives.

Inject 50 domain-specific synthetic deceptive examples (DECEPTSYNTH) and measure detection; tune threshold to target human review load.
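One way to tune the threshold to a review load is to pick the lowest threshold that flags no more trajectories than reviewers can handle. The sketch below assumes monitor scores where higher means more suspicious; the function name and the `max_flags` parameter are illustrative and not part of the released tooling.

```python
def threshold_for_review_budget(scores, max_flags):
    """Lowest threshold t such that at most max_flags scores satisfy score >= t.

    Returns float("inf") (flag nothing) if even the top score would
    exceed the budget, for example when max_flags is 0.
    """
    best = float("inf")
    # Walk candidate thresholds from strictest to loosest.
    for t in sorted(set(scores), reverse=True):
        flagged = sum(s >= t for s in scores)
        if flagged > max_flags:
            break  # loosening further would overload reviewers
        best = t
    return best


# Toy score distribution from a baseline run (made-up numbers).
baseline_scores = [0.1, 0.2, 0.9, 0.8, 0.8, 0.3]
t = threshold_for_review_budget(baseline_scores, max_flags=2)
print("threshold:", t)
```

Tied scores are treated as one step, so the budget is never exceeded: with two trajectories tied at 0.8, a budget of 2 keeps the threshold at 0.9 rather than flagging three items.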

Agent Features

Memory

  • short-term trajectory context (CoT text) observed by monitors

Planning

  • multi-step tool sequences
  • long-horizon task planning

Tool Use

  • file operations
  • git/PR calls
  • ssh_exec and config edits
  • email/calendar APIs

Frameworks

  • constitutional specification for generation and monitoring

Is Agentic

true

Architectures

  • LLM agents with tool APIs
  • open-weights encoder for activation probing

Collaboration

  • single-agent interactions with external tools (no multi-agent experiments)

Optimization Features

Token Efficiency

  • truncate CoT traces longer than 2,000 tokens with minimal pAUROC loss

Infra Optimization

  • AP-H requires self-hosted GPUs; CA-P uses API monitors

Model Optimization

  • none (monitors use existing LLM backbones)

System Optimization

  • hybrid ensemble tolerates missing channels (CoT or AP-H)
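To illustrate what tolerating a missing channel can look like, here is a deliberately simple stand-in: a weighted mean renormalized over whichever channels produced a score. This is not the paper's combiner (the summary describes an MLP meta-classifier trained on monitor scores); the channel names and weights below are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical channel weights, chosen for illustration only.
DEFAULT_WEIGHTS = {"black_box": 0.2, "cot": 0.3, "activation": 0.5}


def combine_scores(scores: Dict[str, Optional[float]],
                   weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted mean over the channels that actually produced a score."""
    available = {
        name: value
        for name, value in scores.items()
        if value is not None and name in weights
    }
    if not available:
        raise ValueError("no monitor channel produced a score")
    # Renormalize so missing channels do not drag the score toward zero.
    total_weight = sum(weights[name] for name in available)
    return sum(weights[name] * value for name, value in available.items()) / total_weight


full = combine_scores({"black_box": 0.4, "cot": 0.8, "activation": 0.9})
no_probe = combine_scores({"black_box": 0.4, "cot": 0.8, "activation": None})
print(f"all channels: {full:.2f}, without AP-H: {no_probe:.2f}")
```

Renormalizing keeps scores on the same scale whether or not a channel is present, so a single alerting threshold can still be applied when, say, the agent runs behind a closed API.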

Training Optimization

  • logistic-regression probe trained on layer-32 activations
  • MLP meta-classifier trained on monitor scores
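A minimal sketch of a logistic-regression probe, assuming each trajectory has already been reduced to one fixed-length activation vector (how layer-32 activations are pooled is not stated here, so that step is left out). The toy Gaussian data, learning rate, epoch count, and function names are all illustrative assumptions, and the training loop is plain gradient descent rather than whatever solver the authors used.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    """Probability-like deception score for one activation vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(X, y, lr=0.1, epochs=200):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = probe_score(w, b, x) - label  # gradient of log-loss w.r.t. logit
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy stand-in for activation vectors: two Gaussian clusters in 8 dimensions.
rng = random.Random(0)


def sample(mean, count, dim=8):
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(count)]


X = sample(+1.0, 100) + sample(-1.0, 100)  # deceptive vs. benign (synthetic)
y = [1] * 100 + [0] * 100
w, b = train_probe(X, y)
accuracy = sum((probe_score(w, b, x) > 0.5) == bool(t) for x, t in zip(X, y)) / len(y)
print(f"train accuracy: {accuracy:.2f}")
```

In practice the probe would be evaluated on held-out trajectories, and its score would become one input to the meta-classifier alongside the black-box and CoT-aware monitor scores.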

Inference Optimization

  • temperature = 0 for monitor inference
  • activation extraction with batching on 4× A100

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmark uses sandboxed environments and may not capture all production deception patterns.
  • Activation-probe requires open-weights models and GPU resources, unavailable for closed APIs.
  • Synthetic training does not fully cover structural deception; larger real-world negatives needed.
  • Monitors evaluated contemporaneously with agent models; co-evolution dynamics are untested.

When Not To Use

  • When agents are deployed via closed APIs and CoT is not exposed: AP-H needs open weights and CA-P needs CoT text, so only black-box monitoring remains.
  • For long-lived, multi-agent systems without session-level aggregation; the framework targets single-agent trajectories.
  • If you lack GPU resources and cannot self-host a probe but require the highest structural-deception sensitivity.

Failure Modes

  • Agents fully sanitize or fabricate CoT to evade CoT-aware prompts.
  • Monitors overfit to synthetic generation artifacts and miss novel real-world attacks.
  • Activation probe signal may degrade if encoder family diverges significantly from agents' generator family.
  • High false-positive load at low base deception rates without human triage.
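The false-positive concern in the last item is a base-rate effect, which a short calculation makes concrete. The operating point below (1% of trajectories deceptive, 80% true-positive rate, 5% false-positive rate) is a made-up example, not a figure from the paper.

```python
def alert_precision(base_rate, tpr, fpr):
    """Fraction of alerts that are true deceptions (Bayes' rule)."""
    true_alerts = base_rate * tpr
    false_alerts = (1.0 - base_rate) * fpr
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical operating point, for illustration only.
precision = alert_precision(base_rate=0.01, tpr=0.80, fpr=0.05)
print(f"share of alerts that are real: {precision:.1%}")
```

At this hypothetical operating point only about 14% of alerts are genuine, so roughly six of every seven flagged trajectories are false alarms that a human triage step has to absorb.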

Core Entities

Models

  • Gemini 2.5 Pro
  • Claude Sonnet 4.5
  • Claude Opus 4.5
  • GPT-5
  • GPT-5.1
  • Llama-3.1-70B-Instruct

Metrics

  • pAUROC (partial AUROC over FPR ≤ 0.20)
  • AUC
  • p-values and bootstrap CIs
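For readers unfamiliar with partial AUROC, the sketch below integrates the ROC curve only over FPR ≤ 0.20. The normalization (dividing by the FPR limit so a perfect monitor scores 1.0) is an assumption; this summary does not say which normalization the paper uses, and the function name is illustrative.

```python
def partial_auroc(labels, scores, max_fpr=0.20):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr.

    labels: 1 = deceptive, 0 = benign. scores: higher = more suspicious.
    Assumption: normalize by max_fpr, so a perfect monitor scores 1.0.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Build ROC points by sweeping the threshold from high to low,
    # grouping tied scores into a single step.
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j

    # Trapezoidal integration, clipping the last segment at max_fpr.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate TPR linearly at FPR = max_fpr.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr


perfect = partial_auroc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
uninformative = partial_auroc([1, 0], [0.5, 0.5])
print(perfect, uninformative)
```

Under this normalization an uninformative monitor scores max_fpr / 2 = 0.10 rather than 0.5, so values are not directly comparable with full AUROC. Other conventions exist (for example, scikit-learn's `roc_auc_score` with `max_fpr` returns a standardized partial AUC), so the reported numbers should be read against the paper's own definition.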

Datasets

  • DECEPTSYNTH
  • DeceptArena

Benchmarks

  • DeceptArena

Context Entities

Models

  • Claude Sonnet 4.5 (monitor backbone)
  • Llama-3.1-70B-Instruct (activation encoder)

Metrics

  • pAUROC
  • FPR@TPR thresholds

Datasets

  • DECEPTSYNTH training set (4,800+)
  • DeceptArena test set (9,200)

Benchmarks

  • Per-category pAUROC table (Table 3)