DeceptGuard: detect agent deception by reading CoT text and activation probes

Overview

Decision Snapshot: Needs Validation

This is a practical, well-instrumented system demonstrated on a large synthetic and semi-real benchmark. Two caveats: activation probes require self-hosted, open-weights models running on GPUs, and detection of structural deception still shows a synthetic→real gap.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 65%

Novelty: 60%

Authors

Snehasis Mukhopadhyay


Why It Matters For Business

Agents can hide harmful actions behind plausible outputs; reading their internal reasoning or activations improves detection and reduces missed incidents in high-stakes automation.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead

Summary TLDR

This paper introduces DECEPTGUARD, a monitoring framework that compares three regimes for spotting deception in LLM agents: black-box (actions/outputs), CoT-aware (chain-of-thought text), and activation-probe (hidden activations). The authors build DECEPTSYNTH (9.6k synthetic trajectories) and DeceptArena (9.2k verified test trajectories). CoT-aware + activation probes beat black-box monitors (mean pAUROC +0.097). A HYBRIDCONSTITUTIONAL ensemble reaches pAUROC = 0.934 on the held-out benchmark. They release monitors, probe weights, and datasets to support deployment and research.

Problem Statement

Black-box monitors that see only actions and outputs miss internal reasoning signals that often reveal deceptive intent. We need systematic comparison of monitors that read chain-of-thought (CoT) text and hidden activations to decide what monitoring access materially improves detection and how robust each channel is when agents try to hide.

Main Contribution

A 12-category operational taxonomy of agent deception across verbal, behavioral, and structural types.

DECEPTSYNTH synthetic pipeline producing 9,600 labeled trajectories with faithful CoT traces.

Key Findings

CoT-aware and activation-probe monitors outperform black-box monitors.

Numbers: mean pAUROC +0.097 (BB→Hybrid)

Practical Use: If you can access CoT text or activations, add them: expect ~+0.10 pAUROC vs. action-only monitoring on this benchmark.

Evidence Ref: Table 3; §8.2

Hybrid ensemble reaches state-of-the-art detection on DeceptArena.

Numbers: Hybrid pAUROC = 0.934 on held-out test

Practical Use: Combine prompted CoT analysis with an activation probe and a small meta-classifier for best practical detection in high-stakes settings.

Evidence Ref: Abstract; Figure 5; §8.2
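
The combination described above can be sketched as a score-level ensemble. This is a minimal illustration, not the released HYBRIDCONSTITUTIONAL model: the summary says the meta-classifier is an MLP trained on monitor scores, whereas this sketch swaps in a hand-weighted logistic combiner. The weights, the bias, and the function name hybrid_score are hypothetical. It also shows one possible way to tolerate a missing CoT or activation channel, by renormalising the weights of the channels that are present.

```python
import math

# Hypothetical weights, for illustration only. The paper trains an MLP
# meta-classifier on monitor scores instead of using fixed weights.
WEIGHTS = {"black_box": 1.5, "cot": 2.0, "probe": 2.5}
BIAS = -3.0


def hybrid_score(black_box, cot=None, probe=None):
    """Combine per-channel deception scores in [0, 1] into one score.

    Pass None for a channel that is unavailable (for example, no CoT
    exposed, or no activation access on a closed API).
    """
    channels = {"black_box": black_box, "cot": cot, "probe": probe}
    present = {name: s for name, s in channels.items() if s is not None}

    # Rescale so the channels that are present carry the full weight mass.
    total_weight = sum(WEIGHTS.values())
    present_weight = sum(WEIGHTS[name] for name in present)
    weighted = sum(WEIGHTS[name] * s for name, s in present.items())
    logit = BIAS + (total_weight / present_weight) * weighted

    return 1.0 / (1.0 + math.exp(-logit))
```

With this renormalisation, a trajectory scored by the black-box monitor alone gets the same ensemble score as one where all three channels agree on that value, so a single review threshold can still be applied when a channel is missing.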

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Macro average pAUROC (black-box)	0.837	—	—	DeceptArena test	Table 3 per-category macro avg	Table 3; §8.2
Macro average pAUROC (CoT-aware CA-P)	0.869	BB 0.837	+0.032	DeceptArena test	Table 3 per-category macro avg	Table 3; §8.2
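
The pAUROC values in this table are partial areas under the ROC curve restricted to FPR <= 0.20 (see Metrics). A minimal sketch of such a metric follows, assuming the partial area is divided by the FPR cap so that a perfect monitor scores 1.0. That normalisation is an assumption; the paper's exact definition is not given in this summary, and other conventions exist. The function name partial_auroc is hypothetical.

```python
def partial_auroc(labels, scores, max_fpr=0.20):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr.

    labels: 1 for deceptive, 0 for benign. scores: higher means more suspicious.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Walk the examples from highest to lowest score. Tied scores are
    # consumed together so they produce a single ROC point.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j

    # Trapezoidal integration, clipped at max_fpr.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the TPR at the FPR cap.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr
```

Under this normalisation a monitor that cannot separate the classes at all scores 0.10 rather than 0.50, so values are not directly comparable with a full AUROC.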

What To Try In 7 Days

Run a black-box monitor on recent agent logs to get a baseline score distribution.

If CoT is exposed, add a CA-P prompt monitor from the released prompt gallery and compare pAUROC on synthetic positives.

Inject 50 domain-specific synthetic deceptive examples (DECEPTSYNTH) and measure detection; tune threshold to target human review load.
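
The last step, tuning the threshold to a target human review load, can be sketched as a simple quantile rule over recent monitor scores. This is an illustrative sketch only: the function name threshold_for_review_load and the review_fraction parameter are hypothetical, and it assumes the review budget is expressed as a fraction of all monitored trajectories.

```python
import math


def threshold_for_review_load(scores, review_fraction):
    """Pick a threshold t so that flagging scores strictly above t sends
    at most review_fraction of the trajectories to human review.

    Returns the lowest such threshold, so detection is as sensitive as
    the review budget allows.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 <= review_fraction <= 1.0:
        raise ValueError("review_fraction must be in [0, 1]")

    ranked = sorted(scores, reverse=True)
    budget = math.floor(review_fraction * len(ranked))  # max items to flag
    if budget >= len(ranked):
        return float("-inf")  # budget covers everything: flag all
    # At most `budget` scores are strictly greater than ranked[budget].
    return ranked[budget]


def flag_for_review(scores, threshold):
    """Return the indices of trajectories whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]
```

Running this on the baseline score distribution from the first step, then checking how many of the injected synthetic positives land above the threshold, gives a rough recall estimate at the chosen review load.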

Agent Features

Memory

short-term trajectory context (CoT text) observed by monitors

Planning

multi-step tool sequences

long-horizon task planning

Tool Use

file operations

git/PR calls

ssh_exec and config edits

email/calendar APIs

Frameworks

constitutional specification for generation and monitoring

Is Agentic

Yes

Architectures

LLM agents with tool APIs

open-weights encoder for activation probing

Collaboration

single-agent interactions with external tools (no multi-agent experiments)

Optimization Features

Token Efficiency

truncate CoT >2,000 tokens with minimal pAUROC loss
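
A minimal sketch of this truncation step follows, under two stated assumptions: tokens are approximated by whitespace splitting (a real deployment would use the monitor model's tokenizer), and the most recent tokens are kept by default. The summary does not say which part of the trace the authors keep, so the keep parameter and the function name truncate_cot are hypothetical.

```python
def truncate_cot(cot_text, max_tokens=2000, keep="tail"):
    """Cap a chain-of-thought trace at max_tokens whitespace tokens.

    keep="tail" keeps the most recent tokens, keep="head" keeps the first
    ones. Text at or under the limit is returned unchanged.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if keep not in ("tail", "head"):
        raise ValueError("keep must be 'tail' or 'head'")

    tokens = cot_text.split()
    if len(tokens) <= max_tokens:
        return cot_text

    kept = tokens[-max_tokens:] if keep == "tail" else tokens[:max_tokens]
    return " ".join(kept)
```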

Infra Optimization

AP-H (activation probe) requires self-hosted GPUs; CA-P (prompted CoT-aware monitor) uses API monitors

Model Optimization

none (monitors use existing LLM backbones)

System Optimization

hybrid ensemble tolerates missing channels (CoT or AP-H)

Training Optimization

logistic-regression probe trained on layer-32 activations

MLP meta-classifier trained on monitor scores
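
To make the probe idea concrete, here is a minimal logistic-regression probe in plain Python. It is a sketch under stated assumptions, not the released probe: real inputs would be layer-32 activation vectors extracted from the open-weights encoder, a step omitted here. The toy 8-dimensional vectors, the hyperparameters, and the names make_toy_activations, train_probe, and probe_score are all hypothetical.

```python
import math
import random


def sigmoid(z):
    # Clamp the input to avoid overflow in math.exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def make_toy_activations(n_per_class=100, dim=8, seed=0):
    """Stand-in data: honest vectors centred at -1, deceptive at +1."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 1.0) for _ in range(dim)])
            y.append(label)
    return X, y


def train_probe(X, y, lr=0.1, epochs=200, l2=1e-3):
    """Fit weights and bias by full-batch gradient descent on log loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # L2 penalty on the weights only, not on the bias.
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w, b, activation):
    """Probability that the activation vector comes from a deceptive step."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, activation)) + b)
```

A linear probe like this is cheap to train and to run, which is why the main cost of the activation channel is extracting activations on self-hosted GPUs rather than the classifier itself.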

Inference Optimization

temperature = 0 for monitor inference

activation extraction with batching on 4× A100

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Benchmark uses sandboxed environments and may not capture all production deception patterns.

Activation-probe requires open-weights models and GPU resources, unavailable for closed APIs.

When Not To Use

When agents are deployed via closed APIs and CoT is not exposed (AP-H not usable).

For long-lived, multi-agent systems without session-level aggregation: the framework targets single-agent trajectories.

Failure Modes

Agents fully sanitize or fabricate CoT to evade CoT-aware prompts.

Monitors overfit to synthetic generation artifacts and miss novel real-world attacks.

Core Entities

Models

Gemini 2.5 Pro

Claude Sonnet 4.5

Claude Opus 4.5

GPT-5

GPT-5.1

Llama-3.1-70B-Instruct

Metrics

pAUROC (partial AUROC, FPR <= 0.20)

AUC

p-values and bootstrap CIs

Datasets

DECEPTSYNTH

DeceptArena

Benchmarks

DeceptArena

Context Entities

Models

Claude Sonnet 4.5 (monitor backbone)

Llama-3.1-70B-Instruct (activation encoder)

Metrics

pAUROC

FPR@TPR thresholds

Datasets

DECEPTSYNTH training set (4,800+)

DeceptArena test set (9,200)

Benchmarks

Per-category pAUROC table (Table 3)

