Trace which memory or tool input actually drove an LLM agent's action.

January 21, 2026 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.35

Citation Count

0

Authors

Chen Qian, Peng Wang, Dongrui Liu, Junyao Yang, Dadi Guo, Ling Tang, Jilin Mei, Qihan Ren, Shuai Shao, Yong Liu, Jie Fu, Jing Shao, Xia Hu

Why It Matters For Business

You can audit autonomous agents to see which past memory or tool output caused a decision. This is useful for compliance, debugging, and fixing business-rule violations, even when the agent never raised an explicit failure.

Summary TLDR

This paper presents a two-stage attribution framework that explains why an LLM-based agent produced a specific action. First, it replays the agent trajectory component by component and scores each component's marginal likelihood gain to find the high-impact ones (temporal likelihood dynamics). Then it ablates sentences inside those components (probability drop & hold) to surface the exact textual evidence. The method is evaluated on eight curated agent trajectories covering memory- and tool-driven scenarios, using Llama-3.1-70B-Instruct, and is compared against a simple leave-one-out baseline and linear baselines. Code is released.
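The component-level score described above can be written compactly. The notation below is an assumption made for this summary, not taken from the paper: c_1, ..., c_T are the trajectory components in order, a is the action being explained, and p_θ is the agent's language model.

```latex
% Marginal log-likelihood gain from revealing component c_t
% (c_{1:0} denotes the empty context, before any component is revealed).
\[
\Delta_t \;=\; \log p_\theta(a \mid c_{1:t}) \;-\; \log p_\theta(a \mid c_{1:t-1}),
\qquad t = 1, \dots, T
\]
% The gains telescope, so together they account for the full change
% in the action's log-likelihood across the trajectory.
\[
\sum_{t=1}^{T} \Delta_t \;=\; \log p_\theta(a \mid c_{1:T}) \;-\; \log p_\theta(a \mid c_{1:0})
\]
```

Under this reading, a component with a large positive gain is one whose arrival made the action much more likely, which is what marks it as a candidate for the sentence-level stage.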

Problem Statement

Existing attribution work focuses on locating explicit failures, but many undesirable agent actions occur without an explicit error signal (for example, a reasonable-looking refund, or a privacy leak caused by a retrieved email). What is needed is a method that explains which past memory entries, tool returns, or sentences actually drove a chosen action.

Main Contribution

A hierarchical agentic attribution framework: component-level temporal replay + sentence-level perturbation.

Component-level method: score marginal likelihood gains when incrementally revealing trajectory components.

Sentence-level method: combine probability drop (necessary) and probability hold (sufficient) via ablation.

Empirical evaluation on eight curated agent trajectories (memory- and tool-driven cases) using Llama-3.1-70B-Instruct.

Code release: https://github.com/AI45Lab/AgentDoG
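As a rough illustration of the component-level step listed above, the sketch below reveals trajectory components one at a time and records the change in the action's log-likelihood. It is not the released implementation: `toy_log_prob` is a made-up stand-in for a real model call, and `component_gains`, the example trajectory, and the action string are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer signature: log p(action | visible components).
LogProbFn = Callable[[Sequence[str], str], float]


def toy_log_prob(context: Sequence[str], action: str) -> float:
    """Toy stand-in for a real model call (assumption, illustration only).

    The probability rises with the share of action words seen in the context.
    """
    action_words = set(action.lower().split())
    seen_words = set(" ".join(context).lower().split())
    overlap = len(action_words & seen_words) / max(len(action_words), 1)
    return math.log(0.05 + 0.9 * overlap)  # probability between 0.05 and 0.95


def component_gains(
    components: Sequence[str], action: str, log_prob: LogProbFn
) -> List[Tuple[int, float]]:
    """Reveal components in order and record each marginal log-likelihood gain."""
    gains = []
    previous = log_prob([], action)  # score with nothing revealed yet
    for t in range(1, len(components) + 1):
        current = log_prob(components[:t], action)
        gains.append((t - 1, current - previous))
        previous = current
    return gains


# Invented example trajectory: a user turn, a memory entry, a tool return.
trajectory = [
    "user: please handle ticket 42",
    "memory: last time we chose to issue refund",
    "tool: order status is delivered",
]
action = "issue refund"

gains = component_gains(trajectory, action, toy_log_prob)
top_index, top_gain = max(gains, key=lambda pair: pair[1])
print(f"highest-impact component: {top_index} (gain {top_gain:.3f})")
```

With a real model, `toy_log_prob` would be replaced by a call that returns the log-probability of the action tokens given the partial trajectory, which requires access to token log-probs.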

Key Findings

Prob. Drop&Hold hits the human-labelled top sentence 93.75% of the time (Hit@1).

Numbers: Hit@1 = 0.9375 (Table 1)

Multiple sentence-level attribution methods work well inside the framework; leave-one-out and ContextCite reach 81.25% Hit@1.

Numbers: LOO, ContextCite Hit@1 = 0.8125 (Table 1)

The framework pinpoints diverse driver types: memory reuse, prompt injections, early spurious tool signals, and hallucination from user prompts.

Results

Hit@1 (Prob. Drop&Hold)

Value: 0.9375

Baseline: LOO 0.8125

Hit@3 (Prob. Drop&Hold)

Value: 1.0

Baseline: LOO 0.9375
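For readers unfamiliar with the metric, Hit@k counts a case as a hit when any human-labelled sentence appears among the method's top k ranked sentences, then averages over cases. The sketch below shows that computation on invented toy rankings; the sentence ids and labels are hypothetical and are not the paper's data.

```python
from typing import List, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> bool:
    """True if any gold sentence id appears in the top-k of the ranking."""
    return any(item in gold for item in ranked[:k])


def mean_hit_at_k(cases: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Fraction of cases in which the top-k ranking contains a gold sentence."""
    hits = [hit_at_k(ranked, gold, k) for ranked, gold in cases]
    return sum(hits) / len(hits)


# Toy cases (made up): each pairs a ranked list of sentence ids with gold ids.
toy_cases = [
    (["s2", "s0", "s1"], {"s2"}),
    (["s1", "s0", "s2"], {"s0"}),
    (["s0", "s1", "s2"], {"s2"}),
    (["s0", "s2", "s1"], {"s0"}),
]

for k in (1, 2, 3):
    print(f"Hit@{k} = {mean_hit_at_k(toy_cases, k)}")
```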

Who Should Care

What To Try In 7 Days

Run the component-level replay on a few real agent traces to surface high-impact steps.

Apply the probability drop&hold ablation to top components to surface the exact sentence that drove an action.

Use findings to add simple guards: ignore untrusted tool text, downweight single-case memories, or require explicit evidence before high-risk actions.
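The second step above (probability drop & hold) can be prototyped along the following lines. This is a simplified sketch under stated assumptions: `toy_prob` is a made-up stand-in for the model's probability of the action, "hold" is approximated by keeping only one sentence, and summing drop and hold is an assumed combination rule, since the paper's exact rule is not reproduced here.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer signature: p(action | sentences kept in the component).
ProbFn = Callable[[Sequence[str], str], float]


def toy_prob(sentences: Sequence[str], action: str) -> float:
    """Toy stand-in for a model call (assumption): word overlap mapped to a probability."""
    action_words = set(action.lower().split())
    seen_words = set(" ".join(sentences).lower().split())
    overlap = len(action_words & seen_words) / max(len(action_words), 1)
    return 0.05 + 0.9 * overlap


def drop_and_hold(
    sentences: Sequence[str], action: str, prob: ProbFn
) -> List[Dict[str, float]]:
    """Score each sentence by necessity (drop) and sufficiency (hold)."""
    sentences = list(sentences)
    full = prob(sentences, action)
    scores = []
    for i, sentence in enumerate(sentences):
        without = prob(sentences[:i] + sentences[i + 1:], action)
        only = prob([sentence], action)
        drop = full - without  # how much probability is lost when removed
        hold = only            # how much probability survives with it alone
        scores.append({"index": i, "drop": drop, "hold": hold, "score": drop + hold})
    # Assumed combination rule: rank by the plain sum of drop and hold.
    return sorted(scores, key=lambda row: row["score"], reverse=True)


# Invented sentences from one high-impact component.
component_sentences = [
    "Order 42 was delivered on Monday.",
    "Policy note: always issue a refund for delayed orders.",
    "Customer tone was polite.",
]
ranking = drop_and_hold(component_sentences, "issue refund", toy_prob)
print("top sentence:", component_sentences[int(ranking[0]["index"])])
```

Each sentence costs two extra model evaluations here, so running the ablation only inside components that the replay step flagged keeps the number of calls manageable on long traces.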

Agent Features

Memory

  • retrieval memory
  • long-term memory updates

Planning

  • incremental replay of trajectory (temporal likelihood dynamics)

Tool Use

  • treat tool outputs as observations
  • expose tool returns (email, web search, file reader) for attribution

Frameworks

  • smolagents

Is Agentic

true

Architectures

  • LLM-based agent (single-agent)

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation uses eight curated cases and a single model (Llama-3.1-70B-Instruct), limiting generality.
  • Sentence-level ablation requires model likelihood access and can be costly on long contexts.
  • Human ground truth uses intersection of five annotators, which is conservative and may miss valid alternative evidence.
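To see why intersecting annotator labels is conservative, consider the toy example below. The annotation sets are invented for illustration, and the majority-vote alternative is shown only for contrast; it is not something the paper is stated to use.

```python
from collections import Counter

# Hypothetical labels: each set holds the sentence indices one annotator
# marked as driving the action (five annotators, made-up data).
annotations = [{1, 3}, {1, 2, 3}, {1, 3, 4}, {1}, {1, 3}]

# Intersection keeps only sentences that every annotator selected.
strict_gold = set.intersection(*annotations)

# For contrast: sentences selected by more than half of the annotators.
votes = Counter(idx for picked in annotations for idx in picked)
majority_gold = {idx for idx, n in votes.items() if n > len(annotations) / 2}

print("intersection:", sorted(strict_gold))
print("majority vote:", sorted(majority_gold))
```

In this example, sentence 3 was chosen by four of five annotators but is excluded from the intersection, so a method that ranks it first would be scored as a miss.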

When Not To Use

  • When you cannot compute or compare model likelihoods for ablated inputs (closed APIs without log-probs).
  • As a fully automated monitor at large scale without further work to automate interpretation.
  • If you need end-to-end causal proof rather than plausible evidence localization.

Failure Modes

  • When multiple components share influence, attribution may split credit among them ambiguously.
  • Agent self-contradiction or latent internal state can lead to misleading likelihood signals.
  • Gradient-based saliency runs out of memory (OOM) on long traces (noted for Saliency Score).

Core Entities

Models

  • Llama-3.1-70B-Instruct

Metrics

  • Hit@k (Hit@1/3/5)
  • log-likelihood gain

Datasets

  • custom 8-case agent trajectories

Benchmarks

  • GAIA (one complex retrieval case used)

Context Entities

Models

  • none other explicitly evaluated

Metrics

  • probability drop
  • probability hold

Datasets

  • GAIA (cited)
  • human annotation intersection (5 annotators) for ground truth