AttnLRP: a faithful, efficient LRP variant that attributes attention and latent neurons in transformers

Overview

Decision Snapshot: Ready For Pilot

The method is well validated on multiple models and datasets; it is ready for real experiments but needs memory management and γ tuning for large ViTs.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Reduan Achtibat, Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Aakriti Jain, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek


Why It Matters For Business

AttnLRP gives faster, more faithful explanations of transformer decisions, lowering debugging cost and energy use compared to perturbation-based methods. It also exposes neurons you can target to reduce hallucinations or bias.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

AttnLRP extends Layer-wise Relevance Propagation (LRP) to handle transformer-specific functions (softmax, matrix multiplication, normalization). It produces more faithful input and latent (neuron-level) attributions than prior methods while keeping computation comparable to a single backward pass. Evaluations on ViTs and LLMs (LLaMa 2, Mixtral, Flan‑T5, Phi) show consistent faithfulness gains. The method also enables finding and manipulating ‘knowledge neurons’ to change model outputs.

Problem Statement

Transformers use nonlinear attention, matrix multiplications and normalization that break standard attribution rules. Existing attention-only and simple backpropagation methods are unfaithful, noisy, numerically unstable, or too expensive for layer-wise and latent attributions. What is needed is a single-pass, numerically stable method that attributes both inputs and hidden neurons in large transformers.

Main Contribution

New LRP rules for transformer operations: the paper derives faithful, efficient propagation rules for softmax, bilinear matrix multiplication and normalization, tailored to attention.

Latent-neuron attribution and interaction: AttnLRP yields per-neuron relevances and, combined with activation maximization, enables identifying and manipulating neurons that shift model outputs.
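To make the idea of operation-specific propagation rules more concrete, here is a minimal sketch in plain Python. It applies a first-order (gradient times input) decomposition to softmax and an equal split of relevance between the two operands of a matrix product. The function names, the sign-aware stabilizer `eps`, and the choice to drop the softmax bias term are assumptions made for illustration; whether they match the paper's exact rules should be checked against the paper and its repository.

```python
import math

def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]

def softmax_relevance(x, r_out):
    """Gradient-times-input relevance for s = softmax(x).

    Using ds_j/dx_i = s_j * (delta_ij - s_i), this reduces to
    R_i = x_i * (R_out_i - s_i * sum_j R_out_j). The bias term is dropped.
    """
    s = softmax(x)
    total = sum(r_out)
    return [xi * (ri - si * total) for xi, ri, si in zip(x, r_out, s)]

def matmul_relevance(a, v, r_out, eps=1e-9):
    """Relevance for O = A @ V, split equally between A and V.

    Each product term a[j][i] * v[i][p] receives its proportional share
    of r_out[j][p]; the factor 2 gives half to each operand.
    """
    n, k, m = len(a), len(v), len(v[0])
    o = [[sum(a[j][i] * v[i][p] for i in range(k)) for p in range(m)]
         for j in range(n)]
    r_a = [[0.0] * k for _ in range(n)]
    r_v = [[0.0] * m for _ in range(k)]
    for j in range(n):
        for p in range(m):
            # Assumed stabilizer: push the denominator away from zero.
            denom = 2.0 * o[j][p] + (eps if o[j][p] >= 0 else -eps)
            for i in range(k):
                share = a[j][i] * v[i][p] * r_out[j][p] / denom
                r_a[j][i] += share
                r_v[i][p] += share
    return r_a, r_v
```

In this sketch the matrix-product rule is conservative: the relevance assigned to `A` and to `V` each sums to roughly half of the incoming relevance. The softmax rule, by contrast, does not conserve relevance in general.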

Key Findings

AttnLRP yields higher faithfulness than prior LRP variants on next‑token/classification perturbation tests.

Numbers: Wikipedia perturbation area: AttnLRP 10.93 vs CP‑LRP 7.85 (∆=+3.08)

Practical Use: Use AttnLRP when you need more faithful input attributions for transformer language tasks; it meaningfully outperforms conservative LRP on evaluated benchmarks.

Evidence Ref: Table 1 (Wikipedia next‑word), Section 4.1

AttnLRP improves top‑1 token identification accuracy in QA on Mixtral 8x7b from 0.50 (CP‑LRP) to 0.96.

Numbers: Mixtral SQuADv2 top‑1 accuracy: AttnLRP 0.96 vs CP‑LRP 0.50

Practical Use: For question‑answering models with routing/expert layers, AttnLRP gives much clearer token-level explanations; prefer it for debugging or auditing QA outputs.

Evidence Ref: Table 1 (SQuADv2), Section 4.1.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Faithfulness (perturbation area)	10.93 (AttnLRP, LLaMa2 Wikipedia)	7.85 (CP-LRP)	+3.08	Wikipedia next-word	Table 1, Section 4.1	—
Top-1 accuracy	0.96 (AttnLRP, Mixtral 8x7b)	0.50 (CP-LRP)	+0.46	SQuADv2	Table 1, Section 4.1.2	—

What To Try In 7 Days

Install AttnLRP from the paper's GitHub and run it on a small model and dataset to compare heatmaps with existing methods.

Run the perturbation faithfulness test (MoRF/LeRF area) to validate explanations on your task.

Use activation‑max samples + AttnLRP to find top neurons for a concept and try small neuron ablations to observe output shifts.
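The neuron-ablation step can be rehearsed on a toy network before touching a real model. The sketch below is not the AttnLRP library API: it uses a hypothetical two-layer network with a bias-free linear readout, where the relevance of a hidden neuron is simply its activation times its outgoing weight. All function names and weights are made up for illustration.

```python
def relu(z):
    return max(0.0, z)

def hidden(x, w1):
    # One ReLU neuron per row of w1.
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in w1]

def output(h, w2, mask=None):
    # Linear readout; mask entries of 0.0 switch a neuron off.
    mask = mask or [1.0] * len(h)
    return sum(w * a * m for w, a, m in zip(w2, h, mask))

def neuron_relevance(h, w2):
    # For a bias-free linear readout, a neuron's relevance is its
    # contribution to the output: activation times outgoing weight.
    return [a * w for a, w in zip(h, w2)]

def ablate_top(h, w2, k=1):
    # Zero out the k most relevant neurons and recompute the output.
    rel = neuron_relevance(h, w2)
    top = sorted(range(len(h)), key=lambda i: rel[i], reverse=True)[:k]
    mask = [0.0 if i in top else 1.0 for i in range(len(h))]
    return top, output(h, w2, mask)

x = [1.0, 2.0]
w1 = [[0.5, 0.5], [1.0, -1.0], [0.2, 0.9]]
w2 = [1.0, 2.0, -0.5]
h = hidden(x, w1)
top, ablated = ablate_top(h, w2)
print("top neuron:", top, "output shift:", output(h, w2) - ablated)
```

In this toy case the output shift equals the ablated neuron's relevance exactly. In a real transformer the relationship is only approximate, so comparing the predicted and observed shifts is a useful sanity check on the attributions.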

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/rachtibat/LRP-eXplains-Transformers

Risks & Boundaries

Limitations

ViTs require tuning of the γ hyperparameter to reduce noisy attributions.

Large models need checkpointing and substantial GPU memory; AttnLRP can exceed single‑node memory for very large contexts.

When Not To Use

On tiny edge devices where memory and compute cannot support checkpointed backward passes.

If you only need a cheap, approximate attention‑only heatmap (attention rollout may suffice).

Failure Modes

Changing the bias handling (for example, distributing the bias or using the identity rule) can cause numerical instabilities and exploding relevances.

Mis-tuned γ leads to under- or over-smoothing of attributions in vision models.
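To build intuition for what γ controls, the following sketch implements the standard LRP-γ rule for a single linear layer with non-negative inputs. It is a generic textbook form, not the paper's ViT configuration: which layers receive the rule and how negative activations are handled there are not stated in this summary, and the function name and `eps` stabilizer are assumptions.

```python
def lrp_gamma(x, w, r_out, gamma=0.0, eps=1e-9):
    """LRP-gamma for z_j = sum_i x[i] * w[i][j], assuming x[i] >= 0.

    Positive weights are boosted by (1 + gamma), so a larger gamma
    favors positive contributions and damps negative ones.
    """
    n_in, n_out = len(x), len(r_out)
    r_in = [0.0] * n_in
    for j in range(n_out):
        contrib = [x[i] * (w[i][j] + gamma * max(w[i][j], 0.0))
                   for i in range(n_in)]
        z = sum(contrib)
        z += eps if z >= 0 else -eps  # keep the denominator away from zero
        for i in range(n_in):
            r_in[i] += contrib[i] / z * r_out[j]
    return r_in

x = [1.0, 1.0]
w = [[1.0], [-0.5]]  # w[i][j]: weight from input i to output j
for gamma in (0.0, 1.0, 10.0):
    print(gamma, lrp_gamma(x, w, [1.0], gamma))
```

With γ = 0 the two inputs receive relevances of about 2 and -1, which cancel to the output relevance of 1. As γ grows, the negative share shrinks toward 0 and the positive share toward 1. Too small a γ leaves large cancelling (noisy) attributions, while too large a γ suppresses negative evidence altogether.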

Core Entities

Models

LLaMa 2-7b, Mixtral 8x7b, Flan-T5-XL, ViT-B-16, ViT-L-16, ViT-L-32, Phi-1.5

Metrics

Faithfulness area (A between curves), Accuracy, Intersection over Union (IoU)

Datasets

ImageNet, IMDB, Wikipedia, SQuADv2, Wikipedia summary dataset

Benchmarks

Perturbation faithfulness (area between MoRF/LeRF curves)
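A minimal sketch of how such a perturbation test can be computed, assuming a hypothetical `model` callable that maps a feature list to a scalar score. Features are replaced by a baseline value in most-relevant-first (MoRF) and least-relevant-first (LeRF) order, and the mean gap between the two curves is reported. The averaging over steps and the zero baseline are assumptions; the paper's exact normalization may differ.

```python
def perturbation_curve(model, x, order, baseline=0.0):
    # Model score after removing features one by one in the given order.
    x = list(x)
    curve = [model(x)]
    for i in order:
        x[i] = baseline
        curve.append(model(x))
    return curve

def faithfulness_area(model, x, relevance, baseline=0.0):
    # MoRF removes the most relevant feature first, LeRF the least.
    morf_order = sorted(range(len(x)), key=lambda i: relevance[i], reverse=True)
    lerf_order = morf_order[::-1]
    morf = perturbation_curve(model, x, morf_order, baseline)
    lerf = perturbation_curve(model, x, lerf_order, baseline)
    # Mean gap between the curves; larger means more faithful.
    return sum(l - m for l, m in zip(lerf, morf)) / len(morf)

# Toy linear model, for which weight * input is an exact attribution.
weights = [3.0, 1.0, -2.0]
model = lambda x: sum(w * xi for w, xi in zip(weights, x))
x = [1.0, 1.0, 1.0]
good = [w * xi for w, xi in zip(weights, x)]
bad = [-r for r in good]  # deliberately inverted ranking
print(faithfulness_area(model, x, good), faithfulness_area(model, x, bad))
```

A faithful attribution makes the MoRF curve drop quickly while the LeRF curve stays high, which yields a large positive area. An inverted ranking flips the sign, so the metric separates good from bad explanations on this toy model.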

