Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
9
Why It Matters For Business
AttnLRP gives faster, more faithful explanations of transformer decisions, lowering debugging cost and energy use compared to perturbation-based methods; it also exposes neurons you can target to reduce hallucinations or bias.
Summary TLDR
AttnLRP extends Layer-wise Relevance Propagation (LRP) to handle transformer-specific functions (softmax, matrix multiplication, normalization). It produces more faithful input and latent (neuron-level) attributions than prior methods while keeping computation comparable to a single backward pass. Evaluations on ViTs and LLMs (LLaMa 2, Mixtral, Flan‑T5, Phi) show consistent faithfulness gains. The method also enables finding and manipulating ‘knowledge neurons’ to change model outputs.
Problem Statement
Transformers use nonlinear attention, matrix multiplications and normalization that break standard attribution rules. Existing attention-only or simple backpropagation methods are unfaithful, noisy, numerically unstable, or too expensive for layer-wise and latent attributions. We need a single-pass, numerically stable method that attributes both inputs and hidden neurons in large transformers.
Main Contribution
New LRP rules for transformer operations: derived faithful, efficient propagation rules for softmax, bilinear matrix multiplication and normalization tailored to attention.
Latent-neuron attribution and interaction: AttnLRP yields per-neuron relevances and, combined with activation maximization, enables identifying and manipulating neurons that shift model outputs.
Open-source implementation: a ready-to-use library and practical guidance (including a γ-rule for denoising ViTs) to run AttnLRP on LLMs and ViTs.
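To make the idea of a propagation rule for softmax concrete, here is a minimal pure-Python sketch of a gradient-times-input (first-order Taylor) rule applied to softmax. It is an illustration under assumptions, not the paper's code or the library's API: the function names are hypothetical, and the exact rule used by AttnLRP should be checked against the paper. The second function recomputes the same quantity from the explicit softmax Jacobian as a cross-check.

```python
import math

def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_relevance(x, relevance_out):
    """Gradient-times-input relevance through softmax (closed form).

    R_in[i] = x[i] * sum_j (ds[j]/dx[i]) * R_out[j] / s[j]
            = x[i] * (R_out[i] - s[i] * sum_j R_out[j])
    """
    s = softmax(x)
    total_out = sum(relevance_out)
    return [xi * (ri - si * total_out)
            for xi, ri, si in zip(x, relevance_out, s)]

def softmax_relevance_via_jacobian(x, relevance_out):
    """Same quantity from the explicit Jacobian ds[j]/dx[i] = s[j] * (delta_ij - s[i])."""
    s = softmax(x)
    n = len(x)
    result = []
    for i in range(n):
        acc = 0.0
        for j in range(n):
            jac = s[j] * ((1.0 if i == j else 0.0) - s[i])
            acc += jac * relevance_out[j] / s[j]
        result.append(x[i] * acc)
    return result
```

The closed form needs only one pass over the vector, which is the kind of saving that keeps the overall cost close to a single backward pass.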
Key Findings
AttnLRP yields higher faithfulness than prior LRP variants on next‑token/classification perturbation tests.
AttnLRP improves top‑1 token identification accuracy in QA on Mixtral 8x7b from 0.50 (CP‑LRP) to 0.96.
AttnLRP runs with single backward‑pass efficiency and much lower forward‑pass cost than linear perturbation methods.
AttnLRP makes neuron interventions practical: activating/deactivating identified neurons changes generated tokens.
Vision transformers show noisy gradients; applying the γ‑rule improves faithfulness.
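The γ‑rule mentioned above comes from the wider LRP literature: positive weights are boosted by a factor of (1 + γ) before relevance is redistributed, which damps noisy contributions of mixed sign. A minimal sketch for a single linear layer follows, assuming non-negative inputs and no bias term. The function name is illustrative, this is not the library's API, and the variant the paper applies to ViTs may differ in detail.

```python
def lrp_gamma_linear(x, weights, relevance_out, gamma=0.25, eps=1e-9):
    """Redistribute relevance through y = x @ W with the basic gamma-rule.

    weights[i][j] connects input i to output j.
    Assumes x is non-negative; gamma = 0 reduces to the epsilon-rule.
    """
    n_in, n_out = len(x), len(relevance_out)
    # Boost only the positive weights: w + gamma * max(w, 0).
    w_mod = [[w + gamma * max(w, 0.0) for w in row] for row in weights]
    # Modified pre-activations, used as the normaliser per output neuron.
    z = [sum(x[i] * w_mod[i][j] for i in range(n_in)) for j in range(n_out)]
    # Stabilise the division so tiny denominators do not blow up.
    s = [relevance_out[j] / (z[j] + (eps if z[j] >= 0 else -eps))
         for j in range(n_out)]
    # Each input receives its share of every output's relevance.
    return [x[i] * sum(w_mod[i][j] * s[j] for j in range(n_out))
            for i in range(n_in)]
```

Up to the small stabiliser, the returned relevances sum to the relevance that came in, so γ changes how relevance is shared between inputs rather than how much there is in total.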
Results
Faithfulness (perturbation area)
Accuracy
Who Should Care
What To Try In 7 Days
Install AttnLRP from the paper's GitHub and run it on a small model and dataset to compare heatmaps with existing methods.
Run the perturbation faithfulness test (MoRF/LeRF area) to validate explanations on your task.
Use activation‑max samples + AttnLRP to find top neurons for a concept and try small neuron ablations to observe output shifts.
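The perturbation faithfulness test from the second step can be sketched in a few lines. The version below is a toy under stated assumptions: `toy_model` is a hypothetical linear scorer standing in for a real network, features are "removed" by setting them to a baseline of 0, and the score is the mean gap between the LeRF (least relevant first) and MoRF (most relevant first) curves. The paper's exact protocol, such as which output is tracked and how tokens or patches are masked, may differ.

```python
def perturbation_curve(model, x, order, baseline=0.0):
    """Model output as features are replaced by the baseline, one at a time."""
    current = list(x)
    outputs = [model(current)]
    for idx in order:
        current[idx] = baseline
        outputs.append(model(current))
    return outputs

def faithfulness_area(model, x, relevance, baseline=0.0):
    """Mean gap between the LeRF and MoRF curves (larger means more faithful)."""
    most_first = sorted(range(len(x)), key=lambda i: relevance[i], reverse=True)
    least_first = most_first[::-1]
    morf = perturbation_curve(model, x, most_first, baseline)
    lerf = perturbation_curve(model, x, least_first, baseline)
    return sum(l - m for l, m in zip(lerf, morf)) / len(morf)

# Hypothetical stand-in for a real model: a fixed linear scorer.
weights = [2.0, -1.0, 0.5, 3.0]

def toy_model(features):
    return sum(w * f for w, f in zip(weights, features))

x = [1.0, 1.0, 1.0, 1.0]
good_relevance = [w * f for w, f in zip(weights, x)]  # exact for a linear model
bad_relevance = [-r for r in good_relevance]          # deliberately reversed ranking

area_good = faithfulness_area(toy_model, x, good_relevance)
area_bad = faithfulness_area(toy_model, x, bad_relevance)
print(area_good, area_bad)
```

A faithful attribution gives a positive area, because removing its top-ranked features drops the output fastest; the reversed ranking gives a negative one. On a real task, swap in your model's logit for `toy_model` and the heatmaps from each method for `relevance`.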
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- ViTs require tuning of the γ hyperparameter to reduce noisy attributions.
- Large models need checkpointing and substantial GPU memory; AttnLRP can exceed single‑node memory for very large contexts.
- Softmax saturation (very low temperature) can stop gradient-based relevance flow; classification softmax attribution was bypassed in experiments.
When Not To Use
- On tiny edge devices where memory and compute cannot support checkpointed backward passes.
- If you only need a cheap, approximate attention‑only heatmap (attention rollout may suffice).
- When you cannot tune γ and the model is a noisy ViT — results may be poor without tuning.
Failure Modes
- Numerical instability if bias handling is changed (for example, distributing the bias or applying the identity rule), which can make relevances explode.
- Mis-tuned γ leads to under- or over-smoothing of attributions in vision models.
- Softmax at classification outputs can absorb relevance when gradients vanish, distorting attributions.
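The vanishing-gradient failure mode is easy to reproduce on a toy example. The sketch below, with illustrative function names and made-up logits, computes the analytic gradient of the winning class probability with respect to the logits and shows how it shrinks as the softmax temperature drops. Any gradient-based relevance that starts from the saturated probability inherits this near-zero signal, which is consistent with the note above that the classification softmax was bypassed in experiments.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def grad_of_winner_prob(logits, temperature=1.0):
    """d p[k] / d logits[i] for k = argmax, from the analytic softmax Jacobian."""
    p = softmax(logits, temperature)
    k = max(range(len(p)), key=lambda i: p[i])
    return [p[k] * ((1.0 if i == k else 0.0) - p[i]) / temperature
            for i in range(len(p))]

logits = [2.0, 1.0, 0.5]  # made-up example values
for t in (1.0, 0.1, 0.01):
    grads = grad_of_winner_prob(logits, t)
    # Largest gradient magnitude shrinks towards zero as the softmax saturates.
    print(t, max(abs(g) for g in grads))
```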
Core Entities
Models
- LLaMa 2-7b
- Mixtral 8x7b
- Flan-T5-XL
- ViT-B-16
- ViT-L-16
- ViT-L-32
- Phi-1.5
Metrics
- Faithfulness area (area between the MoRF and LeRF perturbation curves)
- Accuracy
- Intersection over Union (IoU)
Datasets
- ImageNet
- IMDB
- Wikipedia
- SQuADv2
- Wikipedia summary dataset
Benchmarks
- Perturbation faithfulness (area between MoRF/LeRF curves)

