ARC-JSD: a fast, training-free JSD method to find which retrieved sentences make a RAG answer

Overview

Decision Snapshot: Needs Validation

ARC-JSD is a practical inference-only tool with solid evidence on standard RAG QA benchmarks; strengths are compute savings and mechanistic consistency, while limits include sentence-level granularity and dependency on models exposing probabilities.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ARC-JSD gives a cheap, plug-in way to show which retrieved sentences actually caused an LLM answer, at lower compute cost than surrogate or gradient-based attribution, and its findings can be used to reduce hallucinations. That is useful for product trust, compliance, and debugging.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist

Summary TLDR

The paper introduces ARC-JSD, a lightweight inference-time method that ranks retrieved sentences by how much removing each sentence changes the model's output distribution, measured with Jensen-Shannon divergence (JSD). ARC-JSD needs only forward passes (no fine-tuning, gradients, or surrogate models). It yields an average improvement of ≈10.7% in top-1 sentence attribution accuracy over prior training-free baselines on TyDi QA, Hotpot QA and MuSiQue, and cuts compute cost by up to 3× relative to surrogate/gradient methods. The method also locates the attention heads and MLP layers that matter most for attribution and uses them to reduce hallucination (~39% drop) without harming factual F1.
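The leave-one-sentence-out ranking described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `next_token_probs` is a hypothetical callback standing in for a forward pass that returns the model's probability distribution at each response position, `toy_model` is a made-up stand-in for an LLM, and summing JSD over response positions is an assumption about how per-token divergences are aggregated.

```python
import numpy as np


def jsd_bits(p, q):
    """Jensen-Shannon divergence between two discrete distributions, in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_sentences(sentences, question, next_token_probs):
    """Rank context sentences by how much removing each one shifts the output.

    next_token_probs(context, question) is a HYPOTHETICAL callback that
    returns an array of shape (response_length, vocab_size).
    """
    full = next_token_probs(sentences, question)  # one pass with full context
    scores = []
    for i in range(len(sentences)):
        ablated = sentences[:i] + sentences[i + 1:]  # drop sentence i
        without = next_token_probs(ablated, question)
        # ASSUMPTION: aggregate by summing JSD over response positions.
        scores.append(sum(jsd_bits(p, q) for p, q in zip(full, without)))
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return order, scores


def toy_model(context, question):
    """Made-up stand-in for an LLM: one response token, three-word vocabulary."""
    if any("Paris" in s for s in context):
        return np.array([[0.90, 0.05, 0.05]])
    return np.array([[0.34, 0.33, 0.33]])


sentences = [
    "The Seine is a river.",
    "Paris is the capital of France.",
    "Cheese is popular.",
]
order, scores = rank_sentences(sentences, "What is the capital of France?", toy_model)
print(order, scores)
```

In this toy setup the sentence mentioning Paris is ranked first, because removing it is the only ablation that changes the stand-in model's output distribution; the other two sentences score zero.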

Problem Statement

In Retrieval-Augmented Generation (RAG), it's hard and costly to verify which retrieved sentences actually caused a model's answer. Existing approaches need heavy fine-tuning, many forward passes, gradient computations, or human labels. We need a fast, training-free way to attribute responses to specific context sentences and to inspect which internal components use them.

Main Contribution

ARC-JSD: an inference-only, Jensen-Shannon-divergence method to rank context sentences by their causal effect on the output distribution.

Empirical demonstration that ARC-JSD improves top-1 context attribution accuracy by ~10.7% on standard RAG QA benchmarks while reducing compute up to 3× versus prior baselines.

Key Findings

ARC-JSD improves top-1 sentence attribution accuracy versus prior training-free baselines.

Numbers: ≈10.7% average accuracy gain (MuSiQue summary; §4.2, Fig.2)

Practical Use: Use ARC-JSD to more reliably pick the single sentence that grounded an answer, improving auditability without extra training.

Evidence Ref: Abstract; §4.2; Fig.2

ARC-JSD reduces inference compute relative to surrogate/gradient baselines.

Numbers: Up to 3× speedup vs ContextCite/surrogate baselines (§4.2; Appendix H)

Practical Use: Run ARC-JSD in production to get attribution with far lower GPU costs than methods requiring hundreds of forward passes or fine-tuning.

Evidence Ref: Table1; §4.2; Appendix H; Fig.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	≈+10.7% vs training-free baselines	ALTI-Logit/MIRAGE/ContextCite	+10.7%	Aggregate over TyDi QA, Hotpot QA, MuSiQue	Fig.2; §4.2	Fig.2; §4.2
Compute cost	Up to 3× faster	ContextCite and gradient-based baselines	≤1/3 GFLOPs per sample	MuSiQue and others (compute-accuracy trade-off)	Table1; Fig.2; Appendix H	Table1; Fig.2

What To Try In 7 Days

Run ARC-JSD on a sample of production RAG queries to flag low-evidence answers (sentence-JSD < 0.02 bits).

Compare ARC-JSD top-1 sentence vs your current citation heuristic to measure attribution gaps.

Use ARC-JSD to find top attention/MLP components and test gating them to reduce hallucinations safely.
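The first two checks in the list above can be prototyped on logged scores before touching any model code. The sketch below is illustrative only: the record layout (`scores`, `heuristic_top1`) and the `summarize` helper are hypothetical, the example numbers are made up, and flagging an answer when its highest sentence score is under 0.02 bits is one assumed reading of the threshold quoted above.

```python
LOW_EVIDENCE_BITS = 0.02  # threshold quoted in the text above


def summarize(records, threshold=LOW_EVIDENCE_BITS):
    """Flag low-evidence answers and measure agreement with a citation heuristic.

    records is a HYPOTHETICAL list of dicts with:
      "scores":         per-sentence JSD scores in bits for one query
      "heuristic_top1": sentence index chosen by the existing citation heuristic
    """
    flagged = []
    agree = 0
    for idx, record in enumerate(records):
        scores = record["scores"]
        top1 = max(range(len(scores)), key=scores.__getitem__)
        # ASSUMPTION: "low evidence" means even the best sentence is under threshold.
        if scores[top1] < threshold:
            flagged.append(idx)
        if top1 == record["heuristic_top1"]:
            agree += 1
    return {"flagged": flagged, "agreement": agree / len(records)}


# Made-up scores for three queries, for illustration only.
records = [
    {"scores": [0.001, 0.41, 0.003], "heuristic_top1": 1},
    {"scores": [0.004, 0.002, 0.01], "heuristic_top1": 0},
    {"scores": [0.30, 0.05, 0.02], "heuristic_top1": 2},
]
result = summarize(records)
print(result)
```

Here only the second query is flagged, since none of its sentences reaches the threshold, and the heuristic agrees with the top-scoring sentence on one query out of three. A low agreement rate on real traffic would indicate that the current citation heuristic often points at sentences the model did not rely on.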

Agent Features

Memory

retrieval context (sentence-level)

Architectures

autoregressive Transformer

Optimization Features

Infra Optimization

lower GFLOPs per sample; practical 3× speedup reported

Training Optimization

none required (inference-only method)

Inference Optimization

reduces forward-call budget vs surrogate/gradient methods; single ablation per sentence (no gradient/backprop)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ruizheliUOA/ARC_JSD

Data URLs

TyDi QA (public), Hotpot QA (public), MuSiQue (public)

Risks & Boundaries

Limitations

Granularity limited to sentence-level in reported experiments; finer spans need extra engineering.

Does not identify individual neurons inside MLPs; layer-level only.

When Not To Use

When you need token- or phrase-level attribution out of the box (paper reports sentence-level).

When the LLM does not expose reliable next-token probabilities or logits.

Failure Modes

All JSD scores very small: means model likely ignored context; ARC-JSD will report low evidence rather than force a label.

If model answer comes from parametric memory (not retrieved context), JSD may be low and attribution will be uninformative.

Core Entities

Models

Qwen2-1.5B-IT, Qwen2-7B-IT, Gemma2-2B-IT, Gemma2-9B-IT, LLaMA-3.1-8B-IT, Qwen3-Next-80B-A3B-IT

Metrics

Jensen-Shannon divergence (bits), Accuracy, GFLOPs per sample, Hallucination rate (%), Pass@1 factual F1 (%)

Datasets

TyDi QA, Hotpot QA, MuSiQue, PubMedQA, MedQuAD, LegalBench

Benchmarks

Accuracy

