At decode time, subtract earlier-layer logits from later-layer logits to reduce hallucinations.

Overview

Decision Snapshot: Ready For Pilot

The method is simple and well-supported by experiments: it leverages observed per-layer knowledge localization and works at inference with small overhead, but it cannot fix wrong facts learned during pretraining.

Citations: 17

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 15%

Production readiness: 70%

Novelty: 55%

Authors

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, Pengcheng He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

DoLa boosts factual output from large pretrained LMs without retraining or external retrieval, giving immediate, low-cost improvements for truth-sensitive products like QA assistants and chatbots.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead, Founder

Summary TLDR

DoLa is a decoding trick that boosts factual outputs from pretrained transformer LMs without extra training or retrieval. At each token step it finds an earlier (“premature”) layer whose output most diverges from the final (“mature”) layer, subtracts the earlier-layer log-probabilities from the later-layer ones, applies a plausibility gate and repetition penalty, and samples from the result. This simple change raises truthfulness on multiple benchmarks (TruthfulQA, FACTOR, StrategyQA, GSM8K) for LLaMA models and MPT-7B, adds only ~1–8% decode latency, and needs only a forward pass.
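The contrast step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `alpha` threshold, and the toy logits are all assumptions, and the premature layer is taken as already chosen. It uses argmax instead of sampling to keep the result deterministic.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrast_scores(mature_logits, premature_logits, alpha=0.1):
    """Contrast one decoding step: mature log-probs minus premature log-probs.

    alpha is a hypothetical plausibility threshold. Tokens whose mature-layer
    probability falls below alpha times the top mature probability are masked
    to -inf so that implausible tokens cannot win on the difference alone.
    """
    mature_lp = log_softmax(mature_logits)
    premature_lp = log_softmax(premature_logits)
    cutoff = max(mature_lp) + math.log(alpha)
    scores = []
    for m_lp, p_lp in zip(mature_lp, premature_lp):
        scores.append(m_lp - p_lp if m_lp >= cutoff else float("-inf"))
    return scores

# Toy vocabulary of three tokens (illustrative values only).
mature = [2.0, 1.5, -3.0]      # final-layer logits
premature = [2.0, 0.0, -3.0]   # earlier-layer logits
scores = contrast_scores(mature, premature)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the final layer alone would prefer token 0, but token 1 gains the most probability between the two layers, so the contrast ranks it first. Token 2 is removed by the plausibility gate.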

Problem Statement

Large LMs hallucinate (produce incorrect facts). Fixes often need retrieval, supervision, or finetuning. The paper asks: can we reduce hallucinations at inference time, using only the model's internal layer signals, with low cost and no extra training?

Main Contribution

DoLa: a decoding method that contrasts logits from a dynamically chosen earlier layer and the final layer to surface factual knowledge.

A dynamic premature-layer selector based on Jensen-Shannon divergence (JSD) that picks which early layer to contrast per token.
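A minimal sketch of the JSD-based selector follows, assuming each candidate layer's hidden state has already been projected to vocabulary logits. The helper names, the `layer_logits` dictionary layout, and the toy values are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def select_premature_layer(layer_logits, candidate_layers, mature_layer):
    """Pick the candidate layer whose distribution diverges most from the mature one."""
    mature = softmax(layer_logits[mature_layer])
    return max(candidate_layers,
               key=lambda j: jsd(mature, softmax(layer_logits[j])))

# Toy per-layer logits for one token position (hypothetical values).
layer_logits = {
    0: [0.0, 0.0, 0.0],    # uniform
    4: [4.0, 0.0, 0.0],    # confident in a different token
    8: [0.0, 3.5, 0.0],    # already close to the final layer
    32: [0.0, 4.0, 0.0],   # mature (final) layer
}
selected = select_premature_layer(layer_logits, [0, 4, 8], mature_layer=32)
```

Here layer 4 is selected because its prediction disagrees most with the final layer. The selection runs once per generated token, so the chosen layer can change from token to token.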

Key Findings

DoLa raises combined truthfulness×informativeness on open-ended TruthfulQA by about 12–17 absolute percentage points for LLaMA models.

Numbers: 12–17 pp improvement on %Truth*Info across LLaMA sizes (Table 1)

Practical Use: You can improve factual output quality of off-the-shelf LLaMA models at inference time without retraining; try DoLa to get large, immediate gains on truth-sensitive prompts.

Evidence Ref: Table 1

Contrasting layers helps factual tokens more than non-factual tokens: entity tokens show larger layer divergence than non-entity tokens.

Numbers: CoNLL-2003 study: critical layer 0 for non-entities 75.6% vs entities 35.6%; higher layers more common for entities (Tab

Practical Use: DoLa works by amplifying information that appears later in the model (facts/names); it is especially useful for questions requiring factual knowledge.

Evidence Ref: Appendix A / Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
%Truth*Info (TruthfulQA open-ended)	42.1 (LLaMA-7B)	30.4	+11.7 pp	TruthfulQA (open-ended) / Table 1	DoLa improves LLaMA-7B %Truth*Info from 30.4 to 42.1 (Table 1)	Table 1
%Truth*Info (TruthfulQA open-ended)	Range across LLaMA sizes	—	+12–17 pp	TruthfulQA (open-ended) / Table 1	Authors report 12–17 absolute points improvement across LLaMA sizes (Table 1)	Table 1
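The deltas in this table are absolute percentage points, not relative gains. The short calculation below, using only the LLaMA-7B figures from the first row, shows how the two differ. The variable names are illustrative.

```python
baseline = 30.4   # LLaMA-7B %Truth*Info without DoLa (Table 1)
dola = 42.1       # LLaMA-7B %Truth*Info with DoLa (Table 1)

# Absolute gain in percentage points.
abs_gain = round(dola - baseline, 1)

# Relative gain as a percentage of the baseline score.
rel_gain = round(100 * (dola - baseline) / baseline, 1)

print(f"absolute: +{abs_gain} pp, relative: +{rel_gain}%")
```

The absolute gain is 11.7 pp, which corresponds to a relative improvement of about 38.5% over the baseline score.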

What To Try In 7 Days

Run DoLa on your production LLM as an inference-time option and compare truth/answer quality on a labeled subset.

Use the paper's JSD-based selector buckets to pick candidate layers (2–4 buckets) — minimal hyperparameter tuning.

Measure latency and memory impact: expect ~1–8% latency increase and small GPU overhead before wider rollout.
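For the bucket step above, a small helper can generate the candidate-layer sets to compare on a validation subset. This is a sketch under assumptions: the contiguous split, the `step` parameter that keeps every other layer, and the function name are not taken from the source, which only states that 2–4 buckets are used.

```python
def layer_buckets(num_layers, num_buckets, step=2):
    """Split layer indices 0..num_layers-1 into contiguous candidate buckets.

    step=2 keeps every other layer inside each bucket, an assumed way to
    reduce the number of early-exit projections computed per token.
    """
    if num_buckets <= 0 or num_layers % num_buckets != 0:
        raise ValueError("num_layers must be divisible by a positive num_buckets")
    size = num_layers // num_buckets
    return [list(range(start, start + size, step))
            for start in range(0, num_layers, size)]

# Example: a 32-layer model split into two buckets.
buckets = layer_buckets(32, 2)
for i, bucket in enumerate(buckets):
    print(f"bucket {i}: layers {bucket}")
```

Each bucket would then be tried as the candidate set for the premature-layer selector, and the bucket with the best score on the labeled subset kept for the pilot.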

Optimization Features

Inference Optimization

DoLa is an inference-time decoding change (no finetuning)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/voidism/DoLa

Data URLs

TruthfulQA, FACTOR (News/Wiki), StrategyQA, GSM8K, Vicuna QA, CoNLL-2003

Risks & Boundaries

Limitations

Only targets factuality; other properties (alignment, safety beyond truthfulness) not addressed.

Inference-only: does not correct misinformation the model learned during training.

When Not To Use

On small models (GPT2-sized) that lack distinct layerwise factual signals.

When the model must be grounded to an external, up-to-date knowledge source (DoLa cannot fetch new facts).

Failure Modes

May generate detailed but incorrect facts (false positives) in some cases.

Can increase repetition in long chain-of-thought outputs unless a repetition penalty is applied.
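One common form of repetition penalty rescales the scores of tokens that were already generated. The sketch below uses that widely used formulation as an assumption; the source does not specify which variant or which penalty value is applied, so the function name and the default `penalty=1.2` are illustrative only.

```python
def apply_repetition_penalty(scores, generated_ids, penalty=1.2):
    """Down-weight tokens that already appear in the generated sequence.

    Positive scores are divided by the penalty and non-positive scores are
    multiplied by it, so a penalized token always becomes less likely.
    The input list is not modified.
    """
    out = list(scores)
    for tok in set(generated_ids):
        s = out[tok]
        out[tok] = s / penalty if s > 0 else s * penalty
    return out

# Tokens 0 and 1 were already generated; token 2 was not.
scores = [2.4, -1.0, 0.5]
penalized = apply_repetition_penalty(scores, generated_ids=[0, 1, 0])
```

Applying the penalty after the layer contrast and before sampling keeps repeated tokens from being reinforced at every step. The right penalty strength should be tuned on your own long-form outputs.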

Core Entities

Models

LLaMA-7B, LLaMA-13B, LLaMA-33B, LLaMA-65B, MPT-7B, GPT2-Medium

Metrics

%Truth, %Info, %Truth*Info, MC1/MC2/MC3 (TruthfulQA multiple-choice variants), Accuracy, Latency (ms/token), Jensen-Shannon Divergence (JSD)

Datasets

TruthfulQA, FACTOR (News/Wiki), StrategyQA, GSM8K, Vicuna QA, CoNLL-2003 (analysis)

Benchmarks

TruthfulQA, FACTOR, StrategyQA, GSM8K, Vicuna QA
