At decode time, subtract earlier-layer log-probabilities from final-layer log-probabilities to reduce hallucinations.

September 7, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.55

Cost Impact Score

0.15

Citation Count

17

Authors

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, Pengcheng He

Links

Abstract / PDF

Why It Matters For Business

DoLa boosts factual output from large pretrained LMs without retraining or external retrieval, giving immediate, low-cost improvements for truth-sensitive products like QA assistants and chatbots.

Summary TLDR

DoLa is a decoding trick that boosts factual outputs from pretrained transformer LMs without extra training or retrieval. At each token step it finds an earlier (“premature”) layer whose output most diverges from the final (“mature”) layer, subtracts the earlier-layer log-probabilities from the later-layer ones, applies a plausibility gate and repetition penalty, and samples from the result. This simple change raises truthfulness on multiple benchmarks (TruthfulQA, FACTOR, StrategyQA, GSM8K) for LLaMA models and MPT-7B, adds only ~1–8% decode latency, and needs only a forward pass.
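The per-token procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy logits, and the plausibility threshold `alpha=0.1` are all chosen here for illustration, and each row of `layer_logits` is assumed to be one layer's hidden state already projected through the model's output head.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax for one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def js_divergence(log_p, log_q):
    """Jensen-Shannon divergence between two distributions given as log-probs."""
    total = 0.0
    for lp, lq in zip(log_p, log_q):
        p, q = math.exp(lp), math.exp(lq)
        mix = 0.5 * (p + q)
        if mix == 0.0:  # both probabilities underflowed; term contributes nothing
            continue
        log_mix = math.log(mix)
        total += 0.5 * (p * (lp - log_mix) + q * (lq - log_mix))
    return total

def dola_step(layer_logits, candidate_layers, alpha=0.1):
    """One contrastive step over per-layer logits (last row = final layer).

    alpha is an assumed plausibility threshold, not a value taken from this summary.
    Returns (chosen_layer, scores); gated-out tokens get a score of -inf.
    """
    log_probs = [log_softmax(row) for row in layer_logits]
    mature = log_probs[-1]
    # Pick the candidate early layer that diverges most from the final layer.
    chosen = max(candidate_layers, key=lambda j: js_divergence(mature, log_probs[j]))
    premature = log_probs[chosen]
    # Plausibility gate: keep tokens whose final-layer probability is at least
    # alpha times the probability of the most likely final-layer token.
    threshold = math.log(alpha) + max(mature)
    scores = [
        lm - lp if lm >= threshold else -math.inf
        for lm, lp in zip(mature, premature)
    ]
    return chosen, scores

# Toy example: 3 layers, vocabulary of 4 tokens (made-up numbers).
layer_logits = [
    [2.0, 0.0, 0.0, 0.0],   # early layer 0
    [1.0, 1.0, 0.0, 0.0],   # early layer 1
    [1.0, 2.0, 0.0, -5.0],  # final layer
]
chosen, scores = dola_step(layer_logits, candidate_layers=[0, 1])
next_token = max(range(len(scores)), key=scores.__getitem__)
```

A real integration would read the per-layer logits from a single forward pass of the model and would apply a repetition penalty before choosing the next token; both are left out here to keep the sketch short.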

Problem Statement

Large LMs hallucinate (produce incorrect facts). Fixes often need retrieval, supervision, or finetuning. The paper asks: can we reduce hallucinations at inference time, using only the model's internal layer signals, with low cost and no extra training?

Main Contribution

DoLa: a decoding method that contrasts logits from a dynamically chosen earlier layer and the final layer to surface factual knowledge.

A dynamic premature-layer selector based on Jensen-Shannon divergence (JSD) that picks which early layer to contrast per token.

Empirical gains in truthfulness across short-answer and open-ended benchmarks (TruthfulQA, FACTOR) and improved chain-of-thought reasoning (StrategyQA, GSM8K) without finetuning.

Practicality evidence: single-model forward passes, negligible memory overhead, and small latency increase (≈1–8%).

Public code release (GitHub) for replication and adoption.
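The JSD-based selector named in the contributions above can be written compactly. The notation is introduced here for illustration: q_N is the next-token distribution from the final layer N, q_j is the distribution obtained by applying the same output head to an earlier layer j, and J is the set of candidate early layers.

```latex
\[
M \;=\; \arg\max_{j \in \mathcal{J}} \; \mathrm{JSD}\bigl(q_N \,\|\, q_j\bigr),
\qquad
\mathrm{JSD}(P \,\|\, Q) \;=\;
\tfrac{1}{2}\,\mathrm{KL}\!\left(P \,\Big\|\, \tfrac{P+Q}{2}\right)
+ \tfrac{1}{2}\,\mathrm{KL}\!\left(Q \,\Big\|\, \tfrac{P+Q}{2}\right)
\]
```

The selected layer M is recomputed at every token step, so different tokens may be contrasted against different early layers.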

Key Findings

DoLa raises combined truthfulness×informativeness on open-ended TruthfulQA by about 12–17 absolute percentage points for LLaMA models.

Numbers: 12–17 pp improvement on %Truth*Info across LLaMA sizes (Table 1)

Contrasting layers helps factual tokens more than non-factual tokens: entity tokens show larger layer divergence than non-entity tokens.

Numbers: CoNLL-2003 study: critical layer 0 for non-entities 75.6% vs entities 35.6%; higher layers more common for entities (Tab

DoLa only slightly increases decoding cost: latency per token rises by 1%–8% across model sizes.

Numbers: Latency multiplier 1.01–1.08; e.g., LLaMA-13B 77.3 ms → 83.1 ms (×1.08) (Table 2)

DoLa fails on small LMs and can hurt performance there.

Numbers: GPT2-Medium: baseline MC2 41.9% → DoLa 41.4%; FACTOR News 41.0% → 22.2% (Table 17)
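As a quick sanity check, the LLaMA-13B latency figures quoted in the findings (77.3 ms to 83.1 ms per token) can be turned into a multiplier and a percentage overhead. The helper name below is hypothetical; only the two timings come from the summary.

```python
def latency_multiplier(baseline_ms, dola_ms):
    """Ratio of DoLa per-token latency to baseline per-token latency."""
    return dola_ms / baseline_ms

mult = latency_multiplier(77.3, 83.1)   # timings quoted for LLaMA-13B
overhead_pct = (mult - 1.0) * 100.0     # extra latency as a percentage
print(f"x{mult:.2f}, +{overhead_pct:.1f}%")
```

The same two-line calculation works for timings measured on your own hardware, which may differ from the reported ones.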

Results

%Truth*Info (TruthfulQA open-ended)

Value: LLaMA-7B: baseline 30.4 → DoLa 42.1

Baseline: 30.4

%Truth*Info (TruthfulQA open-ended)

Value: Range: baseline → DoLa shows +12–17 pp across LLaMA sizes

Accuracy

Value: LLaMA-33B: baseline 33.8 → DoLa 35.5

Baseline: 33.8

Decoding latency

Value: Latency per token increases by 1%–8% depending on model

Layer divergence (critical layer distribution)

Value: Entity vs non-entity critical layer 0: 35.6% vs 75.6%

Who Should Care

What To Try In 7 Days

Run DoLa on your production LLM as an inference-time option and compare truth/answer quality on a labeled subset.

Use the paper's JSD-based selector with 2–4 buckets of candidate layers; this keeps hyperparameter tuning minimal.

Measure latency and memory impact before wider rollout: expect a ~1–8% latency increase and a small GPU memory overhead.
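The comparison suggested in the steps above can be run with a small model-agnostic harness. This is a sketch under assumptions: `baseline_generate` and `dola_generate` are hypothetical callables that wrap your own serving stack (prompt in, text out), and `is_correct` is whatever grading rule fits your labeled subset.

```python
import time
from statistics import mean

def evaluate(generate, examples, is_correct):
    """Run one decoding variant over (prompt, reference) pairs."""
    latencies, hits = [], 0
    for prompt, reference in examples:
        start = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - start)
        hits += bool(is_correct(output, reference))
    return {"accuracy": hits / len(examples), "mean_latency_s": mean(latencies)}

def compare(baseline_generate, dola_generate, examples, is_correct):
    """Report accuracy delta and latency ratio of the DoLa variant vs baseline."""
    base = evaluate(baseline_generate, examples, is_correct)
    dola = evaluate(dola_generate, examples, is_correct)
    base_latency = base["mean_latency_s"]
    return {
        "baseline": base,
        "dola": dola,
        "accuracy_delta": dola["accuracy"] - base["accuracy"],
        # Guard against a zero baseline time (possible with trivial stubs).
        "latency_ratio": (
            dola["mean_latency_s"] / base_latency if base_latency > 0 else float("nan")
        ),
    }

# Stub generators standing in for real model calls (illustration only).
answers = {"Capital of France?": "Paris", "Capital of Italy?": "Rome"}
examples = list(answers.items())
report = compare(
    baseline_generate=lambda prompt: "Paris",
    dola_generate=lambda prompt: answers[prompt],
    examples=examples,
    is_correct=lambda output, reference: reference.lower() in output.lower(),
)
```

With real models, run both variants on the same hardware and batch size so that the latency ratio is comparable to the 1.01–1.08 range reported above.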

Optimization Features

Inference Optimization

  • DoLa is an inference-time decoding change (no finetuning)

Reproducibility

Data URLs

  • TruthfulQA
  • FACTOR (News/Wiki)
  • StrategyQA
  • GSM8K
  • Vicuna QA
  • CoNLL-2003

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Only targets factuality; other properties (alignment, safety beyond truthfulness) are not addressed.
  • Inference-only: does not correct misinformation the model learned during training.
  • Fails on small LMs: DoLa harms performance for models without rich layerwise knowledge (e.g., GPT2-Medium).

When Not To Use

  • On small models (GPT2-sized) that lack distinct layerwise factual signals.
  • When the model must be grounded to an external, up-to-date knowledge source (DoLa cannot fetch new facts).
  • When zero added latency is mandatory (e.g., extreme low-latency edge devices).

Failure Modes

  • May generate detailed but incorrect facts (false positives) in some cases.
  • Can increase repetition in long chain-of-thought outputs unless a repetition penalty is applied.
  • Relies on the model's internal knowledge — cannot correct entrenched training errors.
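The repetition failure mode listed above is usually handled with a repetition penalty on tokens that have already been generated. The sketch below uses the common divide-if-positive, multiply-if-negative rule; the function name and the default `theta=1.2` are assumptions made for illustration, not values taken from this summary.

```python
def apply_repetition_penalty(scores, generated_ids, theta=1.2):
    """Lower the scores of already-generated tokens.

    scores: list of per-token scores for the next step (higher is better).
    generated_ids: token ids produced so far.
    theta: penalty strength (> 1); an assumed default.
    """
    penalized = list(scores)  # leave the caller's list untouched
    for token_id in set(generated_ids):
        s = penalized[token_id]
        # Positive scores shrink, negative scores become more negative.
        penalized[token_id] = s / theta if s > 0 else s * theta
    return penalized

# Tokens 0 and 1 were already generated; token 2 was not.
penalized = apply_repetition_penalty([2.4, -1.0, 0.5], generated_ids=[0, 1, 0])
```

Tokens already masked to negative infinity by a plausibility gate stay masked, since multiplying negative infinity by a positive factor leaves it unchanged.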

Core Entities

Models

  • LLaMA-7B
  • LLaMA-13B
  • LLaMA-33B
  • LLaMA-65B
  • MPT-7B
  • GPT2-Medium

Metrics

  • %Truth
  • %Info
  • %Truth*Info
  • MC1/MC2/MC3 (TruthfulQA multiple-choice variants)
  • Accuracy
  • Latency ms/token
  • Jensen-Shannon Divergence (JSD)

Datasets

  • TruthfulQA
  • FACTOR (News/Wiki)
  • StrategyQA
  • GSM8K
  • Vicuna QA
  • CoNLL-2003 (analysis)

Benchmarks

  • TruthfulQA
  • FACTOR
  • StrategyQA
  • GSM8K
  • Vicuna QA