GRATH: make a 7B LLM substantially more truthful using self-generated paired answers and DPO

January 22, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

Weixin Chen, Dawn Song, Bo Li

Why It Matters For Business

You can substantially reduce factual errors for deployed 7B models without costly human annotation by generating paired data and fine-tuning with DPO; this is fast and parameter-efficient with LoRA.

Summary TLDR

GRATH is a lightweight post-processing recipe that makes a pretrained LLM more truthful without human annotations. It prompts the model to create pairs of plausible correct/incorrect answers to out-of-domain questions, then fine-tunes via Direct Preference Optimization (DPO) using LoRA. One iteration (generate, then one DPO update) already yields large TruthfulQA gains for 7B models (e.g., Llama2-Chat-7B MC1 30.23% → 54.71%). The method shifts other benchmarks only slightly, can overfit if iterated too long, and works best when the few-shot demonstrations, and hence the generated answers, match the test domain.

Problem Statement

LLMs still produce factually incorrect answers (hallucinations). Annotating large-scale question-answer truth data is costly. Can we use out-of-domain questions and self-generated paired answers to improve truthfulness without human labels?

Main Contribution

Propose GRATH: use model-generated correct/incorrect answer pairs + DPO to improve truthfulness without human labels.

Introduce gradual self-truthifying: iteratively refine generated correct answers and re-run DPO to boost gains.

Empirically show large truthfulness gains on TruthfulQA (MC1/MC2) for multiple 7B models while largely preserving core capabilities.
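The generate-then-DPO loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `grath` and the three callables (`generate_pair`, `regenerate_correct`, `dpo_update`) are hypothetical placeholders for model prompting and training. Keeping the incorrect answers fixed across iterations and continuing training from the previously updated model are assumptions; the source only states that the correct answers are refined and DPO is re-run.

```python
from typing import Callable, List, Tuple

# (question, chosen "correct" answer, rejected "incorrect" answer)
Pair = Tuple[str, str, str]


def grath(
    model,
    questions: List[str],
    generate_pair: Callable,       # hypothetical: (model, question) -> (correct, incorrect)
    regenerate_correct: Callable,  # hypothetical: (model, question) -> refined correct answer
    dpo_update: Callable,          # hypothetical: (model, pairs) -> fine-tuned model
    iterations: int = 1,
):
    """Run `iterations` rounds of self-generated pairs followed by a DPO update."""
    # Step 1: the model writes its own correct/incorrect pair for each question.
    pairs: List[Pair] = [(q, *generate_pair(model, q)) for q in questions]

    for step in range(iterations):
        # Step 2: one DPO run that prefers the "correct" answer of each pair.
        model = dpo_update(model, pairs)

        # Step 3 (gradual self-truthifying): the updated model rewrites only the
        # correct answers. Keeping the incorrect ones fixed is an assumption.
        if step + 1 < iterations:
            pairs = [(q, regenerate_correct(model, q), wrong) for q, _, wrong in pairs]

    return model, pairs
```

With `iterations=1` this reduces to the single generate-then-DPO pass that the summary reports as already giving large gains.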

Key Findings

GRATH lifts Llama2-Chat-7B MC1 from 30.23% to 54.71% and MC2 from 45.32% to 69.10% on TruthfulQA.

Numbers: MC1 +24.48pp, MC2 +23.78pp (Table 1)

GRATH improves Zephyr (7B) as well: MC1 42.23% → 53.86% (+11.63pp); MC2 57.83% → 66.73% (+8.90pp).

Numbers: MC1 +11.63pp, MC2 +8.90pp (Table 1)

GRATH largely preserves core capabilities, with only small shifts for Llama2-Chat-7B: ARC +5.03pp, HellaSwag +1.13pp, MMLU −1.26pp.

Numbers: ARC 52.73→57.76, HellaSwag 78.50→79.63, MMLU 48.14→46.88 (Table 1)

DPO's effectiveness falls when the domain gap grows; better few-shot demonstrations (in-domain style) produce better generated pairs and higher truthfulness.

Numbers: Performance drops as the perturbation parameter topp increases (Figure 3); DPO Q_OOD_G(FS_IND) is best among variations (Fig

Larger distributional/pairwise distance between correct and incorrect answers correlates with higher truth gains.

Numbers: Mean pairwise distance rose 65.94 → 87.80 between DPO1 and DPO2; corresponding accuracy increased (Figure 5).
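To make the pairwise-distance finding concrete, here is a small sketch of how a mean distance between paired correct and incorrect answers could be computed. The source does not say which representation or metric produced the reported values, so both choices here are assumptions: answers are taken to be already embedded as numeric vectors, and the distance is Euclidean. The function name is hypothetical.

```python
import math
from typing import Sequence


def mean_pairwise_distance(
    correct_vecs: Sequence[Sequence[float]],
    incorrect_vecs: Sequence[Sequence[float]],
) -> float:
    """Average Euclidean distance between each correct answer vector and its
    paired incorrect answer vector (metric and embeddings are assumptions)."""
    if not correct_vecs or len(correct_vecs) != len(incorrect_vecs):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(math.dist(c, w) for c, w in zip(correct_vecs, incorrect_vecs))
    return total / len(correct_vecs)
```

A larger value means the preferred and rejected answers are further apart in the chosen representation, which is the quantity the finding associates with higher truthfulness gains.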

Results

TruthfulQA MC1 (Llama2-Chat-7B)

Value: 54.71%

Baseline: 30.23% (pretrained)

TruthfulQA MC2 (Llama2-Chat-7B)

Value: 69.10%

Baseline: 45.32% (pretrained)

TruthfulQA MC1 (Zephyr)

Value: 53.86%

Baseline: 42.23% (pretrained)

Core benchmarks (ARC / HellaSwag / MMLU)

Value: ARC 57.76 / HellaSwag 79.63 / MMLU 46.88

Baseline: ARC 52.73 / HellaSwag 78.50 / MMLU 48.14
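The percentage-point changes quoted under Key Findings follow directly from these value/baseline pairs. The short check below recomputes them; the numbers are copied from the results above, while the dictionary layout and the helper name `gain_pp` are only illustrative.

```python
# (baseline, value) pairs taken from the Results section above.
results = {
    "TruthfulQA MC1 (Llama2-Chat-7B)": (30.23, 54.71),
    "TruthfulQA MC2 (Llama2-Chat-7B)": (45.32, 69.10),
    "TruthfulQA MC1 (Zephyr)": (42.23, 53.86),
    "ARC (Llama2-Chat-7B)": (52.73, 57.76),
    "HellaSwag (Llama2-Chat-7B)": (78.50, 79.63),
    "MMLU (Llama2-Chat-7B)": (48.14, 46.88),
}


def gain_pp(baseline: float, value: float) -> float:
    """Absolute change in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


gains = {name: gain_pp(b, v) for name, (b, v) in results.items()}

for name, delta in gains.items():
    print(f"{name}: {delta:+.2f}pp")
```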

Who Should Care

What To Try In 7 Days

Pick a 7B chat model and a diverse question set (e.g., ARC-Challenge).

Prompt the model with 6 in-domain few-shot examples to produce correct/incorrect pairs.

Fine-tune with DPO + LoRA for ~1000 steps (one DPO run takes about 1 hour on an A6000). Start with T=1 iteration and evaluate on TruthfulQA.
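The pair-generation step in this plan can be prototyped with plain string handling before any training code is written. The sketch below builds a few-shot prompt from demonstrations, parses a completion into a correct/incorrect pair, and packs it into a record. The prompt wording and all function names are hypothetical and not taken from the paper; the `prompt` / `chosen` / `rejected` keys follow the preference-data layout commonly used by DPO training libraries, which is an assumption about your tooling.

```python
from typing import Dict, List, Tuple

Demo = Tuple[str, str, str]  # (question, correct answer, incorrect answer)


def build_pair_prompt(demos: List[Demo], question: str) -> str:
    """Few-shot prompt asking for one correct and one incorrect answer."""
    lines = [
        "For each question, give one correct answer and one plausible but incorrect answer.",
        "",
    ]
    for q, good, bad in demos:
        lines += [f"Question: {q}", f"Correct answer: {good}", f"Incorrect answer: {bad}", ""]
    # Leave the final answer slot open for the model to complete.
    lines += [f"Question: {question}", "Correct answer:"]
    return "\n".join(lines)


def parse_pair(completion: str) -> Tuple[str, str]:
    """Split a completion into (correct, incorrect); raise if it is malformed."""
    correct, sep, rest = completion.partition("Incorrect answer:")
    incorrect_lines = rest.strip().splitlines()
    if not sep or not correct.strip() or not incorrect_lines:
        raise ValueError("completion does not contain both answers")
    # Keep only the first line after the marker, dropping any run-on text.
    return correct.strip(), incorrect_lines[0].strip()


def to_preference_record(question: str, correct: str, incorrect: str) -> Dict[str, str]:
    """Preference example: the correct answer is chosen, the incorrect one rejected."""
    return {"prompt": question, "chosen": correct, "rejected": incorrect}
```

Completions that fail to parse are better dropped than repaired by hand, since the recipe relies on the model's own outputs rather than human labels.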

Optimization Features

Training Optimization

  • LoRA
  • DPO objective (direct preference optimization)
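For readers unfamiliar with these two techniques, the sketch below shows the standard per-example DPO loss and a count of the extra trainable weights LoRA adds. It is an illustration under stated assumptions, not the paper's configuration: the default `beta=0.1` and the LoRA settings in the usage note (rank 8, query and value projections only) are assumptions, since the source does not report them. The inputs to `dpo_loss` are summed log-probabilities of each answer under the trained policy and a frozen reference model.

```python
import math


def lora_trainable_params(d_in: int, d_out: int, rank: int, num_matrices: int) -> int:
    """LoRA adds two low-rank factors (d_in x rank and rank x d_out) per adapted matrix."""
    return num_matrices * rank * (d_in + d_out)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state it
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable evaluation of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

As a usage example, `lora_trainable_params(4096, 4096, rank=8, num_matrices=64)` gives 4,194,304 extra weights for a model with hidden size 4096 and 32 layers when only the query and value projections are adapted, a small fraction of a 7B model. The loss equals log 2 when the policy and reference agree, and falls below it as the policy raises the chosen answer relative to the rejected one.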

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Risk of overfitting if GRATH is iterated many times; MC2 can become NaN after repeated DPO.
  • Self-generated "correct" answers are sometimes factually wrong, so the model learns relative rather than absolute truthfulness.
  • Effectiveness depends on match between few-shot demonstrations and target domain (domain gap).
  • No public code release is reported in the paper, which may slow reproduction.

When Not To Use

  • When absolute, auditable factual correctness is required without human verification.
  • If you cannot run parameter-efficient fine-tuning or lack GPU resources.
  • If you cannot tolerate any potential small degradations on other benchmarks.

Failure Modes

  • Overfitting to self-generated pairs causing degraded fluency or NaN MC2 scores.
  • Model learning relative differences incorrectly if generated 'correct' answers are frequently wrong.
  • Domain mismatch between demonstrations and target queries reduces gains.

Core Entities

Models

  • Llama2-Chat-7B
  • Zephyr
  • Llama2-Chat-13B
  • Llama2-Chat-70B
  • Xwin-LM

Metrics

  • Accuracy
  • normalized probability (MC2)
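The two TruthfulQA multiple-choice scores can be sketched per question as follows. MC1 checks whether the single highest-scoring choice is a true answer; MC2 is the share of total probability mass that falls on the true answers. The function names and the input layout (one log-probability and one truth flag per answer choice) are assumptions for illustration; the benchmark score is the average of these values over all questions.

```python
import math
from typing import List


def mc1(log_probs: List[float], is_true: List[bool]) -> float:
    """1.0 if the choice with the highest log-probability is a true answer."""
    best = max(range(len(log_probs)), key=lambda i: log_probs[i])
    return 1.0 if is_true[best] else 0.0


def mc2(log_probs: List[float], is_true: List[bool]) -> float:
    """Probability mass on true answers, normalized over all answer choices."""
    # Subtracting the maximum before exponentiating avoids underflow to zero.
    top = max(log_probs)
    probs = [math.exp(lp - top) for lp in log_probs]
    true_mass = sum(p for p, flag in zip(probs, is_true) if flag)
    return true_mass / sum(probs)
```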

Datasets

  • TruthfulQA
  • ARC-Challenge
  • HellaSwag
  • MMLU

Benchmarks

  • TruthfulQA MC1
  • TruthfulQA MC2
  • ARC-Challenge
  • HellaSwag
  • MMLU