GRATH: make a 7B LLM substantially more truthful using self-generated paired answers and DPO

January 22, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper reports consistent TruthfulQA gains across multiple 7B models using public datasets and LoRA + DPO. Its ablations cover the domain gap and the number of iterations, and the authors warn of overfitting and MC2 instability when GRATH is iterated too long.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Weixin Chen, Dawn Song, Bo Li

Why It Matters For Business

You can substantially reduce factual errors (as measured on TruthfulQA) for deployed 7B models without costly human annotation by generating paired data and fine-tuning with DPO. With LoRA, the fine-tuning is fast and parameter-efficient.

Summary TLDR

GRATH is a lightweight post-processing recipe that makes a pretrained LLM more truthful without human annotations. It prompts the model to create pairs of plausible correct and incorrect answers to out-of-domain questions, then fine-tunes it via Direct Preference Optimization (DPO) using LoRA. A single iteration (generate pairs, then run one DPO update) already yields large TruthfulQA gains for 7B models (e.g., Llama2-Chat-7B MC1 30.23% → 54.71%). The method causes only small changes on other benchmarks, can overfit if iterated too long, and works best when the generated answers match the test domain.
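To make the DPO step concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, where the self-generated correct answer is "chosen" and the incorrect one is "rejected". The function name, the argument names, and the default `beta=0.1` are illustrative assumptions, not values taken from the paper.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the model being
    trained (policy) and under the frozen reference model.
    beta=0.1 is an assumed default, not a value from the paper.
    """
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), evaluated in a stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss falls below log 2 once the policy favors the chosen answer more strongly than the reference does, which is the direction a truthfulness pair is meant to push the model.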

Problem Statement

LLMs still produce factually incorrect answers (hallucinations). Annotating large-scale question-answer truth data is costly. Can we use out-of-domain questions and self-generated paired answers to improve truthfulness without human labels?

Main Contribution

Propose GRATH: use model-generated correct/incorrect answer pairs + DPO to improve truthfulness without human labels.

Introduce gradual self-truthifying: iteratively refine generated correct answers and re-run DPO to boost gains.
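The two contributions above can be sketched as a single loop. This is a hedged outline rather than the authors' implementation: `generate_pair`, `refine_correct`, and `dpo_update` are hypothetical callables supplied by the caller, and the exact ordering of refinement and DPO within an iteration is an assumption.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (question, correct_answer, incorrect_answer)

def gradual_self_truthify(
    model,
    questions: List[str],
    generate_pair: Callable,   # hypothetical: (model, question) -> (correct, incorrect)
    refine_correct: Callable,  # hypothetical: (model, question, correct) -> new correct
    dpo_update: Callable,      # hypothetical: (model, pairs) -> updated model
    iterations: int = 1,
):
    """Generate answer pairs once, then alternate refinement and DPO."""
    pairs: List[Pair] = [(q, *generate_pair(model, q)) for q in questions]
    for step in range(iterations):
        if step > 0:
            # Assumed ordering: the updated model rewrites only the
            # "correct" side; the incorrect answers are kept as they are.
            pairs = [(q, refine_correct(model, q, good), bad)
                     for q, good, bad in pairs]
        model = dpo_update(model, pairs)
    return model, pairs
```

With `iterations=1` this reduces to the basic recipe: generate pairs, then run one DPO update.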

Key Findings

GRATH lifts Llama2-Chat-7B MC1 from 30.23% to 54.71% and MC2 from 45.32% to 69.10% on TruthfulQA.

Numbers: MC1 +24.48pp, MC2 +23.78pp (Table 1)

Practical Use: You can make a 7B model reach or exceed much larger models' truthfulness on TruthfulQA with moderate fine-tuning.

Evidence Ref: Table 1

GRATH improves Zephyr (7B) as well: MC1 42.23% → 53.86% (+11.63pp); MC2 57.83% → 66.73% (+8.90pp).

Numbers: MC1 +11.63pp, MC2 +8.90pp (Table 1)

Practical Use: The approach generalizes to different 7B models, so the same pipeline can be applied to other small or medium LLMs.

Evidence Ref: Table 1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| TruthfulQA MC1 (Llama2-Chat-7B) | 54.71% | 30.23% (pretrained) | +24.48pp | TruthfulQA MC1 | GRATH Llama2 in Table 1 | Table 1 |
| TruthfulQA MC2 (Llama2-Chat-7B) | 69.10% | 45.32% (pretrained) | +23.78pp | TruthfulQA MC2 | GRATH Llama2 in Table 1 | Table 1 |
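The deltas are plain percentage-point differences between the GRATH score and the pretrained baseline. The short check below recomputes them from the figures reported in this summary; the helper names are illustrative.

```python
# (baseline %, GRATH %) on TruthfulQA, copied from the figures reported above.
REPORTED = {
    ("Llama2-Chat-7B", "MC1"): (30.23, 54.71),
    ("Llama2-Chat-7B", "MC2"): (45.32, 69.10),
    ("Zephyr", "MC1"): (42.23, 53.86),
    ("Zephyr", "MC2"): (57.83, 66.73),
}

def delta_pp(baseline: float, value: float) -> float:
    """Percentage-point difference, rounded to two decimals."""
    return round(value - baseline, 2)

def format_delta(baseline: float, value: float) -> str:
    """Render a delta the way the table does, e.g. '+24.48pp'."""
    return f"{delta_pp(baseline, value):+.2f}pp"

if __name__ == "__main__":
    for (model, metric), (base, val) in REPORTED.items():
        print(f"{model} {metric}: {base:.2f}% -> {val:.2f}% ({format_delta(base, val)})")
```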

What To Try In 7 Days

Pick a 7B chat model and a diverse question set (e.g., ARC-Challenge).

Prompt the model with 6 in-domain few-shot examples to produce correct/incorrect pairs.

Fine-tune with DPO + LoRA for ~1000 steps (one DPO run takes about 1 hour on an A6000 GPU). Start with a single iteration (T=1) and evaluate on TruthfulQA.
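The data-preparation part of these steps can be sketched as below, under stated assumptions: the prompt wording and the helper names are hypothetical, and the `prompt` / `chosen` / `rejected` keys follow a common preference-dataset convention (for example, the one used by Hugging Face TRL), not a format confirmed by the paper.

```python
from typing import Dict, List, Tuple

Demo = Tuple[str, str, str]  # (question, correct_answer, incorrect_answer)

def build_pair_prompt(demos: List[Demo], question: str) -> str:
    """Few-shot prompt asking for a correct and an incorrect answer.

    The template wording is an assumption made for illustration.
    """
    blocks = [
        f"Question: {q}\nCorrect answer: {good}\nIncorrect answer: {bad}"
        for q, good, bad in demos
    ]
    blocks.append(f"Question: {question}\nCorrect answer:")
    return "\n\n".join(blocks)

def parse_pair(completion: str) -> Tuple[str, str]:
    """Split a model completion into (correct, incorrect) answers."""
    good, sep, bad = completion.partition("Incorrect answer:")
    good, bad = good.strip(), bad.strip()
    if not sep or not good or not bad:
        raise ValueError("completion does not contain both answers")
    # Keep only the first line so any trailing generated text is dropped.
    return good, bad.splitlines()[0].strip()

def to_dpo_record(question: str, good: str, bad: str) -> Dict[str, str]:
    """One preference record: the correct answer is preferred."""
    return {
        "prompt": f"Question: {question}\nAnswer:",
        "chosen": " " + good,
        "rejected": " " + bad,
    }
```

Completions that fail to parse can simply be skipped, leaving a list of records to hand to whichever DPO trainer you use.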

Optimization Features

Training Optimization
LoRA, DPO objective (Direct Preference Optimization)

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Risk of overfitting if GRATH is iterated many times; MC2 can become NaN after repeated DPO.

Self-generated "correct" answers are sometimes factually wrong, so the model learns relative truthfulness (one answer is better than the other) rather than absolute correctness.

When Not To Use

When absolute, auditable factual correctness is required without human verification.

If you cannot run parameter-efficient fine-tuning or lack GPU resources.

Failure Modes

Overfitting to self-generated pairs causing degraded fluency or NaN MC2 scores.

The model learns the wrong relative preferences if the generated 'correct' answers are frequently wrong.
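One way to guard against the overfitting failure mode is to score each iteration on held-out data and stop at the first sign of degradation. The rule below is an illustrative assumption, not a procedure from the paper: it keeps the last iteration before MC1 stops improving or either metric becomes NaN.

```python
import math
from typing import List, Tuple

def pick_iteration(history: List[Tuple[float, float]]) -> int:
    """Return the index of the iteration to keep.

    history[t] holds (mc1, mc2) measured after t iterations on held-out
    data, with index 0 being the untouched base model.
    """
    best = 0
    for t in range(1, len(history)):
        mc1, mc2 = history[t]
        if math.isnan(mc1) or math.isnan(mc2):
            break  # an unstable metric is treated as a stop signal
        if mc1 <= history[best][0]:
            break  # no further improvement: likely overfitting
        best = t
    return best
```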

Core Entities

Models

Llama2-Chat-7B, Zephyr, Llama2-Chat-13B, Llama2-Chat-70B, Xwin-LM

Metrics

Accuracy, normalized probability (MC2)
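As a rough guide to how the two TruthfulQA multiple-choice metrics are usually computed from per-answer log-probabilities: MC1 checks whether the single correct choice gets the highest score, and MC2 is the share of probability mass placed on the true answers. The function names are illustrative, and real evaluation harnesses may differ in details such as length normalization.

```python
import math
from typing import List

def mc1_correct(choice_logps: List[float], correct_index: int) -> bool:
    """MC1 for one question: is the top-scored choice the correct one?"""
    top = max(range(len(choice_logps)), key=choice_logps.__getitem__)
    return top == correct_index

def mc2_score(true_logps: List[float], false_logps: List[float]) -> float:
    """MC2 for one question: normalized probability of the true answers."""
    p_true = sum(math.exp(lp) for lp in true_logps)
    p_false = sum(math.exp(lp) for lp in false_logps)
    total = p_true + p_false
    if total == 0.0:
        # Undefined when every probability underflows to zero.
        return float("nan")
    return p_true / total
```

Benchmark-level MC1 and MC2 are the averages of these per-question values.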

Datasets

TruthfulQA, ARC-Challenge, HellaSwag, MMLU

Benchmarks

TruthfulQA MC1, TruthfulQA MC2, ARC-Challenge, HellaSwag, MMLU