GRATH: make a 7B LLM substantially more truthful using self-generated paired answers and DPO

Overview

Decision Snapshot: Ready For Pilot

The paper reports consistent TruthfulQA gains across multiple 7B models using public datasets and LoRA+DPO. Experiments include ablations on domain gap and iteration count, and the authors warn of overfitting and MC2 instability when the procedure is iterated too long.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Weixin Chen, Dawn Song, Bo Li

Links

Abstract / PDF

Why It Matters For Business

You can substantially reduce factual errors for deployed 7B models without costly human annotation by generating paired data and fine-tuning with DPO; this is fast and parameter-efficient with LoRA.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

GRATH is a lightweight post-processing recipe that makes a pretrained LLM more truthful without human annotations. It prompts the model to create pairs of plausible correct/incorrect answers to out-of-domain questions, then fine-tunes via Direct Preference Optimization (DPO) using LoRA. One iteration (generate, then one DPO update) already yields large TruthfulQA gains for 7B models (e.g., Llama2-Chat-7B MC1 30.23% → 54.71%). The cost is small changes on other benchmarks. The method can overfit if iterated too long, and it works best when the generated answers match the test domain.

Problem Statement

LLMs still produce factually incorrect answers (hallucinations). Annotating large-scale question-answer truth data is costly. Can we use out-of-domain questions and self-generated paired answers to improve truthfulness without human labels?

Main Contribution

Propose GRATH: use model-generated correct/incorrect answer pairs + DPO to improve truthfulness without human labels.

Introduce gradual self-truthifying: iteratively refine generated correct answers and re-run DPO to boost gains.
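The two contributions above can be read as a simple loop. The following is a minimal sketch of that control flow only, not the authors' implementation: the function names (`generate_pair`, `refine_chosen`, `dpo_update`) are hypothetical placeholders, and the choice to keep fine-tuning the latest model in each round is an assumption this summary does not confirm.

```python
from typing import Callable, Dict, List, Tuple

Pair = Dict[str, str]  # {"prompt": ..., "chosen": ..., "rejected": ...}


def gradual_self_truthify(
    model,
    questions: List[str],
    generate_pair: Callable[[object, str], Pair],   # hypothetical: model writes a correct/incorrect pair
    refine_chosen: Callable[[object, Pair], str],   # hypothetical: model rewrites the "correct" answer
    dpo_update: Callable[[object, List[Pair]], object],  # hypothetical: one DPO fine-tuning run
    iterations: int = 1,
) -> Tuple[object, List[Pair]]:
    """Generate pairs once, then alternate refinement and DPO updates."""
    pairs = [generate_pair(model, q) for q in questions]
    for t in range(iterations):
        if t > 0:
            # Later rounds: the updated model regenerates only the preferred answers.
            pairs = [{**p, "chosen": refine_chosen(model, p)} for p in pairs]
        model = dpo_update(model, pairs)
    return model, pairs


# Toy stand-ins so the loop can be run without a real LLM.
# An integer "version" plays the role of the model.
def toy_generate(m, q):
    return {"prompt": q, "chosen": f"correct-v{m}", "rejected": "incorrect-v0"}


def toy_refine(m, p):
    return f"correct-v{m}"


def toy_dpo(m, pairs):
    return m + 1


final_model, final_pairs = gradual_self_truthify(
    0, ["q1", "q2"], toy_generate, toy_refine, toy_dpo, iterations=2
)
```

With `iterations=1` the loop reduces to the single generate-then-DPO pass, which is the setting the summary recommends trying first.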

Key Findings

GRATH lifts Llama2-Chat-7B MC1 from 30.23% to 54.71% and MC2 from 45.32% to 69.10% on TruthfulQA.

Numbers: MC1 +24.48pp, MC2 +23.78pp (Table 1)

Practical Use: You can make a 7B model reach or exceed much larger models' truthfulness on TruthfulQA with moderate fine-tuning.

Evidence Ref: Table 1

GRATH improves Zephyr (7B) as well: MC1 42.23% → 53.86% (+11.63pp); MC2 57.83% → 66.73% (+8.90pp).

Numbers: MC1 +11.63pp, MC2 +8.90pp (Table 1)

Practical Use: The approach generalizes to different 7B models; apply the same pipeline to other small/medium LLMs.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
TruthfulQA MC1 (Llama2-Chat-7B)	54.71%	30.23% (pretrained)	+24.48pp	TruthfulQA MC1	GRATH Llama2 in Table 1	Table 1
TruthfulQA MC2 (Llama2-Chat-7B)	69.10%	45.32% (pretrained)	+23.78pp	TruthfulQA MC2	GRATH Llama2 in Table 1	Table 1
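The Delta column is a difference in percentage points (pp), not a relative change. As a quick sanity check, the short sketch below recomputes it from the Value and Baseline columns; the dictionary layout and the `delta_pp` helper are illustrative only.

```python
# Values copied from the table above (percent).
results = {
    "MC1": {"grath": 54.71, "baseline": 30.23},
    "MC2": {"grath": 69.10, "baseline": 45.32},
}


def delta_pp(grath: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(grath - baseline, 2)


deltas = {name: delta_pp(r["grath"], r["baseline"]) for name, r in results.items()}
print(deltas)
```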

What To Try In 7 Days

Pick a 7B chat model and a diverse question set (e.g., ARC-Challenge).

Prompt the model with 6 in-domain few-shot examples to produce correct/incorrect pairs.

Fine-tune with DPO + LoRA for ~1000 steps (one DPO run ≈ 1 hour on A6000). Use T=1 iteration first and evaluate TruthfulQA.
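The data-preparation step above can be sketched as follows. This is an illustration under stated assumptions: the prompt template, the `build_pair_prompt` and `parse_pair` helpers, and the example question are all hypothetical and not taken from the paper. The `prompt` / `chosen` / `rejected` record layout is a convention commonly used for preference data; check what your DPO training library expects.

```python
from typing import Dict, List, Tuple

FewShot = Tuple[str, str, str]  # (question, correct answer, incorrect answer)


def build_pair_prompt(few_shot: List[FewShot], question: str) -> str:
    """Few-shot prompt asking the model for a correct and an incorrect answer."""
    blocks = [
        f"Question: {q}\nCorrect answer: {good}\nIncorrect answer: {bad}"
        for q, good, bad in few_shot
    ]
    blocks.append(f"Question: {question}\nCorrect answer:")
    return "\n\n".join(blocks)


def parse_pair(question: str, completion: str) -> Dict[str, str]:
    """Split a model completion into one preference record."""
    correct, sep, incorrect = completion.partition("Incorrect answer:")
    chosen = correct.strip()
    # Keep only the first line after the marker; the model may keep writing.
    rejected = incorrect.strip().split("\n", 1)[0].strip()
    if not sep or not chosen or not rejected:
        raise ValueError("completion does not contain both answers")
    return {"prompt": question, "chosen": chosen, "rejected": rejected}


# Made-up demonstration data (the summary suggests 6 few-shot examples; one is shown).
demo_shots = [("What is the capital of France?", "Paris.", "Lyon.")]
demo_prompt = build_pair_prompt(demo_shots, "What gas do plants absorb?")
demo_completion = " Carbon dioxide.\nIncorrect answer: Oxygen.\n\nQuestion: ..."
demo_record = parse_pair("What gas do plants absorb?", demo_completion)
```

Records that fail to parse are better dropped than repaired by hand, since the point of the recipe is to avoid human annotation.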

Optimization Features

Training Optimization

LoRA, DPO objective (Direct Preference Optimization)
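For readers unfamiliar with the DPO objective, here is a minimal sketch of the standard per-pair DPO loss in plain Python. It takes sequence log-probabilities of the chosen and rejected answers under the model being trained (the policy) and under a frozen reference model. The default `beta=0.1` is an assumption for illustration, not a value taken from this summary.

```python
import math


def dpo_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers "chosen" over "rejected" than the reference does.
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to avoid overflow for large |z|.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After training, the policy should favor the chosen answer more than the reference does.
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The loss starts at log 2 when the policy matches the reference and falls as the policy widens the gap in favor of the chosen answer.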

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Risk of overfitting if GRATH is iterated many times; MC2 can become NaN after repeated DPO.

Self-generated "correct" answers are sometimes factually wrong, so the model learns a relative preference (more truthful over less truthful) rather than absolute correctness.

When Not To Use

When absolute, auditable factual correctness is required without human verification.

If you cannot run parameter-efficient fine-tuning or lack GPU resources.

Failure Modes

Overfitting to self-generated pairs causing degraded fluency or NaN MC2 scores.

The model learns the wrong relative preference if generated 'correct' answers are frequently wrong.
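One way to contain the overfitting failure mode is to evaluate after every iteration and stop as soon as the scores degrade. The guard below is a hypothetical sketch, not something the paper prescribes: it stops when either metric is NaN or when MC1 falls below its earlier best.

```python
import math
from typing import List, Tuple


def should_stop(history: List[Tuple[float, float]]) -> bool:
    """history holds (MC1, MC2) per evaluation, oldest first."""
    if not history:
        return False
    mc1, mc2 = history[-1]
    if math.isnan(mc1) or math.isnan(mc2):
        return True  # a NaN score means the latest checkpoint is unusable
    previous_mc1 = [m for m, _ in history[:-1]]
    # Stop once MC1 regresses below the best earlier value.
    return bool(previous_mc1) and mc1 < max(previous_mc1)


# Baseline followed by one GRATH iteration (numbers from Table 1): keep going.
healthy = should_stop([(30.23, 45.32), (54.71, 69.10)])
# A later iteration whose MC2 evaluates to NaN: stop and keep the earlier checkpoint.
broken = should_stop([(30.23, 45.32), (54.71, 69.10), (54.71, float("nan"))])
```

Keeping the checkpoint from the last healthy iteration makes the rollback trivial.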

Core Entities

Models

Llama2-Chat-7B, Zephyr, Llama2-Chat-13B, Llama2-Chat-70B, Xwin-LM

Metrics

Accuracy, normalized probability (MC2)
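As a reminder of what the two TruthfulQA multiple-choice metrics measure, here is a small per-question sketch. MC1 is accuracy over a single correct choice; MC2 is the normalized probability mass the model places on the set of true answers. The function names and the example log-probabilities are illustrative; benchmark scores are averages of these per-question values over the dataset.

```python
import math
from typing import List


def mc1(choice_logprobs: List[float], correct_index: int) -> float:
    """1.0 if the model's highest-scoring choice is the single correct one."""
    best = max(range(len(choice_logprobs)), key=choice_logprobs.__getitem__)
    return 1.0 if best == correct_index else 0.0


def mc2(true_logprobs: List[float], false_logprobs: List[float]) -> float:
    """Share of total probability mass assigned to the true answers."""
    p_true = sum(math.exp(lp) for lp in true_logprobs)
    p_false = sum(math.exp(lp) for lp in false_logprobs)
    total = p_true + p_false
    # Guard against 0/0 if every probability underflows to zero.
    return p_true / total if total > 0 else float("nan")


# Made-up log-probabilities for one question.
example_mc1 = mc1([-1.0, -0.5, -2.0], correct_index=1)
example_mc2 = mc2([math.log(0.3), math.log(0.1)], [math.log(0.2)])
```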

Datasets

TruthfulQA, ARC-Challenge, HellaSwag, MMLU

Benchmarks

TruthfulQA MC1, TruthfulQA MC2, ARC-Challenge, HellaSwag, MMLU
