Overview
The paper reports consistent TruthfulQA gains across multiple 7B models using public datasets and LoRA + DPO. Experiments include ablations on the domain gap and on the number of iterations, and the authors warn of overfitting and MC2 instability when the procedure is iterated too many times.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
You can substantially reduce factual errors for deployed 7B models without costly human annotation by generating paired data and fine-tuning with DPO; this is fast and parameter-efficient with LoRA.
Who Should Care
Summary TLDR
GRATH is a lightweight post-processing recipe that makes a pretrained LLM more truthful without human annotations. It prompts the model to create pairs of plausible correct/incorrect answers to out-of-domain questions, then fine-tunes via Direct Preference Optimization (DPO) using LoRA. A single iteration (generate pairs, then run one DPO update) already yields large TruthfulQA gains for 7B models (e.g., Llama2-Chat-7B MC1 30.23% → 54.71%). The method causes small changes on other benchmarks, can overfit if iterated too many times, and works best when the generated answers match the test domain.
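For reference, the DPO step can be written with the standard DPO objective. This is the general form from the DPO literature, not a formula quoted from the GRATH paper. Here x is a question, y_w the generated correct answer, y_l the generated incorrect answer, π_ref the frozen starting model, and β a scaling hyperparameter whose value is not given in this summary.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of the correct answer relative to the incorrect one, measured against the reference model, which is why the learned signal is a relative preference.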
Problem Statement
LLMs still produce factually incorrect answers (hallucinations). Annotating large-scale question-answer truth data is costly. Can we use out-of-domain questions and self-generated paired answers to improve truthfulness without human labels?
Main Contribution
Propose GRATH: use model-generated correct/incorrect answer pairs + DPO to improve truthfulness without human labels.
Introduce gradual self-truthifying: iteratively refine generated correct answers and re-run DPO to boost gains.
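The two contributions above can be sketched as a short control loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate_pairs`, `refine_correct`, and `dpo_update` are hypothetical callables you would supply, and the sketch assumes each DPO run continues from the current model and that incorrect answers are kept as they are between iterations.

```python
from typing import Callable, List, Tuple

# (question, correct_answer, incorrect_answer)
Pair = Tuple[str, str, str]


def gradual_self_truthify(
    model,
    questions: List[str],
    generate_pairs: Callable[[object, List[str]], List[Pair]],  # hypothetical
    refine_correct: Callable[[object, List[Pair]], List[Pair]],  # hypothetical
    dpo_update: Callable[[object, List[Pair]], object],  # hypothetical
    iterations: int = 1,
):
    """Generate answer pairs once, then alternate DPO updates and refinement."""
    # Step 1: the model writes a correct/incorrect pair for each question.
    pairs = generate_pairs(model, questions)
    for t in range(iterations):
        if t > 0:
            # Later iterations: the current model rewrites the correct answers.
            pairs = refine_correct(model, pairs)
        # Step 2: one DPO run on the current pairs.
        model = dpo_update(model, pairs)
    return model, pairs
```

With `iterations=1` this reduces to the single generate-then-DPO pass described in the summary; larger values add the refinement step before each extra DPO run.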
Key Findings
GRATH lifts Llama2-Chat-7B on TruthfulQA: MC1 30.23% → 54.71% (+24.48pp); MC2 45.32% → 69.10% (+23.78pp).
GRATH improves Zephyr (7B) as well: MC1 42.23% → 53.86% (+11.63pp); MC2 57.83% → 66.73% (+8.90pp).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| TruthfulQA MC1 (Llama2-Chat-7B) | 54.71% | 30.23% (pretrained) | +24.48pp | TruthfulQA MC1 | GRATH Llama2 in Table 1 | Table 1 |
| TruthfulQA MC2 (Llama2-Chat-7B) | 69.10% | 45.32% (pretrained) | +23.78pp | TruthfulQA MC2 | GRATH Llama2 in Table 1 | Table 1 |
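The deltas are plain percentage-point differences (value minus baseline), which can be checked with a few lines of Python. The Llama2-Chat-7B figures come from the table above and the Zephyr figures from Key Findings; the helper name `delta_pp` is only illustrative.

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Percentage-point difference, rounded to two decimals."""
    return round(value_pct - baseline_pct, 2)


# (GRATH score, pretrained baseline), both in percent, as reported above.
reported = {
    "Llama2-Chat-7B MC1": (54.71, 30.23),
    "Llama2-Chat-7B MC2": (69.10, 45.32),
    "Zephyr MC1": (53.86, 42.23),
    "Zephyr MC2": (66.73, 57.83),
}

for name, (value, baseline) in reported.items():
    print(f"{name}: {delta_pp(value, baseline):+.2f}pp")
```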
What To Try In 7 Days
Pick a 7B chat model and a diverse question set (e.g., ARC-Challenge).
Prompt the model with 6 in-domain few-shot examples to produce correct/incorrect pairs.
Fine-tune with DPO + LoRA for ~1000 steps (one DPO run ≈ 1 hour on A6000). Use T=1 iteration first and evaluate TruthfulQA.
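The data-preparation part of these steps might look like the sketch below. It is an illustration under assumptions, not the paper's code: the prompt wording is hypothetical, the `prompt`/`chosen`/`rejected` field names are assumed here because they follow a convention common in preference-tuning libraries, and dropping degenerate pairs is an added safeguard that the summary does not mention.

```python
from typing import Dict, List, Tuple

# (question, correct_answer, incorrect_answer)
Pair = Tuple[str, str, str]


def build_fewshot_prompt(demos: List[Pair], question: str) -> str:
    """Format few-shot demonstrations followed by the new question."""
    parts = []
    for q, good, bad in demos:
        parts.append(
            f"Question: {q}\nCorrect answer: {good}\nIncorrect answer: {bad}\n"
        )
    # The model is expected to continue with a correct, then an incorrect answer.
    parts.append(f"Question: {question}\nCorrect answer:")
    return "\n".join(parts)


def to_preference_records(pairs: List[Pair]) -> List[Dict[str, str]]:
    """Convert generated pairs into prompt/chosen/rejected records for DPO."""
    records = []
    for question, correct, incorrect in pairs:
        correct, incorrect = correct.strip(), incorrect.strip()
        # Assumption: skip empty or identical answers, which carry no preference.
        if not correct or not incorrect or correct == incorrect:
            continue
        records.append(
            {
                "prompt": f"Question: {question}\nAnswer:",
                "chosen": " " + correct,
                "rejected": " " + incorrect,
            }
        )
    return records
```

The resulting list of records can then be passed to whichever DPO + LoRA training setup you use; the training call itself is left out because its exact interface depends on the library and version.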
Optimization Features
Training Optimization
Risks & Boundaries
Limitations
Risk of overfitting if GRATH is iterated many times; MC2 can become NaN after repeated DPO.
Self-generated "correct" answers are sometimes factually wrong, so the model learns a relative preference (more truthful over less truthful) rather than absolute correctness.
When Not To Use
When absolute, auditable factual correctness is required without human verification.
If you cannot run parameter-efficient fine-tuning or lack GPU resources.
Failure Modes
Overfitting to self-generated pairs causing degraded fluency or NaN MC2 scores.
The model learns the wrong preferences if the generated 'correct' answers are frequently wrong.
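One way to guard against the failure modes above is to evaluate after every iteration and stop as soon as MC2 turns NaN or stops improving. The sketch below is a hypothetical safeguard, not something the paper prescribes; the function name and the `patience` parameter are assumptions.

```python
import math
from typing import List


def should_stop(mc2_history: List[float], patience: int = 1) -> bool:
    """Decide whether to stop iterating, given MC2 after each evaluation.

    Stops when the latest score is NaN, or when the best score so far is
    at least `patience` evaluations old.
    """
    if not mc2_history:
        return False
    if math.isnan(mc2_history[-1]):
        return True  # Unstable evaluation: keep the previous checkpoint.
    finite = [(score, i) for i, score in enumerate(mc2_history)
              if not math.isnan(score)]
    # Index of the best finite score (earliest one on ties).
    best_index = max(finite, key=lambda item: (item[0], -item[1]))[1]
    return (len(mc2_history) - 1) - best_index >= patience
```

Call it after each evaluation with the scores collected so far, baseline first, and keep the checkpoint with the best finite score when it returns `True`.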

