Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
You can substantially reduce factual errors for deployed 7B models without costly human annotation by generating paired data and fine-tuning with DPO; this is fast and parameter-efficient with LoRA.
Summary TLDR
GRATH is a lightweight post-processing recipe that makes a pretrained LLM more truthful without human annotations. It prompts the model to create pairs of plausible correct/incorrect answers to out-of-domain questions, then fine-tunes via Direct Preference Optimization (DPO) using LoRA. One iteration (generate pairs, then run one DPO update) already yields large TruthfulQA gains for 7B models (e.g., Llama2-Chat-7B MC1 30.23% → 54.71%). The method causes only small shifts on other benchmarks, can overfit if iterated too long, and works best when the generated answers match the test domain.
Problem Statement
LLMs still produce factually incorrect answers (hallucinations). Annotating large-scale question-answer truth data is costly. Can we use out-of-domain questions and self-generated paired answers to improve truthfulness without human labels?
Main Contribution
Propose GRATH: use model-generated correct/incorrect answer pairs + DPO to improve truthfulness without human labels.
Introduce gradual self-truthifying: iteratively refine generated correct answers and re-run DPO to boost gains.
Empirically show large truthfulness gains on TruthfulQA (MC1/MC2) for multiple 7B models while largely preserving core capabilities.
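The contributions above describe a loop: generate pairs, run DPO, refine the correct answers, and repeat. A minimal sketch of that control flow follows, assuming the three steps are supplied as callables. The names `generate_pair`, `refine_correct`, and `dpo_update` are hypothetical placeholders, not functions from the paper, and keeping the incorrect answers fixed across iterations is an assumption of this sketch.

```python
from typing import Callable, Dict, List

# One preference pair: the question plus a preferred and a dispreferred answer.
Pair = Dict[str, str]  # keys: "prompt", "chosen", "rejected"


def gradual_self_truthify(
    model,
    questions: List[str],
    generate_pair: Callable[[object, str], Pair],        # hypothetical: model writes both answers
    refine_correct: Callable[[object, str], str],        # hypothetical: model rewrites the correct answer
    dpo_update: Callable[[object, List[Pair]], object],  # hypothetical: one DPO + LoRA run
    iterations: int = 1,
):
    """Outline only: generate pairs once, then alternate DPO and refinement."""
    pairs = [generate_pair(model, q) for q in questions]
    for t in range(iterations):
        model = dpo_update(model, pairs)
        if t + 1 < iterations:
            # Assumption: only the "chosen" side is refreshed by the updated model;
            # the "rejected" side is carried over unchanged.
            pairs = [
                {**p, "chosen": refine_correct(model, p["prompt"])}
                for p in pairs
            ]
    return model, pairs
```

With `iterations=1` this reduces to the single generate-then-DPO pass that the summary reports as already effective. Larger values should be paired with an evaluation after each round, given the overfitting risk noted under Limitations.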
Key Findings
GRATH lifts Llama2-Chat-7B on TruthfulQA: MC1 30.23% → 54.71% (+24.48pp); MC2 45.32% → 69.10% (+23.78pp).
GRATH improves Zephyr (7B) as well: MC1 42.23% → 53.86% (+11.63pp); MC2 57.83% → 66.73% (+8.90pp).
GRATH largely preserves core capabilities for Llama2-Chat-7B: ARC +5.03pp, HellaSwag +1.13pp, MMLU −1.26pp.
DPO's effectiveness falls when the domain gap grows; better few-shot demonstrations (in-domain style) produce better generated pairs and higher truthfulness.
A larger distributional (pairwise) distance between the correct and incorrect answers correlates with higher truthfulness gains.
Results
TruthfulQA MC1 (Llama2-Chat-7B): 30.23% → 54.71%
TruthfulQA MC2 (Llama2-Chat-7B): 45.32% → 69.10%
TruthfulQA MC1 (Zephyr): 42.23% → 53.86%
Core benchmarks (ARC / HellaSwag / MMLU), Llama2-Chat-7B: +5.03pp / +1.13pp / −1.26pp
Who Should Care
What To Try In 7 Days
Pick a 7B chat model and a diverse question set (e.g., ARC-Challenge).
Prompt the model with 6 in-domain few-shot examples to produce correct/incorrect pairs.
Fine-tune with DPO + LoRA for ~1000 steps (one DPO run takes about 1 hour on an A6000). Start with a single iteration (T=1) and evaluate on TruthfulQA.
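The pair-generation step in the list above can be sketched as a prompt builder plus a parser. This is an illustrative sketch only: the prompt wording, the `Correct answer:` / `Incorrect answer:` labels, and both function names are assumptions rather than the paper's template. The output uses the prompt/chosen/rejected layout commonly used for preference data, for example by Hugging Face TRL's `DPOTrainer`.

```python
from typing import Dict, List, Tuple


def build_pair_prompt(demos: List[Tuple[str, str, str]], question: str) -> str:
    """Format (question, correct, incorrect) demos, then leave the new question open."""
    blocks = [
        f"Question: {q}\nCorrect answer: {good}\nIncorrect answer: {bad}"
        for q, good, bad in demos
    ]
    blocks.append(f"Question: {question}\nCorrect answer:")
    return "\n\n".join(blocks)


def parse_pair(question: str, completion: str) -> Dict[str, str]:
    """Split a completion shaped like '<correct>\\nIncorrect answer: <incorrect>'."""
    correct, sep, incorrect = completion.partition("Incorrect answer:")
    if not sep or not correct.strip() or not incorrect.strip():
        raise ValueError("completion did not contain both answers")
    return {
        "prompt": question,
        "chosen": correct.strip(),
        # Keep only the first line so a run-on generation does not leak in.
        "rejected": incorrect.strip().splitlines()[0].strip(),
    }
```

Completions that fail to parse are probably better dropped than repaired, since a malformed pair would feed noise into the DPO step.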
Optimization Features
Training Optimization
- LoRA
- DPO objective (direct preference optimization)
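For reference, the standard DPO loss for a single preference pair can be written in a few lines. This is a sketch of the generic objective, not the paper's training code. The inputs are summed log-probabilities of each answer under the trainable policy and under a frozen reference model, and `beta=0.1` is an assumed default, not a value taken from the source.

```python
import math


def dpo_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,  # assumed value; tune for your setup
) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    margin = beta * (
        (policy_chosen_lp - ref_chosen_lp)
        - (policy_rejected_lp - ref_rejected_lp)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference the margin is zero and the loss is log 2. The loss falls as the policy shifts probability toward the chosen (correct) answer relative to the rejected one.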
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Risk of overfitting if GRATH is iterated many times; MC2 can become NaN after repeated DPO.
- Self-generated "correct" answers are sometimes factually wrong, so the model learns relative rather than absolute truthfulness.
- Effectiveness depends on match between few-shot demonstrations and target domain (domain gap).
- No public code release reported in paper, which may slow reproduction.
When Not To Use
- When absolute, auditable factual correctness is required without human verification.
- If you cannot run parameter-efficient fine-tuning or lack GPU resources.
- If you cannot tolerate any potential small degradations on other benchmarks.
Failure Modes
- Overfitting to self-generated pairs causing degraded fluency or NaN MC2 scores.
- Model learning relative differences incorrectly if generated 'correct' answers are frequently wrong.
- Domain mismatch between demonstrations and target queries reduces gains.
Core Entities
Models
- Llama2-Chat-7B
- Zephyr
- Llama2-Chat-13B
- Llama2-Chat-70B
- Xwin-LM
Metrics
- Accuracy
- normalized probability (MC2)
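The two metrics above can be computed from per-choice log-probabilities. The sketch below follows the usual TruthfulQA multiple-choice definitions: MC1 checks whether the single best answer receives the highest score, and MC2 is the share of probability mass assigned to the true answers. The function names are illustrative. The NaN guard covers the degenerate case where every probability underflows to zero; the source does not state why MC2 becomes NaN after repeated DPO.

```python
import math
from typing import List


def mc1_correct(logprobs: List[float], best_index: int) -> bool:
    """True if the highest-scoring choice is the single best answer."""
    top = max(range(len(logprobs)), key=logprobs.__getitem__)
    return top == best_index


def mc2_score(true_logprobs: List[float], false_logprobs: List[float]) -> float:
    """Probability mass on true answers, normalized over all answer choices."""
    p_true = sum(math.exp(lp) for lp in true_logprobs)
    p_false = sum(math.exp(lp) for lp in false_logprobs)
    total = p_true + p_false
    # If every probability underflows to 0.0, the ratio is undefined.
    return p_true / total if total > 0.0 else float("nan")
```

Averaging `mc1_correct` over questions gives MC1 accuracy, and averaging `mc2_score` gives MC2. A single NaN propagates through a plain mean, so checking for NaN per question helps when diagnosing a run.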
Datasets
- TruthfulQA
- ARC-Challenge
- HellaSwag
- MMLU
Benchmarks
- TruthfulQA MC1
- TruthfulQA MC2
- ARC-Challenge
- HellaSwag
- MMLU

