Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
CANOE improves context-grounded answers without human labels, lowering hallucination risk for production assistants and RAG systems while keeping costs down by tuning smaller open models.
Summary TLDR
CANOE is a post-training recipe that synthesizes 10k short, easy-to-verify QA pairs from Wikidata and uses Dual-GRPO, a rule-based reinforcement learning (RL) method, to teach models to stay faithful to context. Dual-GRPO rewards (1) short-form answer accuracy, (2) a proxy check that a generated long answer yields the correct short answer, and (3) output format. Applied to LLaMA-3 and Qwen-2.5 families, CANOE substantially reduces context hallucinations across 11 tasks and improves long-form quality without human preference labels.
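The three rule-based rewards described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation. The `<long_answer>`/`<short_answer>` output template, the unweighted sum in `total_reward`, and the `answer_from` callable (a stand-in for a model call that answers the question using only the generated long answer as context) are all assumptions made for the example.

```python
import re
from typing import Callable

# Hypothetical output template; the summary does not specify the real one.
FORMAT_RE = re.compile(
    r"^\s*<long_answer>(?P<long>.+?)</long_answer>\s*"
    r"<short_answer>(?P<short>.+?)</short_answer>\s*$",
    re.DOTALL,
)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def format_reward(output: str) -> float:
    """1.0 if the output follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.match(output) else 0.0


def accuracy_reward(output: str, gold: str) -> float:
    """1.0 if the extracted short answer matches the gold answer."""
    m = FORMAT_RE.match(output)
    if m is None:
        return 0.0
    return 1.0 if normalize(m.group("short")) == normalize(gold) else 0.0


def proxy_reward(
    output: str,
    question: str,
    gold: str,
    answer_from: Callable[[str, str], str],
) -> float:
    """1.0 if the long answer, used as the only context, yields the gold answer.

    `answer_from(question, context)` is a hypothetical model call.
    """
    m = FORMAT_RE.match(output)
    if m is None:
        return 0.0
    predicted = answer_from(question, m.group("long"))
    return 1.0 if normalize(predicted) == normalize(gold) else 0.0


def total_reward(
    output: str,
    question: str,
    gold: str,
    answer_from: Callable[[str, str], str],
) -> float:
    # Unweighted sum is an assumption; the summary gives no weighting.
    return (
        accuracy_reward(output, gold)
        + proxy_reward(output, question, gold, answer_from)
        + format_reward(output)
    )
```

Because every check is a string rule or a verifiable short answer, no human preference labels or learned reward model are needed, which matches the annotation-free framing of the method.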
Problem Statement
LLMs often ignore or contradict provided context (faithfulness hallucinations). Existing fixes need human labels, are task-specific, or fail to improve long-form outputs. We need a scalable, annotation-free post-training method that boosts faithfulness across both short and long outputs.
Main Contribution
CANOE: a post-training pipeline that trains LLMs to be context-faithful using only synthetic short-form QA data and rule-based RL
Dual-GRPO: a GRPO variant with three rule-based rewards (accuracy on short answers, proxy reward for long answers, and format reward) to jointly optimize long and short outputs
A synthetic dataset of 10k diverse short-form QA samples from Wikidata covering straightforward, multi-hop, inconsistent, and counterfactual contexts
Comprehensive evaluation on 11 short- and long-form tasks showing large faithfulness gains and improved long-form quality
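To make the synthetic-data idea concrete, here is a minimal sketch of turning a knowledge-base triple into a straightforward sample and a counterfactual one. The `Triple` and `QASample` types, the sentence templates, and the entity-swap strategy are illustrative assumptions. The paper's pipeline uses LLM synthesis over Wikidata, and the multi-hop and inconsistent variants are omitted here.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str  # a noun phrase such as "capital" (assumed form)
    obj: str


@dataclass(frozen=True)
class QASample:
    context: str
    question: str
    answer: str
    kind: str


def verbalize(t: Triple) -> str:
    """Template-based context; a real pipeline would use an LLM here."""
    return f"The {t.relation} of {t.subject} is {t.obj}."


def question_for(t: Triple) -> str:
    return f"What is the {t.relation} of {t.subject}?"


def straightforward(t: Triple) -> QASample:
    """Context states the true fact; the answer is the triple's object."""
    return QASample(verbalize(t), question_for(t), t.obj, "straightforward")


def counterfactual(
    t: Triple, candidates: Sequence[str], rng: random.Random
) -> QASample:
    """Swap the object for another entity.

    The gold answer follows the edited context, not world knowledge, so a
    model is only rewarded when it stays faithful to what it was given.
    """
    options = [c for c in candidates if c != t.obj]
    if not options:
        raise ValueError("need at least one alternative entity")
    fake = rng.choice(options)
    edited = Triple(t.subject, t.relation, fake)
    return QASample(verbalize(edited), question_for(t), fake, "counterfactual")
```

The short answer is always the object of a triple, which is what makes each sample easy to verify with a string rule.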
Key Findings
CANOE raised average EM/Acc across 11 faithfulness tasks for LLaMA-3-Instruct-8B by 22.6 percentage points.
CANOE-LLaMA-8B outperformed GPT-4o on the paper's averaged faithfulness score across the evaluated tasks.
Long-form answer quality improved after CANOE by ~15 points for LLaMA-3-Instruct-8B (QualityScore).
Using 10k synthesized short-form QA samples was sufficient; gains plateau beyond 10k.
Dual-GRPO prevents a failure mode where RL overfits short answers and breaks long-form outputs.
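For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage that standard GRPO computes from scalar rewards: each sampled response is scored against the other responses to the same prompt, so no learned value model is needed. How Dual-GRPO forms groups for short versus long outputs is not specified in this summary, so this covers only the generic normalization step. The use of population standard deviation and the `eps` value are assumptions.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses; two earned every rule-based reward, two only one.
advantages = group_relative_advantages([3.0, 1.0, 1.0, 3.0])
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason reward rules need to separate good and bad samples.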
Results
Avg EM/Acc across 11 tasks (LLaMA-3-Instruct-8B)
Avg EM/Acc across 11 tasks (Qwen-2.5-Instruct-7B)
Long-form Quality (QualityScore averaged)
Accuracy
Who Should Care
What To Try In 7 Days
Generate ~10k KB-backed short QA pairs (Wikidata triples + LLM synthesis).
Run Dual-GRPO on your 7–8B instruction model with accuracy, proxy, and format rewards.
Measure faithfulness with MiniCheck and a small human spot check on long-form outputs.
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Synthetic data covers many common relations but may miss domain-specific knowledge; domain data still needed for niche verticals.
- Proxy reward assumes the long answer contains a recoverable short answer; this can fail if the long answer is ambiguous.
- Evaluation relies on MiniCheck and GPT-4o judges; automated judges have blind spots and may bias reported gains.
When Not To Use
- When closed-book factual accuracy (parametric knowledge) is the sole goal: CANOE focuses on context grounding, not memorized facts.
- For deployments with tight latency or compute budgets, where the cost of RL fine-tuning or any added inference cost is unaffordable.
- If you cannot produce or verify short-form ground-truth answers from a knowledge base.
Failure Modes
- Reward hacking: model could optimize format/short answer matching and still produce misleading long answers if proxy check is weak.
- Overfitting to synthetic patterns: synthesized contexts may leave distribution gaps to real user data.
- Judge bias: gains measured by MiniCheck/GPT-4o may not fully reflect human trust in all domains.
Core Entities
Models
- LLaMA-3-Instruct
- Qwen-2.5-Instruct
- GPT-4o
- GPT-4o-mini
- OpenAI o1
- Claude 3.7 Sonnet
- DeepSeek R1
- DeepSeek V3
Metrics
- Exact Match (EM)
- Accuracy
- FaithScore (MiniCheck)
- QualityScore (GPT-4o judge)
- Perplexity (overconfidence analysis)
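As a reference point for the Exact Match metric, here is a minimal sketch using the common SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The exact normalization used in the paper's evaluation is not stated in this summary, so treat these rules as an assumption.

```python
import re
import string
from typing import Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golds: Sequence[str]) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(g) for g in golds)


def em_score(
    predictions: Sequence[str], golds_per_item: Sequence[Sequence[str]]
) -> float:
    """Percentage of items whose prediction exactly matches a gold answer."""
    if not predictions or len(predictions) != len(golds_per_item):
        raise ValueError("predictions and golds must be non-empty and aligned")
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds_per_item))
    return 100.0 * hits / len(predictions)
```

A change in this score between a base model and its tuned version is reported in percentage points, which is the unit used in the key findings.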
Datasets
- ConFiQA
- CNQ
- FaithEval
- FiQA
- FollowRAG (NaturalQA, TriviaQA, HotpotQA, WebQSP)
- XSum
- WikiLarge
- CLAPNQ
- MultiFieldQA-zh
- DuReader
- VCSUM
Benchmarks
- ConFiQA
- FaithEval
- FollowRAG
- CLAPNQ
- XSum

