CANOE: use synthetic short QA + rule-based RL to cut hallucinations and improve long-form faithfulness

May 22, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun

Why It Matters For Business

CANOE improves context-grounded answers without human labels, lowering hallucination risk for production assistants and RAG systems while keeping costs down by tuning smaller open models.

Summary TLDR

CANOE is a post-training recipe that synthesizes 10k short, easy-to-verify QA pairs from Wikidata and uses Dual-GRPO, a rule-based reinforcement learning (RL) method, to teach models to stay faithful to context. Dual-GRPO rewards (1) short-form answer accuracy, (2) a proxy check that a generated long answer yields the correct short answer, and (3) output format. Applied to LLaMA-3 and Qwen-2.5 families, CANOE substantially reduces context hallucinations across 11 tasks and improves long-form quality without human preference labels.
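The three rule-based rewards can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the `<long_answer>`/`<answer>` tag names, the SQuAD-style normalization, the equal-weight sum in `total_reward`, and the `answer_fn` callable (a stand-in for a model that answers the question using only the generated long answer as context) are all assumptions introduced here.

```python
import re
import string
from typing import Callable

# Hypothetical output template; the tag names are an assumption for illustration.
TEMPLATE = re.compile(
    r"^\s*<long_answer>(?P<long>.+?)</long_answer>\s*"
    r"<answer>(?P<short>.+?)</answer>\s*$",
    re.DOTALL,
)

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def format_reward(output: str) -> float:
    """1.0 if the output follows the (assumed) template, else 0.0."""
    return 1.0 if TEMPLATE.match(output) else 0.0

def accuracy_reward(output: str, gold: str) -> float:
    """1.0 if the extracted short answer exactly matches the gold answer."""
    m = TEMPLATE.match(output)
    return 1.0 if m and exact_match(m["short"], gold) else 0.0

def proxy_reward(output: str, question: str, gold: str,
                 answer_fn: Callable[[str, str], str]) -> float:
    """1.0 if the long answer alone lets answer_fn recover the gold answer."""
    m = TEMPLATE.match(output)
    if not m:
        return 0.0
    pred = answer_fn(question, m["long"])  # long answer replaces the context
    return 1.0 if exact_match(pred, gold) else 0.0

def total_reward(output: str, question: str, gold: str,
                 answer_fn: Callable[[str, str], str]) -> float:
    # Equal weighting is an assumption; the source does not give the weights.
    return (format_reward(output)
            + accuracy_reward(output, gold)
            + proxy_reward(output, question, gold, answer_fn))
```

Because every check is a string rule over a known short answer, no human preference labels or learned reward model are needed, which is the property the summary above highlights.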

Problem Statement

LLMs often ignore or contradict provided context (faithfulness hallucinations). Existing fixes either need human labels, are task-specific, or fail to improve long-form outputs. We need a scalable, annotation-free post-training method that boosts faithfulness across short and long outputs.

Main Contribution

CANOE: a post-training pipeline that trains LLMs to be context-faithful using only synthetic short-form QA data and rule-based RL

Dual-GRPO: a GRPO variant with three rule-based rewards (accuracy on short answers, proxy reward for long answers, and format reward) to jointly optimize long and short outputs

A synthetic dataset of 10k diverse short-form QA samples from Wikidata covering straightforward, multi-hop, inconsistent, and counterfactual contexts

Comprehensive evaluation on 11 short- and long-form tasks showing large faithfulness gains and improved long-form quality
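For readers unfamiliar with GRPO, the step that Dual-GRPO builds on is the group-relative advantage: several outputs are sampled for the same prompt, each is scored, and every score is normalized against its own group. The sketch below shows only that standard step; how Dual-GRPO combines its three rewards before this step is not spelled out here, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each sampled output's reward against its own group.

    Outputs that score above the group mean get a positive advantage,
    those below get a negative one. If all rewards are equal, every
    advantage is 0.0 and the group contributes no learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # Four sampled outputs for one prompt, scored by rule-based rewards.
    print(group_relative_advantages([3.0, 1.0, 1.0, 3.0]))
```

No separate value model is needed for this normalization, which keeps the RL stage comparatively light for 7–8B models.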

Key Findings

CANOE raised average EM/Acc across 11 faithfulness tasks for LLaMA-3-Instruct-8B by +22.6 percentage points.

Numbers: Avg EM/Acc +22.6 points (LLaMA-3-Instruct-8B), Table 1

CANOE-LLaMA-8B outperformed GPT-4o on the paper's averaged faithfulness score across the evaluated tasks.

Numbers: CANOE-LLaMA-8B Avg ≈70.3% vs GPT-4o Avg ≈58.8%, Table 1

Long-form answer quality improved after CANOE by ~15 points for LLaMA-3-Instruct-8B (QualityScore).

Numbers: QualityScore: LLaMA-8B 64.3 → CANOE 79.7 (∆ +15.4), Table 2

Using 10k synthesized short-form QA samples was sufficient; gains plateau beyond 10k.

Numbers: Performance stabilizes once training data exceeds 10k samples, Figure 6

Dual-GRPO prevents a failure mode where RL overfits short answers and breaks long-form outputs.

Numbers: GRPO-only training led to invalid long-form outputs and a low QualityScore (ablation, Table 4)

Results

Avg EM/Acc across 11 tasks (LLaMA-3-Instruct-8B)

Value: 70.3% (CANOE) vs 47.7% (vanilla)

Baseline: LLaMA-3-Instruct-8B vanilla

Avg EM/Acc across 11 tasks (Qwen-2.5-Instruct-7B)

Value: 68.0% (CANOE) vs 49.0% (vanilla)

Baseline: Qwen-2.5-Instruct-7B vanilla

Long-form Quality (QualityScore averaged)

Value: 79.7 (CANOE-LLaMA-8B) vs 64.3 (vanilla LLaMA-8B)

Baseline: LLaMA-3-Instruct-8B vanilla

Accuracy

Value: Accuracy 70.3%, Proxy 66.1%, Format ~99.4%
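The reported gains are differences in percentage points, not relative percentages. A small sketch using only the figures quoted above makes the distinction explicit; the helper name `gain` is introduced here for illustration.

```python
from typing import Tuple

def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    points = after - before
    relative = 100.0 * points / before
    return round(points, 1), round(relative, 1)

# Figures quoted in the Results section above.
results = {
    "LLaMA-3-Instruct-8B avg EM/Acc": (47.7, 70.3),
    "Qwen-2.5-Instruct-7B avg EM/Acc": (49.0, 68.0),
    "LLaMA-3-Instruct-8B QualityScore": (64.3, 79.7),
}

if __name__ == "__main__":
    for name, (before, after) in results.items():
        points, relative = gain(before, after)
        print(f"{name}: +{points} points ({relative}% relative)")
```

For LLaMA-3-Instruct-8B, for example, the +22.6-point gain corresponds to roughly a 47% relative increase over the vanilla score.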

Who Should Care

What To Try In 7 Days

Generate ~10k KB-backed short QA pairs (Wikidata triples + LLM synthesis).

Run Dual-GRPO on your 7–8B instruction model with accuracy, proxy, and format rewards.

Measure faithfulness with MiniCheck and a small human spot check on long-form outputs.
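The first step above could start from something like the following sketch, which turns knowledge-base triples into straightforward and counterfactual short-form QA samples. It is a simplified stand-in: contexts come from fixed templates rather than LLM synthesis, the `Triple` and `QASample` types and the `TEMPLATES` table are hypothetical, and treating the swapped-in entity as the gold answer for counterfactual samples is an assumption about how context-faithful targets are defined.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str

@dataclass(frozen=True)
class QASample:
    kind: str       # "straightforward" or "counterfactual"
    context: str
    question: str
    answer: str     # short answer that a rule can verify

# Hypothetical templates; a real pipeline would have an LLM write the contexts.
TEMPLATES = {
    "capital": ("The capital of {head} is {tail}.",
                "What is the capital of {head}?"),
}

def straightforward(t: Triple) -> QASample:
    ctx, q = TEMPLATES[t.relation]
    return QASample("straightforward",
                    ctx.format(head=t.head, tail=t.tail),
                    q.format(head=t.head), t.tail)

def counterfactual(t: Triple, pool: List[Triple], rng: random.Random) -> QASample:
    """Swap in a different tail of the same relation; the context wins."""
    candidates = sorted({p.tail for p in pool
                         if p.relation == t.relation and p.tail != t.tail})
    if not candidates:
        raise ValueError("no alternative tail available for this relation")
    fake = rng.choice(candidates)
    ctx, q = TEMPLATES[t.relation]
    return QASample("counterfactual",
                    ctx.format(head=t.head, tail=fake),
                    q.format(head=t.head), fake)

if __name__ == "__main__":
    triples = [Triple("France", "capital", "Paris"),
               Triple("Japan", "capital", "Tokyo")]
    rng = random.Random(0)
    for t in triples:
        print(straightforward(t))
        print(counterfactual(t, triples, rng))
```

Counterfactual samples are useful because a model that answers from memory rather than from the context will fail the exact-match check.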

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Synthetic data covers many common relations but may miss domain-specific knowledge; domain data still needed for niche verticals.
  • Proxy reward assumes the long answer contains a recoverable short answer; this can fail if the long answer is ambiguous.
  • Evaluation relies on MiniCheck and GPT-4o judges; automated judges have blind spots and may bias reported gains.

When Not To Use

  • When closed-book factual accuracy (parametric knowledge) is the sole goal: CANOE targets context grounding, not memorized facts.
  • For extremely low-latency deployments where RL fine-tuning and inference-cost increases are unaffordable.
  • If you cannot produce or verify short-form ground-truth answers from a knowledge base.

Failure Modes

  • Reward hacking: the model could optimize format and short-answer matching yet still produce misleading long answers if the proxy check is weak.
  • Overfitting to synthetic patterns: synthesized contexts may leave distribution gaps to real user data.
  • Judge bias: gains measured by MiniCheck/GPT-4o may not fully reflect human trust in all domains.

Core Entities

Models

  • LLaMA-3-Instruct
  • Qwen-2.5-Instruct
  • GPT-4o
  • GPT-4o-mini
  • OpenAI o1
  • Claude 3.7 Sonnet
  • DeepSeek R1
  • DeepSeek V3

Metrics

  • Exact Match (EM)
  • Accuracy
  • FaithScore (MiniCheck)
  • QualityScore (GPT-4o judge)
  • Perplexity (overconfidence analysis)

Datasets

  • ConFiQA
  • CNQ
  • FaithEval
  • FiQA
  • FollowRAG (NaturalQA, TriviaQA, HotpotQA, WebQSP)
  • XSum
  • WikiLarge
  • CLAPNQ
  • MultiFieldQA-zh
  • DuReader
  • VCSUM

Benchmarks

  • ConFiQA
  • FaithEval
  • FollowRAG
  • CLAPNQ
  • XSum