Have models write short reading notes per retrieved doc to ignore noise and say “unknown” when needed.

November 15, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

9

Authors

Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, Dong Yu

Links

Abstract / PDF

Why It Matters For Business

CON reduces incorrect answers caused by irrelevant retrieval and helps systems safely abstain on out-of-date or unknown queries, improving reliability in search and customer-facing QA products.

Summary TLDR

Chain-of-Note (CON) asks a reader model to generate short, sequential reading notes for each retrieved document, then synthesizes those notes into the final answer. CON helps models detect irrelevant retrieved passages, reduce hallucination, and explicitly reject questions outside their knowledge. GPT-4 prompts with CON beat chain-of-thought in retrieval settings. A 10K GPT-4-created dataset was used to fine-tune LLaMa-2 7B; CON gave small overall QA gains and large robustness gains against noisy retrieval and unknown (real-time) questions. Main practical trade-off: much slower decoding unless you use their hybrid training trick.
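The note-then-answer flow described above can be sketched as a prompt builder. This is a minimal illustration only: the function name `build_con_prompt` and the instruction wording are assumptions made for this example, not the prompt used by the authors.

```python
from typing import List


def build_con_prompt(question: str, passages: List[str]) -> str:
    """Assemble a Chain-of-Note style prompt.

    The instruction text is illustrative wording, not the paper's prompt.
    """
    lines = [
        "Task: answer the question using the retrieved passages.",
        "Step 1: write one short reading note per passage, in order, saying",
        "whether the passage is relevant and what it contributes.",
        "Step 2: give a final answer based on the notes. If neither the",
        'passages nor your own knowledge support an answer, reply "unknown".',
        "",
        f"Question: {question}",
        "",
    ]
    # Number the passages so each note can refer to exactly one of them.
    for i, passage in enumerate(passages, start=1):
        lines.append(f"Passage {i}: {passage}")
    lines.append("")
    lines.append("Respond in this format:")
    for i in range(1, len(passages) + 1):
        lines.append(f"Note {i}: ...")
    lines.append("Final answer: ...")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_con_prompt(
        "Who wrote Dune?",
        ["Dune is a 1965 novel by Frank Herbert.", "Sand dunes form by wind."],
    ))
```

Numbering notes to match passages keeps the output easy to inspect: a reviewer can check whether the note for an irrelevant passage actually flags it as irrelevant.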

Problem Statement

Retrieval-augmented models can be misled by irrelevant or noisy retrieved documents and may ignore their own internal knowledge. They also lack a reliable way to abstain ('unknown') when neither parametric nor retrieved knowledge supports an answer.

Main Contribution

Introduce CHAIN-OF-NOTE (CON): generate per-document reading notes, then synthesize the final answer from those notes.

Create 10K CON training examples using GPT-4 and fine-tune LLaMa-2 7B to learn note-taking.

Show CON improves QA accuracy modestly and substantially improves robustness to noisy retrieval and unknown queries; propose hybrid training to avoid high inference cost.

Key Findings

CON improves average Exact Match over standard retrieve-then-read models when fine-tuning LLaMa-2 7B.

Numbers: EM +1.97 avg across NQ/TriviaQA/WebQ (Table 2)

CON greatly reduces the harm from fully noisy retrieved documents.

Numbers: EM +7.94 avg on fully noisy retrieval (Table 3)

CON increases the model's tendency to abstain on real-time questions, i.e. questions about events outside its training period.

Numbers: Reject rate 6.1 → 13.0 (RR +6.9) on RealTimeQA (Table 4)

Results

EM (LLaMa-2 7B, average across datasets)

Value: 50.46

Baseline: Retrieve-Read 48.49

EM (GPT-4, average)

Value: 65.7

Baseline: Retrieve-Read 63.1

EM (fully noisy retrieval, average)

Value: 47.66

Baseline: Retrieve-Read 39.72

Reject Rate (RR)

Value: 13.0

Baseline: Retrieve-Read RR 6.1

Inference time per example

Value: 12.0192 s

Baseline: Retrieve-Read 0.6104 s

Inference time per example (hybrid)

Value: 0.6074 s

Baseline: Retrieve-Read 0.6104 s
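As a quick sanity check on the figures above, the deltas and the latency ratio can be recomputed directly. The numbers are copied from the Results entries; the variable names are only for this sketch.

```python
# Figures copied from the Results entries; nothing here is newly measured.
baseline_time_s = 0.6104   # Retrieve-Read, seconds per example
con_time_s = 12.0192       # direct CON decoding
hybrid_time_s = 0.6074     # hybrid-trained model

# Latency ratios relative to the Retrieve-Read baseline.
slowdown = con_time_s / baseline_time_s
hybrid_ratio = hybrid_time_s / baseline_time_s

# Absolute gains (percentage points) over the Retrieve-Read baseline.
em_gain_llama = 50.46 - 48.49
em_gain_gpt4 = 65.7 - 63.1
em_gain_noisy = 47.66 - 39.72
rr_gain = 13.0 - 6.1

print(f"direct CON vs baseline latency: {slowdown:.1f}x")
print(f"hybrid vs baseline latency:     {hybrid_ratio:.3f}x")
print(f"EM gain (LLaMa-2 7B):           {em_gain_llama:+.2f}")
print(f"EM gain (GPT-4):                {em_gain_gpt4:+.1f}")
print(f"EM gain (fully noisy):          {em_gain_noisy:+.2f}")
print(f"Reject rate gain:               {rr_gain:+.1f}")
```

The direct-CON latency works out to a little under 20 times the baseline, while the hybrid-trained model is effectively on par with it.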

Who Should Care

What To Try In 7 Days

Prompt GPT-4 to produce per-document reading notes on a sample of your retrieval outputs and inspect whether notes flag irrelevant docs.

Fine-tune a small LLaMa-2 style model on a few hundred human-reviewed note examples to test internalized CON behavior.

Run an A/B on queries where retriever quality is poor to measure EM/F1 and abstention (RR) improvements.
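For the A/B step, a small scoring script is usually enough. The sketch below assumes predictions are plain strings and that an abstention is the literal answer "unknown"; the normalization follows the common open-domain QA convention (lowercase, strip punctuation and articles). All function names are hypothetical, and the reject rate here is simply the share of "unknown" predictions on whatever query set you pass in, which may differ from the paper's exact definition.

```python
import re
import string
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def is_rejection(prediction: str) -> bool:
    """Assumption: the model abstains by answering exactly 'unknown'."""
    return normalize(prediction) == "unknown"


def evaluate(predictions: List[str], gold: List[List[str]]) -> Dict[str, float]:
    """Return EM and reject rate as percentages over the given query set."""
    if not predictions or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and aligned")
    n = len(predictions)
    em_hits = sum(exact_match(p, g) for p, g in zip(predictions, gold))
    rejections = sum(is_rejection(p) for p in predictions)
    return {"em": 100.0 * em_hits / n, "reject_rate": 100.0 * rejections / n}


if __name__ == "__main__":
    preds = ["The Eiffel Tower.", "Unknown.", "Paris", "paris"]
    golds = [["Eiffel Tower"], ["Berlin"], ["London"], ["Paris", "City of Paris"]]
    print(evaluate(preds, golds))
```

Run it once on the baseline arm and once on the CON arm over the same queries, then compare the two dictionaries.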

Agent Features

Memory

  • retrieval memory (external docs)

Planning

  • sequential note generation per retrieved doc

Tool Use

  • DPR retriever
  • GPT-4 as teacher for data generation
  • LLaMa-2 7B fine-tuning

Frameworks

  • CHAIN-OF-NOTE

Optimization Features

Infra Optimization

  • Training uses bfloat16 and DeepSpeed on multi-GPU (A100) setups

System Optimization

  • Greedy decoding for deterministic outputs
  • Use DeepSpeed with ZeRO for fine-tuning

Training Optimization

  • Hybrid training: 50% standard QA + 50% CON to internalize notes
  • Use GPT-4 to generate 10K training examples to avoid manual labeling
  • Best learning rate reported: 5e-6
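The 50/50 hybrid mix can be sketched as a data-preparation step. This is an assumption-heavy illustration: the dictionary keys (`prompt`, `notes`, `answer`), the `Final answer:` separator, and the choice to split the examples into two disjoint halves are all hypothetical; the summary above only states the 50% standard QA + 50% CON ratio.

```python
import random
from typing import Dict, List


def build_hybrid_mix(
    examples: List[Dict[str, str]],
    con_fraction: float = 0.5,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Turn examples into a mix of CON-style and answer-only training targets.

    Each input dict is assumed to have 'prompt', 'notes', and 'answer' keys
    (hypothetical field names for this sketch).
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_con = round(len(shuffled) * con_fraction)

    mixed = []
    for i, ex in enumerate(shuffled):
        if i < n_con:
            # CON target: reading notes followed by the final answer.
            target = ex["notes"] + "\nFinal answer: " + ex["answer"]
            mode = "con"
        else:
            # Standard QA target: the answer only, with no notes.
            target = ex["answer"]
            mode = "standard"
        mixed.append({"prompt": ex["prompt"], "target": target, "mode": mode})

    # Shuffle again so the two target formats are interleaved during training.
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    data = [
        {"prompt": f"Q{i}", "notes": f"Note 1: about Q{i}", "answer": f"A{i}"}
        for i in range(10)
    ]
    for row in build_hybrid_mix(data)[:3]:
        print(row)
```

The idea being illustrated is that a model trained on both target formats can be asked for the short answer-only format at inference time, which is what keeps decoding latency close to the baseline.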

Inference Optimization

  • Hybrid-trained model yields near-identical inference time to baseline
  • Direct CON inference is ~20x slower without hybrid (Table 5)

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Direct CON decoding is much slower (~12 s vs ~0.6 s per example) and impractical without hybrid training.
  • 10K training data is synthesized by GPT-4; method quality depends on teacher prompts and may inherit biases.
  • Experiments focus on DPR + Wikipedia; gains may vary with other retrievers or corpora.

When Not To Use

  • Latency-sensitive production paths without hybrid training
  • Systems with no external retriever or where retriever is already near-perfect
  • Where you cannot validate or curate GPT-4-generated training labels

Failure Modes

  • If the notes themselves are misleading, the synthesis step can still produce hallucinations.
  • Teacher-generated labels may encode systematic errors that the fine-tuned model reproduces.
  • Hybrid training trades some robustness for speed; may underperform full CON on extreme-noise cases.

Core Entities

Models

  • LLaMa-2 7B
  • GPT-4

Metrics

  • Exact Match (EM)
  • F1
  • Accuracy
  • Reject Rate (RR)

Datasets

  • Natural Questions (NQ)
  • TriviaQA
  • WebQuestions (WebQ)
  • RealTimeQA

Benchmarks

  • open-domain QA (Wikipedia/DPR)