Have models write short reading notes per retrieved doc to ignore noise and say “unknown” when needed.

November 15, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

CON is a practical note-taking layer that improves noise handling and abstention. Results are consistent across several datasets but rely on GPT-4-generated labels and a DPR/Wikipedia retrieval setup.

Citations: 9

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, Dong Yu

Why It Matters For Business

CON reduces incorrect answers caused by irrelevant retrieval and helps systems safely abstain on out-of-date or unknown queries, improving reliability in search and customer-facing QA products.

Summary TLDR

Chain-of-Note (CON) asks a reader model to generate short, sequential reading notes for each retrieved document, then synthesizes those notes into the final answer. CON helps models detect irrelevant retrieved passages, reduce hallucination, and explicitly reject questions outside their knowledge. In retrieval settings, GPT-4 prompted with CON beats chain-of-thought prompting. A 10K-example dataset generated by GPT-4 was used to fine-tune LLaMa-2 7B; CON gave small overall QA gains and large robustness gains against noisy retrieval and unknown (real-time) questions. Main practical trade-off: much slower decoding unless you use the authors' hybrid training strategy.
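
A minimal sketch of how a CON-style prompt could be assembled for a prompted reader, assuming a plain-text prompt format. The function name `build_con_prompt`, the instruction wording, and the output template are illustrative assumptions, not the paper's actual prompt.

```python
from typing import List


def build_con_prompt(question: str, docs: List[str]) -> str:
    """Assemble a Chain-of-Note style prompt (illustrative wording only)."""
    lines = [
        "Read each retrieved document and write a short note on whether "
        "it helps answer the question. Then give a final answer.",
        'If neither the documents nor your own knowledge supports an answer, reply "unknown".',
        "",
        f"Question: {question}",
        "",
    ]
    # Number the documents so each note can refer back to its source.
    for i, doc in enumerate(docs, start=1):
        lines.append(f"Document {i}: {doc}")
    lines.append("")
    lines.append("Respond in this format:")
    # One note slot per document, in retrieval order, before the answer.
    for i in range(1, len(docs) + 1):
        lines.append(f"Note {i}:")
    lines.append("Final answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_con_prompt(
        "Who wrote Hamlet?",
        ["Hamlet is a tragedy by William Shakespeare.",
         "Paris is the capital of France."],
    ))
```

Keeping one note slot per document makes it easy to check by eye whether the model flags the irrelevant passage before it commits to an answer.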

Problem Statement

Retrieval-augmented models can be misled by irrelevant or noisy retrieved documents and may ignore their own internal knowledge. They also lack a reliable way to abstain ('unknown') when neither parametric nor retrieved knowledge supports an answer.

Main Contribution

Introduce CHAIN-OF-NOTE (CON): generate per-document reading notes, then synthesize the answer from those notes.

Create 10K CON training examples using GPT-4 and fine-tune LLaMa-2 7B to learn note-taking.

Key Findings

CON improves average Exact Match over standard retrieve-then-read models when fine-tuning LLaMa-2 7B.

Numbers: EM +1.97 avg across NQ/TriviaQA/WebQ (Table 2)

Practical Use: If you fine-tune a mid-size RALM, adding CON-style outputs can nudge overall QA accuracy up by ~2 EM points on common open-domain datasets.

Evidence Ref: Table 2

CON greatly reduces the harm from fully noisy retrieved documents.

Numbers: EM +7.94 avg on fully noisy retrieval (Table 3)

Practical Use: When your retriever returns mostly irrelevant documents, switching the reader to generate and use per-document notes can recover ~8 EM points versus a standard reader.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
EM (LLaMa-2 7B, average across datasets) | 50.46 | Retrieve-Read 48.49 | +1.97 | NQ/TriviaQA/WebQ full test (Table 2) | CON vs Retrieve-Read on LLaMa-2 7B | Table 2
EM (GPT-4, average) | 65.7 | Retrieve-Read 63.1 | +2.6 | NQ/TriviaQA/WebQ full test (Table 2) | Zero-shot GPT-4 prompts with CON | Table 2
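
The reported deltas can be checked against the value and baseline columns with a few lines of arithmetic. The numbers below are copied from the two Results rows; the dictionary layout and the helper name `em_delta` are assumptions made for illustration.

```python
# EM averages copied from the Results rows above (Table 2).
results = {
    "LLaMa-2 7B": {"con": 50.46, "retrieve_read": 48.49},
    "GPT-4": {"con": 65.7, "retrieve_read": 63.1},
}


def em_delta(con: float, baseline: float) -> float:
    """Absolute EM gain of CON over the Retrieve-Read baseline, in points."""
    return round(con - baseline, 2)


deltas = {
    model: em_delta(row["con"], row["retrieve_read"])
    for model, row in results.items()
}

if __name__ == "__main__":
    for model, delta in deltas.items():
        print(f"{model}: +{delta} EM")
```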

What To Try In 7 Days

Prompt GPT-4 to produce per-document reading notes on a sample of your retrieval outputs and inspect whether notes flag irrelevant docs.

Fine-tune a small LLaMa-2 style model on a few hundred human-reviewed note examples to test internalized CON behavior.

Run an A/B on queries where retriever quality is poor to measure EM/F1 and abstention (RR) improvements.
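
For the A/B comparison, the three metrics can be scored with a small helper module. This is a sketch under stated assumptions: the normalization follows the common SQuAD-style convention (lowercase, strip punctuation and articles), which may differ from the paper's evaluation script, and `reject_rate` assumes an abstention is emitted as the literal string "unknown".

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, golds: List[str]) -> float:
    """1.0 if the prediction matches any gold answer after normalization."""
    return float(any(normalize(pred) == normalize(g) for g in golds))


def f1(pred: str, gold: str) -> float:
    """Token-level F1 between one prediction and one gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def reject_rate(preds: List[str], marker: str = "unknown") -> float:
    """Share of predictions that abstain (assumed marker: 'unknown')."""
    if not preds:
        return 0.0
    return sum(normalize(p) == marker for p in preds) / len(preds)
```

Scoring both arms on the same poor-retrieval query slice, and scoring RR only on questions the system should not be able to answer, keeps the comparison focused on the robustness claims.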

Agent Features

Memory
retrieval memory (external docs)
Planning
sequential note generation per retrieved doc
Tool Use
DPR retriever; GPT-4 as teacher for data generation; LLaMa-2 7B fine-tuning
Frameworks
CHAIN-OF-NOTE

Optimization Features

Infra Optimization
Training uses bfloat16 and DeepSpeed on multi-GPU (A100) setups
System Optimization
Greedy decoding for deterministic outputs; Use DeepSpeed and ZeRO for fine-tuning
Training Optimization
Hybrid training: 50% standard QA + 50% CON to internalize notes; Use GPT-4 to generate 10K training examples to avoid manual labeling; Best learning rate reported: 5e-6
Inference Optimization
Hybrid-trained model yields near-identical inference time to baseline; Direct CON inference is ~20x slower without hybrid training (Table 5)
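
The 50/50 hybrid mix can be sketched as a data-preparation step: half of the examples keep an answer-only target, and the other half get a notes-plus-answer target. This is a rough illustration under assumptions. The field names (`input`, `notes`, `answer`), the function `build_hybrid_targets`, and the choice to split by example count are all hypothetical; the summary does not say exactly how the authors apply the 50% split.

```python
import random
from typing import Dict, List


def build_hybrid_targets(
    examples: List[Dict[str, str]],
    con_fraction: float = 0.5,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Assign a CON target to a fixed fraction of examples, answer-only to the rest."""
    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)
    n_con = int(round(len(examples) * con_fraction))
    con_ids = set(order[:n_con])

    mixed = []
    for i, ex in enumerate(examples):
        if i in con_ids:
            # CON target: reading notes first, then the final answer.
            target = f"{ex['notes']}\nFinal answer: {ex['answer']}"
            mode = "con"
        else:
            # Standard target: the answer alone, as in retrieve-then-read.
            target = ex["answer"]
            mode = "standard"
        mixed.append({"input": ex["input"], "target": target, "mode": mode})
    return mixed
```

Because the standard-mode examples teach the model to answer without emitting notes, inference can use the short output format, which is consistent with the near-baseline latency reported for the hybrid-trained model.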

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Direct CON decoding is much slower (~12s vs 0.6s) and impractical without hybrid training.

10K training data is synthesized by GPT-4; method quality depends on teacher prompts and may inherit biases.

When Not To Use

Latency-sensitive production paths without hybrid training

Systems with no external retriever or where retriever is already near-perfect

Failure Modes

If the notes themselves are misleading, the synthesis step can still produce hallucinations.

Teacher-generated labels may encode systematic errors that the fine-tuned model reproduces.

Core Entities

Models

LLaMa-2 7B, GPT-4

Metrics

Exact Match (EM), F1, Accuracy, Reject Rate (RR)

Datasets

Natural Questions (NQ), TriviaQA, WebQuestions (WebQ), RealTimeQA

Benchmarks

open-domain QA (Wikipedia/DPR)