Have models write short reading notes per retrieved doc to ignore noise and say “unknown” when needed.

Overview

Decision Snapshot: Needs Validation

CON is a practical note-taking layer that improves noise handling and abstention. Results are consistent across several datasets but rely on GPT-4-generated labels and a DPR/Wikipedia retrieval setup.

Citations: 9

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, Dong Yu

Why It Matters For Business

CON reduces incorrect answers caused by irrelevant retrieval and helps systems safely abstain on out-of-date or unknown queries, improving reliability in search and customer-facing QA products.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

Chain-of-Note (CON) asks a reader model to generate short, sequential reading notes for each retrieved document, then synthesizes those notes into the final answer. CON helps models detect irrelevant retrieved passages, reduce hallucination, and explicitly reject questions outside their knowledge. GPT-4 prompts with CON beat chain-of-thought in retrieval settings. A 10K GPT-4-created dataset was used to fine-tune LLaMa-2 7B; CON gave small overall QA gains and large robustness gains against noisy retrieval and unknown (real-time) questions. Main practical trade-off: much slower decoding unless you use their hybrid training trick.

Problem Statement

Retrieval-augmented models can be misled by irrelevant or noisy retrieved documents and may ignore their own internal knowledge. They also lack a reliable way to abstain ('unknown') when neither parametric nor retrieved knowledge supports an answer.

Main Contribution

Introduce CHAIN-OF-NOTE (CON): generate per-document reading notes, then synthesize the answer from those notes.

Create 10K CON training examples using GPT-4 and fine-tune LLaMa-2 7B to learn note-taking.
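
The note-then-answer flow can be sketched as a prompt builder plus an output parser. This is a minimal illustration, not the authors' prompt: the function names build_con_prompt and parse_final_answer, the instruction wording, and the "Final answer:" marker are all assumptions made for this example.

```python
from typing import List

UNKNOWN = "unknown"              # assumed abstention token for this sketch
ANSWER_MARKER = "Final answer:"  # assumed output marker, not from the paper


def build_con_prompt(question: str, passages: List[str]) -> str:
    """Build a prompt asking for one note per passage, then an answer."""
    parts = [
        "Read each passage and write one short note on whether it helps "
        "answer the question.",
        "Then write the answer on a last line starting with "
        f"'{ANSWER_MARKER}'.",
        "If neither the passages nor your own knowledge supports an answer, "
        f"write '{ANSWER_MARKER} {UNKNOWN}'.",
        "",
        f"Question: {question}",
    ]
    for i, passage in enumerate(passages, start=1):
        parts.append(f"Passage {i}: {passage}")
    return "\n".join(parts)


def parse_final_answer(model_output: str) -> str:
    """Return the text after the last answer marker, or UNKNOWN if absent."""
    idx = model_output.rfind(ANSWER_MARKER)
    if idx == -1:
        return UNKNOWN
    tail = model_output[idx + len(ANSWER_MARKER):].strip()
    return tail.splitlines()[0].strip() if tail else UNKNOWN
```

Treating a missing or empty answer line as "unknown" is a conservative choice for this sketch; a production system might prefer to log and retry instead.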

Key Findings

CON improves average Exact Match over standard retrieve-then-read models when fine-tuning LLaMa-2 7B.

Numbers: EM +1.97 avg across NQ/TriviaQA/WebQ (Table 2)

Practical Use: If you fine-tune a mid-size retrieval-augmented language model (RALM), adding CON-style outputs can nudge overall QA accuracy up by ~2 EM points on common open-domain datasets.

Evidence Ref: Table 2

CON greatly reduces the harm from fully noisy retrieved documents.

Numbers: EM +7.94 avg on fully noisy retrieval (Table 3)

Practical Use: When your retriever returns mostly irrelevant documents, switching the reader to generate and use per-document notes can recover ~8 EM points versus a standard reader.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
EM (LLaMa-2 7B, average across datasets)	50.46	Retrieve-Read 48.49	+1.97	NQ/TriviaQA/WebQ full test (Table 2)	CON vs Retrieve-Read on LLaMa-2 7B	Table 2
EM (GPT-4, average)	65.7	Retrieve-Read 63.1	+2.6	NQ/TriviaQA/WebQ full test (Table 2)	Zero-shot GPT-4 prompts with CON	Table 2

What To Try In 7 Days

Prompt GPT-4 to produce per-document reading notes on a sample of your retrieval outputs and inspect whether notes flag irrelevant docs.

Fine-tune a small LLaMa-2 style model on a few hundred human-reviewed note examples to test internalized CON behavior.

Run an A/B test on queries where retriever quality is poor to measure EM/F1 and abstention, reported as reject rate (RR).
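
For the A/B comparison, Exact Match and reject rate can be computed with a few lines of Python. The sketch below assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and articles) and counts a prediction as a rejection when it normalizes to "unknown". The function names and the abstention string are assumptions for illustration; the paper's own scoring script is not shown in this summary.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def em_score(predictions: List[str], gold_answers: List[List[str]]) -> float:
    """Percentage of predictions matching any gold answer after normalization."""
    if not predictions or len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must be non-empty and aligned")
    hits = sum(
        normalize(pred) in {normalize(g) for g in golds}
        for pred, golds in zip(predictions, gold_answers)
    )
    return 100.0 * hits / len(predictions)


def reject_rate(predictions: List[str], unknown: str = "unknown") -> float:
    """Percentage of predictions that abstain (assumed abstention string)."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    rejected = sum(normalize(pred) == unknown for pred in predictions)
    return 100.0 * rejected / len(predictions)
```

Reject rate is only meaningful on a query set where abstaining is the desired behavior, such as questions about events newer than the model and the retrieval corpus; on answerable questions a high value signals over-abstention.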

Agent Features

Memory

retrieval memory (external docs)

Planning

sequential note generation per retrieved doc

Tool Use

DPR retriever

GPT-4 as teacher for data generation

LLaMa-2 7B fine-tuning

Frameworks

CHAIN-OF-NOTE

Optimization Features

Infra Optimization

Training uses bfloat16 and DeepSpeed on multi-GPU (A100) setups

System Optimization

Greedy decoding for deterministic outputs

Use DeepSpeed and ZeRO for fine-tuning

Training Optimization

Hybrid training: 50% standard QA + 50% CON to internalize notes

Use GPT-4 to generate 10K training examples to avoid manual labeling

Best learning rate reported: 5e-6
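
The 50/50 hybrid mix can be sketched as a target-construction step over the training set. This summary states only the ratio, so the details below are assumptions: a per-example random split, identical prompts for both modes, and the hypothetical field names prompt, notes, and answer.

```python
import random
from typing import Dict, List


def build_hybrid_targets(
    examples: List[Dict[str, str]],
    con_fraction: float = 0.5,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Assign each example either a notes+answer target or an answer-only target."""
    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)
    n_con = round(len(examples) * con_fraction)
    con_ids = set(order[:n_con])  # indices trained with full notes

    mixed = []
    for i, ex in enumerate(examples):
        if i in con_ids:
            # CON mode: the model learns to write notes before answering.
            target = ex["notes"] + "\nFinal answer: " + ex["answer"]
            mode = "con"
        else:
            # Standard QA mode: the model learns to answer directly.
            target = ex["answer"]
            mode = "qa"
        mixed.append({"prompt": ex["prompt"], "target": target, "mode": mode})
    return mixed
```

The intent of such a mix is that the model absorbs the note-taking behavior during training yet can emit a short direct answer at inference, which is where the latency saving would come from.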

Inference Optimization

Hybrid-trained model yields near-identical inference time to baseline

Direct CON inference is ~20x slower without hybrid training (Table 5)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Direct CON decoding is much slower (~12s vs 0.6s) and impractical without hybrid training.

10K training data is synthesized by GPT-4; method quality depends on teacher prompts and may inherit biases.

When Not To Use

Latency-sensitive production paths without hybrid training

Systems with no external retriever or where retriever is already near-perfect

Failure Modes

If the notes themselves are misleading, the synthesis step can still produce hallucinations.

Teacher-generated labels may encode systematic errors that the fine-tuned model reproduces.

Core Entities

Models

LLaMa-2 7B, GPT-4

Metrics

Exact Match (EM), F1, Accuracy, Reject Rate (RR)

Datasets

Natural Questions (NQ), TriviaQA, WebQuestions (WebQ), RealTimeQA

Benchmarks

open-domain QA (Wikipedia/DPR)
