Overview
CON is a practical note-taking layer that improves noise handling and abstention. Results are consistent across several datasets, but they rely on GPT-4-generated labels and a DPR/Wikipedia retrieval setup.
Citations: 9
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 6/6
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
CON reduces incorrect answers caused by irrelevant retrieval and helps systems safely abstain on out-of-date or unknown queries, improving reliability in search and customer-facing QA products.
Summary TLDR
Chain-of-Note (CON) asks a reader model to generate short, sequential reading notes for each retrieved document, then synthesizes those notes into the final answer. CON helps models detect irrelevant retrieved passages, reduce hallucination, and explicitly reject questions outside their knowledge. With GPT-4, CON prompting beats chain-of-thought prompting in retrieval settings. A 10K-example dataset created with GPT-4 was used to fine-tune LLaMa-2 7B; CON gave small overall QA gains and large robustness gains against noisy retrieval and unknown (real-time) questions. Main practical trade-off: decoding is much slower unless you use the authors' hybrid training strategy.
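The note-then-answer flow described above can be sketched as a prompt template. This is a minimal illustration, not the paper's prompt: the instruction wording, the `build_con_prompt` name, and the `Passage`/`Note`/`Final answer` labels are all assumptions made for the example.

```python
from typing import List


def build_con_prompt(question: str, passages: List[str]) -> str:
    """Assemble a note-then-answer prompt.

    The instruction text and field labels are illustrative assumptions,
    not the wording used in the paper.
    """
    lines = [
        "Read each passage in order and write a brief note on whether and",
        "how it helps answer the question. Then give a final answer based on",
        "your notes. If neither the passages nor your own knowledge support",
        "an answer, reply exactly: unknown.",
        "",
        f"Question: {question}",
        "",
    ]
    # One numbered passage per retrieved document.
    for i, passage in enumerate(passages, start=1):
        lines.append(f"Passage {i}: {passage}")
    lines.append("")
    # One empty note slot per passage, followed by the answer slot.
    for i in range(1, len(passages) + 1):
        lines.append(f"Note {i}:")
    lines.append("Final answer:")
    return "\n".join(lines)


prompt = build_con_prompt(
    "Who wrote The Old Man and the Sea?",
    [
        "Ernest Hemingway published The Old Man and the Sea in 1952.",
        "The Sea, the Sea is a novel by Iris Murdoch.",
    ],
)
print(prompt)
```

The second passage is deliberately off-topic, which makes it easy to check by eye whether the model's note for it flags the passage as irrelevant.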
Problem Statement
Retrieval-augmented models can be misled by irrelevant or noisy retrieved documents and may ignore their own internal knowledge. They also lack a reliable way to abstain ('unknown') when neither parametric nor retrieved knowledge supports an answer.
Main Contribution
Introduce CHAIN-OF-NOTE (CON): generate per-document reading notes, then synthesize the answer from those notes.
Create 10K CON training examples using GPT-4 and fine-tune LLaMa-2 7B to learn note-taking.
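To make the fine-tuning data concrete, here is one possible way to serialize a teacher-generated example into an input/target pair. The record schema (`input`, `target`), the field labels, and the `make_training_record` helper are hypothetical; the summary does not specify the authors' actual format.

```python
import json
from typing import Dict, List


def make_training_record(question: str, passages: List[str],
                         notes: List[str], answer: str) -> Dict[str, str]:
    """Pair a question and its passages with notes and a final answer.

    Hypothetical schema: the model reads `input` and learns to emit `target`.
    """
    if len(passages) != len(notes):
        raise ValueError("expected exactly one note per passage")
    source = "\n".join(
        [f"Question: {question}"]
        + [f"Passage {i}: {p}" for i, p in enumerate(passages, start=1)]
    )
    target = "\n".join(
        [f"Note {i}: {n}" for i, n in enumerate(notes, start=1)]
        + [f"Final answer: {answer}"]
    )
    return {"input": source, "target": target}


record = make_training_record(
    question="Who won the most recent city marathon?",
    passages=["The city marathon was first held in 1981."],
    notes=["The passage gives the founding year only and names no winner."],
    answer="unknown",
)
jsonl_line = json.dumps(record)  # one line of a JSONL training file
print(jsonl_line)
```

Keeping the notes in the target, ahead of the answer, is what lets the fine-tuned model learn to write them before committing to an answer or to "unknown".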
Key Findings
CON improves average Exact Match (EM) over the standard retrieve-then-read baseline when fine-tuning LLaMa-2 7B: 50.46 vs 48.49 (+1.97, Table 2).
CON greatly reduces the harm from fully noisy retrieved documents.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| EM (LLaMa-2 7B, average across datasets) | 50.46 | Retrieve-Read 48.49 | +1.97 | NQ/TriviaQA/WebQ full test (Table 2) | CON vs Retrieve-Read on LLaMa-2 7B | Table 2 |
| EM (GPT-4, average) | 65.7 | Retrieve-Read 63.1 | +2.6 | NQ/TriviaQA/WebQ full test (Table 2) | Zero-shot GPT-4 prompts with CON | Table 2 |
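The reported deltas can be checked directly from the table values. The short script below recomputes each absolute delta and also derives the relative gain over the baseline; the relative figures are not reported in the table and are computed here only for context.

```python
# (model, CON EM, Retrieve-Read EM, reported delta), copied from the table above.
rows = [
    ("LLaMa-2 7B", 50.46, 48.49, 1.97),
    ("GPT-4", 65.7, 63.1, 2.6),
]

checks = []
for model, con_em, baseline_em, reported in rows:
    delta = round(con_em - baseline_em, 2)
    relative_pct = 100.0 * (con_em - baseline_em) / baseline_em
    matches = abs(delta - reported) < 1e-9
    checks.append(matches)
    print(f"{model}: delta={delta:.2f} relative={relative_pct:.1f}% "
          f"matches_reported={matches}")
```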
What To Try In 7 Days
Prompt GPT-4 to produce per-document reading notes on a sample of your retrieval outputs and inspect whether notes flag irrelevant docs.
Fine-tune a small LLaMa-2 style model on a few hundred human-reviewed note examples to test internalized CON behavior.
Run an A/B test on queries where retriever quality is poor, and measure EM/F1 for answerable questions and rejection rate (RR) for questions the system should abstain on.
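For the A/B comparison, EM and rejection rate can be scored with a few lines of code. This is a minimal sketch: the normalization follows the common open-domain QA convention (lowercase, strip punctuation and articles), and treating the literal string "unknown" as the abstention signal is an assumption that should match whatever your prompt asks the model to emit.

```python
import re
import string
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def is_rejection(prediction: str) -> bool:
    # Assumption: the system abstains by answering "unknown".
    return normalize(prediction) == "unknown"


def evaluate(predictions: List[str], gold: List[List[str]],
             answerable: List[bool]) -> Dict[str, float]:
    """EM (%) over answerable questions, RR (%) over unanswerable ones."""
    em_hits = em_total = rr_hits = rr_total = 0
    for pred, golds, can_answer in zip(predictions, gold, answerable):
        if can_answer:
            em_total += 1
            em_hits += exact_match(pred, golds)
        else:
            rr_total += 1
            rr_hits += is_rejection(pred)
    return {
        "em": 100.0 * em_hits / em_total if em_total else 0.0,
        "rr": 100.0 * rr_hits / rr_total if rr_total else 0.0,
    }


scores = evaluate(
    predictions=["The Eiffel Tower", "unknown", "Paris"],
    gold=[["Eiffel Tower"], [], ["Lyon"]],
    answerable=[True, False, True],
)
print(scores)
```

Run the same function on the outputs of both arms (with and without notes) over an identical query set so the two scores are directly comparable.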
Risks & Boundaries
Limitations
Direct CON decoding is much slower (~12 s vs. 0.6 s without notes) and is impractical without hybrid training.
The 10K training examples are synthesized by GPT-4, so method quality depends on the teacher prompts and may inherit the teacher's biases.
When Not To Use
Latency-sensitive production paths without hybrid training
Systems with no external retriever or where retriever is already near-perfect
Failure Modes
If the notes themselves are misleading, the synthesis step can still produce hallucinations.
Teacher-generated labels may encode systematic errors that the fine-tuned model reproduces.

