Finetune LLMs on synthetic key-value tasks to improve long-context retrieval and reasoning without adding factual hallucinations

June 27, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Evidence shows consistent gains on MDQA and FLenQA and small or no drops on general benchmarks, but the experiments use only a few seeds and a limited range of models, so broader transfer should be validated.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos

Links

Abstract / PDF

Why It Matters For Business

A small synthetic finetuning set can materially improve long-document retrieval and reasoning without adding factual hallucinations or hurting general abilities, making it a low-risk, low-cost upgrade for LLM products that handle long inputs.

Summary TLDR

The authors finetune GPT-3.5 Turbo and Mistral 7B on a small synthetic dataset of numerical key-value retrieval tasks (simple and multi-subkey). Finetuning (often 2–3 epochs) improves retrieval across long contexts (MDQA) and reasoning on long inputs (FLenQA), reduces positional bias (lost-in-the-middle/primacy), and preserves performance on general benchmarks. Using an explicit answer template during finetuning helps. Synthetic data avoids introducing factual knowledge that can encourage hallucinations. Results are averaged across a few seeds and compared with other long-context augmentation datasets.

Problem Statement

Large language models lose accuracy when retrieving facts or reasoning over long contexts. Existing long-context datasets can help but sometimes introduce factual information that causes hallucinations. The paper asks: can a small, purely synthetic key-value retrieval dataset teach LLMs robust long-context retrieval and reasoning without harming general abilities?

Main Contribution

Design of two synthetic tasks: simple key-value retrieval and multi-subkey key-value retrieval, with optional answer templates.

Empirical finetuning recipe: small datasets (~150–350 samples, ~4K tokens each), 2–3 epochs, training only on the answer tokens.
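
To make the simple key-value retrieval task concrete, here is a minimal sketch of a sample generator. The prompt wording, the dictionary layout, the key range, and the function name `make_kv_sample` are assumptions for illustration, not the authors' exact format; the only details taken from the summary are the integer key-value pairs and the optional answer template.

```python
import random


def make_kv_sample(num_dicts=10, keys_per_dict=4, use_template=True, seed=0):
    """Build one synthetic simple key-value retrieval sample.

    Hypothetical format: the layout and wording are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Draw unique integer keys so that exactly one dictionary holds the gold key.
    keys = rng.sample(range(10_000, 100_000), num_dicts * keys_per_dict)
    dicts = []
    for i in range(num_dicts):
        chunk = keys[i * keys_per_dict:(i + 1) * keys_per_dict]
        dicts.append({k: rng.randrange(10_000, 100_000) for k in chunk})

    # Pick the gold dictionary and the key the model must look up.
    gold_index = rng.randrange(num_dicts)
    gold_key = rng.choice(list(dicts[gold_index]))
    gold_value = dicts[gold_index][gold_key]

    lines = [f"Dictionary [{i}] {d}" for i, d in enumerate(dicts)]
    prompt = (
        "Below is a list of dictionaries with integer keys and values.\n"
        + "\n".join(lines)
        + f"\nFind the value of key {gold_key}."
    )

    # The answer template fixes the output format; without it the target is the bare value.
    if use_template:
        answer = f"The value of key {gold_key} is {gold_value}."
    else:
        answer = str(gold_value)

    return {
        "prompt": prompt,
        "answer": answer,
        "gold_index": gold_index,
        "gold_key": gold_key,
        "gold_value": gold_value,
    }
```

Raising `num_dicts` and `keys_per_dict` lengthens the prompt toward the ~4K-token samples described above; the exact counts needed depend on the tokenizer.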

Key Findings

Finetuning on synthetic key-value tasks improves long-context retrieval accuracy.

Numbers: GPT-3.5 Turbo: +10.5% on 20-doc MDQA at position 10 (reported)

Practical Use: If your model misses answers in the middle of long inputs, finetune it on synthetic key-value retrieval samples to boost middle-position recall.

Evidence Ref: Abstract; Fig.5a

Synthetic finetuning often beats finetuning on the target MDQA data itself.

Numbers: Synthetic ft > MDQA-ft on MDQA curves (Fig.5 comparisons)

Practical Use: Rather than collecting many real long-context QA pairs, a compact synthetic key-value set can yield better retrieval transfer for MDQA-like tasks.

Evidence Ref: Sec.3.2.1; Fig.5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | GPT-3.5 Turbo +10.5% at position 10 after synthetic ft | GPT-3.5 Turbo original | +10.5% | MDQA 20-doc, position 10 | Abstract; Fig.5a | Fig.5a
General benchmarks (Mistral-7B ft w/template) | MMLU 53.44%, HellaSwag 56.22%, GSM8K 34.34%, TriviaQA 47.74%, NQ-Open 11.98% | Mistral-7B original (MMLU 53.42, HellaSwag 56.31, GSM8K 34.65, TriviaQA 47.63, NQ-Open 11.61) | changes within ±0.5% (listed per-table) | MMLU/HellaSwag/GSM8K/TriviaQA/NQ-Open | Table 1; Sec.3.3 | Table 1

What To Try In 7 Days

Generate ~150–350 synthetic key-value retrieval prompts (4K tokens each).

Finetune your target LLM for 2–3 epochs on just the answer tokens (use an answer template).

Run MDQA-style tests with varying gold-document positions to check for positional bias fixes.
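
The positional-bias check in the last step can be sketched as a sweep over gold-document positions. This is a minimal illustration, not the paper's evaluation harness: the prompt layout, the example schema (`question`, `gold_doc`, `distractors`, `answers`), and the `generate` callable standing in for your model are all assumptions. The scoring function is a simple take on subspan exact match: a prediction counts as correct if any normalized gold answer appears inside it.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def subspan_match(prediction, gold_answers):
    """Return 1.0 if any normalized gold answer is a substring of the prediction."""
    pred = normalize(prediction)
    return float(any(normalize(gold) in pred for gold in gold_answers))


def build_prompt(question, gold_doc, distractors, gold_position):
    """Place the gold document at a 0-based position among the distractors."""
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)
    body = "\n".join(f"Document [{i + 1}] {doc}" for i, doc in enumerate(docs))
    return f"{body}\n\nQuestion: {question}\nAnswer:"


def accuracy_by_position(examples, generate, positions):
    """Mean subspan-match accuracy for each gold-document position.

    `generate` is a hypothetical callable mapping a prompt string to model output.
    """
    results = {}
    for pos in positions:
        scores = [
            subspan_match(
                generate(build_prompt(ex["question"], ex["gold_doc"], ex["distractors"], pos)),
                ex["answers"],
            )
            for ex in examples
        ]
        results[pos] = sum(scores) / len(scores)
    return results
```

Running the sweep on the original and the finetuned model and comparing the two dictionaries shows whether accuracy at middle positions moved closer to accuracy at the first and last positions.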

Optimization Features

Token Efficiency
each synthetic sample ~4K tokens to exercise long context
Model Optimization
finetune all attention layers on Mistral 7B
Training Optimization
small datasets (150–350 examples), 2–3 epochs; global batch size 16, lr 5e-6 for Mistral 7B
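
Training only on the answer tokens is commonly implemented by masking the prompt positions in the label sequence. The sketch below assumes token IDs are already available as lists of integers and uses -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`; the names `build_labels`, `TRAIN_CONFIG`, and `optimizer_steps` are hypothetical, and the config only restates the hyperparameters listed above rather than the authors' full setup.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer IDs; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Values restated from the summary for Mistral 7B; everything else is left unspecified.
TRAIN_CONFIG = {
    "num_samples": 350,       # summary reports 150-350 examples
    "num_epochs": 2,          # summary reports 2-3 epochs
    "global_batch_size": 16,
    "learning_rate": 5e-6,
}


def optimizer_steps(config):
    """Total optimizer steps, assuming the last partial batch is kept."""
    per_epoch = math.ceil(config["num_samples"] / config["global_batch_size"])
    return per_epoch * config["num_epochs"]
```

With these values a run is only a few dozen optimizer steps, which is consistent with the low-cost framing of the recipe.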

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Does not help when distractors are relevant to the query (for example, similar documents returned by a retriever): no improvement on MDQA with relevant distractors.

Small training datasets and few model families tested; results may vary on larger models or other architectures.

When Not To Use

When the task requires adding or updating real factual knowledge.

When distractors are semantically similar to the gold document or selected by relevance (for example, retrieved docs).

Failure Modes

No gain when distractors are relevant to the query.

Possible over-reliance on template format if production prompts differ.

Core Entities

Models

GPT-3.5 Turbo, Mistral 7B, Mistral-7b-Instruct-v0.2

Metrics

Accuracy, maximum subspan exact match, token-level loss

Datasets

Synthetic key-value retrieval (this paper), MDQA, FLenQA, MMLU, HellaSwag, GSM8K, TriviaQA, NQ-Open, MultidocQA, IN2, Needle-in-a-haystack

Benchmarks

MDQA, FLenQA, MMLU, HellaSwag, GSM8K, TriviaQA, NQ-Open

Context Entities

Models

GPT-3.5-turbo-1106, Mistral-7B-Instruct-v0.1

Datasets

FLenQA (from Levy et al.), MDQA (from Liu et al.)