Finetune LLMs on synthetic key-value tasks to improve long-context retrieval and reasoning without adding factual hallucinations

June 27, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

1

Authors

Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos

Links

Abstract / PDF

Why It Matters For Business

A small synthetic finetuning set can materially improve long-document retrieval and reasoning without adding factual hallucinations or hurting general abilities, making it a low-risk, low-cost upgrade for LLM products that handle long inputs.

Summary TLDR

The authors finetune GPT-3.5 Turbo and Mistral 7B on a small synthetic dataset of numerical key-value retrieval tasks (simple and multi-subkey). Finetuning (often 2–3 epochs) improves retrieval across long contexts (MDQA) and reasoning on long inputs (FLenQA), reduces positional bias (lost-in-the-middle/primacy), and preserves performance on general benchmarks. Using an explicit answer template during finetuning helps. Synthetic data avoids introducing factual knowledge that can encourage hallucinations. Results are averaged across a few seeds and compared with other long-context augmentation datasets.

Problem Statement

Large language models lose accuracy when retrieving facts or reasoning over long contexts. Existing long-context datasets can help but sometimes introduce factual information that causes hallucinations. The paper asks: can a small, purely synthetic key-value retrieval dataset teach LLMs robust long-context retrieval and reasoning without harming general abilities?

Main Contribution

Design of two synthetic tasks: simple key-value retrieval and multi-subkey key-value retrieval, with optional answer templates.

Empirical finetuning recipe: small datasets (~150–350 samples, ~4K tokens each), 2–3 epochs, finetuning on the answer tokens only.

Demonstration that finetuning on synthetic data improves long-context retrieval (MDQA) and long-context reasoning (FLenQA), while not degrading general benchmarks and avoiding hallucination risk from factual finetuning.
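To make the simple key-value retrieval task concrete, here is a minimal sketch of a sample generator. The prompt wording, the answer template, the function name `make_sample`, and all default sizes are illustrative assumptions, not the authors' exact format. The multi-subkey variant is not covered.

```python
import random

def make_sample(num_dicts=5, keys_per_dict=4, seed=0):
    """Build one simple key-value retrieval sample as (prompt, answer).

    Hypothetical helper: wording and sizes are assumptions for illustration.
    """
    rng = random.Random(seed)
    # Draw unique integer keys so exactly one dictionary holds the gold key.
    total = num_dicts * keys_per_dict
    keys = rng.sample(range(1000, 10000), total)
    dicts = []
    for i in range(num_dicts):
        chunk = keys[i * keys_per_dict:(i + 1) * keys_per_dict]
        dicts.append({k: rng.randint(1000, 9999) for k in chunk})
    # Pick the dictionary and key the model must retrieve.
    gold_idx = rng.randrange(num_dicts)
    gold_key = rng.choice(list(dicts[gold_idx]))
    gold_value = dicts[gold_idx][gold_key]
    lines = [f"Dictionary [{i + 1}] {d}" for i, d in enumerate(dicts)]
    prompt = (
        "Below is a list of dictionaries.\n"
        + "\n".join(lines)
        + f"\nReport the value of key {gold_key} and the dictionary it is in."
    )
    # Fixed answer template: the model only has to fill in the numbers.
    answer = (
        f"The key {gold_key} is in Dictionary [{gold_idx + 1}] "
        f"with value {gold_value}."
    )
    return prompt, answer
```

To approach the ~4K tokens per sample reported above, `num_dicts` and `keys_per_dict` would need to be raised well beyond these small defaults.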

Key Findings

Finetuning on synthetic key-value tasks improves long-context retrieval accuracy.

Numbers: GPT-3.5 Turbo: +10.5% on 20-doc MDQA at position 10 (reported)

Synthetic finetuning often beats finetuning on the target MDQA data itself.

Numbers: Synthetic ft > MDQA-ft on MDQA curves (Fig. 5 comparisons)

Using an explicit answer template during finetuning improves learning and output consistency.

Numbers: Token-level loss on answer formatting drops with template (Fig. 4); template variants outperform non-template in MDQA/FLn

Synthetic finetuning does not harm general benchmarks and avoids hallucination seen in factual baselines.

Numbers: Mistral-7B ft (w/ template): TriviaQA +0.11, NQ-Open +0.37, vs. drops of -2.43 to -6.73 for other baselines

Synthetic finetuning improves long-context reasoning even without explicit chain-of-thought.

Numbers: FLenQA accuracy rises for finetuned models in both chain-of-thought and direct-answer settings (Figs. 6–7)

Results

Accuracy

Value: GPT-3.5 Turbo +10.5% at position 10 after synthetic ft

Baseline: GPT-3.5 Turbo original

General benchmarks (Mistral-7B ft w/template)

Value: MMLU 53.44%, HellaSwag 56.22%, GSM8K 34.34%, TriviaQA 47.74%, NQ-Open 11.98%

Baseline: Mistral-7B original (MMLU 53.42%, HellaSwag 56.31%, GSM8K 34.65%, TriviaQA 47.63%, NQ-Open 11.61%)
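The per-benchmark change can be read off the two rows above with simple subtraction. The short sketch below does that arithmetic on the reported numbers; the variable names are illustrative.

```python
# Scores (%) as reported for Mistral-7B, finetuned with template vs. original.
finetuned = {"MMLU": 53.44, "HellaSwag": 56.22, "GSM8K": 34.34,
             "TriviaQA": 47.74, "NQ-Open": 11.98}
original = {"MMLU": 53.42, "HellaSwag": 56.31, "GSM8K": 34.65,
            "TriviaQA": 47.63, "NQ-Open": 11.61}

# Difference in percentage points, rounded to two decimals.
deltas = {name: round(finetuned[name] - original[name], 2) for name in finetuned}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

Every change is within about a third of a percentage point, and the TriviaQA and NQ-Open deltas match the +0.11 and +0.37 quoted under Key Findings.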

Degradation from factual baselines (Mistral-7B)

Value: Needle-in-a-haystack: TriviaQA -6.33%, NQ-Open -6.73%

Baseline: Original Mistral-7B

Who Should Care

What To Try In 7 Days

Generate ~150–350 synthetic key-value retrieval prompts (4K tokens each).

Finetune your target LLM for 2–3 epochs on just the answer tokens (use an answer template).

Run MDQA-style tests with varying gold-document positions to check for positional bias fixes.
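The positional-bias check in the last step can be sketched as a sweep over gold-document positions. Everything here is an assumption for illustration: the prompt layout, the example schema, the substring-match scoring, and the `generate` callable, which stands in for whatever model call you use. `primacy_stub` is a toy stand-in that only "reads" the start of the prompt.

```python
def build_prompt(gold_doc, distractors, gold_position, question):
    """Insert the gold document at a chosen index among the distractors."""
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)
    body = "\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(docs))
    return f"{body}\n\nQuestion: {question}\nAnswer:"

def accuracy_by_position(examples, positions, generate):
    """Return {position: accuracy}, scoring by case-insensitive substring match.

    Each example is a dict with keys gold_doc, distractors, question, answers.
    """
    results = {}
    for pos in positions:
        hits = 0
        for ex in examples:
            prompt = build_prompt(ex["gold_doc"], ex["distractors"],
                                  pos, ex["question"])
            output = generate(prompt).lower()
            hits += any(a.lower() in output for a in ex["answers"])
        results[pos] = hits / len(examples)
    return results

def primacy_stub(prompt):
    """Toy stand-in for a model: echoes only the first two prompt lines."""
    return " ".join(prompt.split("\n")[:2])

examples = [{
    "gold_doc": "Paris is the capital of France.",
    "distractors": ["Filler text A.", "Filler text B.", "Filler text C."],
    "question": "What is the capital of France?",
    "answers": ["Paris"],
}]
curve = accuracy_by_position(examples, positions=[0, 2], generate=primacy_stub)
```

With the toy stub, accuracy is 1.0 when the gold document is first and 0.0 when it sits at index 2, which is the kind of drop a real sweep is meant to expose. Compare the curve before and after finetuning.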

Optimization Features

Token Efficiency

  • each synthetic sample ~4K tokens to exercise long context

Model Optimization

  • finetune all attention layers on Mistral 7B

Training Optimization

  • small datasets (150–350 examples), 2–3 epochs
  • global batch size 16, lr 5e-6 for Mistral 7B
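Training on the answer tokens only is commonly implemented by masking prompt positions in the label sequence. The sketch below shows that convention, assuming a trainer that skips label `-100` (the default `ignore_index` of PyTorch's `CrossEntropyLoss`). The `build_labels` helper and the `TRAIN_CONFIG` field names are hypothetical; only the batch size, learning rate, and epoch range come from the summary above.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

# Hypothetical config layout; values are the ones reported for Mistral 7B.
TRAIN_CONFIG = {
    "global_batch_size": 16,
    "learning_rate": 5e-6,
    "num_epochs": 2,  # the summary reports 2-3 epochs
}

def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer, masking the prompt out of the loss."""
    input_ids = list(prompt_ids) + list(answer_ids)
    # Prompt positions get IGNORE_INDEX, so only answer tokens are trained on.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels
```

For example, `build_labels([1, 2, 3], [4, 5])` returns `([1, 2, 3, 4, 5], [-100, -100, -100, 4, 5])`.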

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Does not help when distractors are relevant (retrieved similar documents); no improvement on MDQA with relevant distractors.
  • Small training datasets and few model families tested; results may vary on larger models or other architectures.
  • No public code or dataset link provided in paper for immediate reproduction.

When Not To Use

  • When the task requires adding or updating real factual knowledge.
  • When distractors are semantically similar or relevance-based (retrieved docs).
  • If you cannot perform any model finetuning on your deployment model.

Failure Modes

  • No gain when distractors are relevant to the query.
  • Possible over-reliance on template format if production prompts differ.
  • Finetuning on factual baseline data can still outperform on some target data, but at the risk of hallucination.

Core Entities

Models

  • GPT-3.5 Turbo
  • Mistral 7B
  • Mistral-7b-Instruct-v0.2

Metrics

  • Accuracy
  • maximum subspan exact match
  • token-level loss
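Maximum subspan exact match is usually taken to mean that a prediction counts as correct if any gold answer appears as a substring after normalization. The sketch below assumes SQuAD-style normalization (lowercasing, stripping punctuation and articles); the paper's exact scoring code may differ, and the function names are illustrative.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def best_subspan_em(prediction, gold_answers):
    """Return 1.0 if any normalized gold answer occurs in the prediction."""
    pred = normalize(prediction)
    for gold in gold_answers:
        gold_norm = normalize(gold)
        # Skip answers that normalize to an empty string: "" is in every string.
        if gold_norm and gold_norm in pred:
            return 1.0
    return 0.0
```

Because the match is substring-based, a verbose but correct output such as "The capital is Paris, France." still scores 1.0 against the gold answer "Paris".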

Datasets

  • Synthetic key-value retrieval (this paper)
  • MDQA
  • FLenQA
  • MMLU
  • HellaSwag
  • GSM8K
  • TriviaQA
  • NQ-Open
  • MultidocQA
  • IN2
  • Needle-in-a-haystack

Benchmarks

  • MDQA
  • FLenQA
  • MMLU
  • HellaSwag
  • GSM8K
  • TriviaQA
  • NQ-Open

Context Entities

Models

  • GPT-3.5-turbo-1106
  • Mistral-7B-Instruct-v0.1

Datasets

  • FLenQA (from Levy et al.)
  • MDQA (from Liu et al.)