Which memory formats and retrievers best help LLM agents reason over long text

December 17, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

1

Authors

Ruihong Zeng, Jinyuan Fang, Siwei Liu, Zaiqiao Meng

Why It Matters For Business

Choosing the right memory format and retriever raises agent accuracy and robustness for long documents; mixed memories plus iterative retrieval improve multi-hop and noisy scenarios while tuning retrieval size controls cost.

Summary TLDR

This paper tests four structured memory formats (chunks, knowledge triples, atomic facts, summaries) plus a mixed combination across three retrieval methods (single-step, reranking, iterative) on six long-context QA and dialogue datasets. Main takeaways: mixed memory is the most balanced and noise‑robust; iterative retrieval usually gives the biggest accuracy gains; chunks/summaries work best for long-context tasks, while triples/atomic facts give better relational precision. Mixed+iterative hit F1=82.11% on HotPotQA and 68.15% on 2WikiMultihopQA. Code and data are on GitHub.

Problem Statement

Different ways of structuring and retrieving memory for LLM agents are widely used, but we lack a systematic comparison showing which memory formats and retrievers work best for specific long-context tasks and how robust they are to noise.

Main Contribution

A controlled empirical study comparing four structural memory types and a mixed memory, across three retrieval methods and six datasets.

Practical findings: mixed memory yields the most balanced performance and noise resilience; iterative retrieval usually outperforms single-step and reranking.

Analysis of retrieval hyperparameters (K, R, T, N) and a short guideline on when to use Memory-Only vs Memory-Doc.

Key Findings

Mixed memory (chunks + triples + atomic facts + summaries) gives the most balanced performance across tasks.

Numbers: F1=82.11% on HotPotQA, F1=68.15% on 2WikiMultihopQA (iterative + mixed)
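A minimal sketch of what a mixed memory store could look like, assuming the four formats are pooled into one flat list of retrievable items. `MemoryItem`, `chunk_text`, and `build_mixed_memory` are hypothetical names, not the paper's code. The triples, atomic facts, and summary are passed in ready-made here; in practice they would come from LLM extraction prompts.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MemoryItem:
    kind: str    # one of "chunk", "triple", "atomic_fact", "summary"
    text: str    # the string that would be embedded and retrieved
    doc_id: str  # source document, kept so the original text can be fetched later


def chunk_text(text: str, max_words: int = 50) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_mixed_memory(
    doc_id: str,
    text: str,
    triples: List[Tuple[str, str, str]],
    atomic_facts: List[str],
    summary: str,
    max_words: int = 50,
) -> List[MemoryItem]:
    """Pool all four formats into one flat store (hypothetical layout)."""
    items = [MemoryItem("chunk", c, doc_id) for c in chunk_text(text, max_words)]
    items += [MemoryItem("triple", f"{s} | {r} | {o}", doc_id) for s, r, o in triples]
    items += [MemoryItem("atomic_fact", fact, doc_id) for fact in atomic_facts]
    items.append(MemoryItem("summary", summary, doc_id))
    return items
```

Keeping `kind` on every item makes it easy to compare a single-format store against the mixed one by filtering the same list.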

Iterative retrieval outperforms single-step and reranking on most evaluated datasets.

Numbers: Mixed+iterative F1=82.11% (HotPotQA), higher than reranking or single-step
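A rough sketch of an iterative retrieval loop, under stated assumptions: each turn retrieves a few memories, then the query is refined with the evidence gathered so far. This summary does not describe the paper's exact mechanics, so the word-overlap scorer (a stand-in for embedding similarity), the `refine_query` callable (a stand-in for an LLM rewrite step), and reading T as the per-turn retrieval size are all assumptions.

```python
from typing import Callable, List


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct lowercase words shared."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_top(query: str, memories: List[str], top_t: int) -> List[str]:
    """Return the top_t memories by overlap score (stable for ties)."""
    ranked = sorted(memories, key=lambda m: overlap_score(query, m), reverse=True)
    return ranked[:top_t]


def iterative_retrieve(
    question: str,
    memories: List[str],
    refine_query: Callable[[str, List[str]], str],
    n_turns: int = 2,
    top_t: int = 2,
) -> List[str]:
    """Run n_turns rounds, retrieving top_t not-yet-seen memories per round."""
    query = question
    collected: List[str] = []
    for _ in range(n_turns):
        candidates = [m for m in memories if m not in collected]
        collected.extend(retrieve_top(query, candidates, top_t))
        # Hypothetical refinement step; an LLM would rewrite the query here.
        query = refine_query(question, collected)
    return collected


def append_evidence(question: str, evidence: List[str]) -> str:
    """Simplest possible refinement: append retrieved text to the question."""
    return question + " " + " ".join(evidence)
```

With `n_turns=1` the loop reduces to single-step retrieval, which gives a direct baseline for measuring what the extra turns add in accuracy and latency.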

Chunks and summaries work better on tasks with very long contexts (reading comprehension, dialogue).

Numbers: Chunks accuracy=78.5% on QuALITY under reranking; summaries F1=32.26% on NarrativeQA (single-step)

Knowledge triples and atomic facts excel at relational precision and multi-hop reasoning.

Numbers: Triples F1=62.06% on 2WikiMultihopQA (iterative); atomic facts F1=81.29% on HotPotQA (iterative)

Mixed memory is more robust to added random noise documents than single memory types.

Numbers: Mixed memory declines more slowly across noise levels (Figure 8)
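One way to set up a comparable noise test, sketched under assumptions: random unrelated documents are mixed into the corpus before memories are built, and the number of noise documents is the noise level. How the paper defines its noise levels is not stated in this summary, so `add_noise_documents` and its parameters are hypothetical.

```python
import random
from typing import List


def add_noise_documents(
    documents: List[str],
    noise_pool: List[str],
    n_noise: int,
    seed: int = 0,
) -> List[str]:
    """Return the documents plus n_noise random distractors, shuffled."""
    if n_noise > len(noise_pool):
        raise ValueError("n_noise exceeds the size of the noise pool")
    rng = random.Random(seed)  # fixed seed keeps runs comparable
    mixed = list(documents) + rng.sample(noise_pool, n_noise)
    rng.shuffle(mixed)
    return mixed
```

Running the same QA pipeline at several values of `n_noise`, once per memory format, gives the kind of degradation curve the finding describes.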

Retrieval hyperparameters have clear sweet spots: moderate K/R/T and ~2–3 iterative turns give most gains.

Numbers: Performance peaks around K=50–100, R≈10, T≈50, N=2–3 (Figures 4–7)
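A small sketch of the retrieve-then-rerank pattern behind K and R, assuming K is the number of first-stage candidates and R is the number kept after reranking. Both scorers are hypothetical callables: in a real pipeline the first would be embedding similarity and the second an LLM or cross-encoder reranker.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    memories: List[str],
    first_stage_score: Scorer,
    rerank_score: Scorer,
    k: int = 100,
    r: int = 10,
) -> List[str]:
    """Keep the top-k by a cheap score, then the top-r by a costlier score."""
    if r > k:
        raise ValueError("r must not exceed k")
    candidates = sorted(
        memories, key=lambda m: first_stage_score(query, m), reverse=True
    )[:k]
    reranked = sorted(candidates, key=lambda m: rerank_score(query, m), reverse=True)
    return reranked[:r]


def word_overlap(query: str, text: str) -> float:
    """Toy first-stage score: count of distinct shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def normalized_overlap(query: str, text: str) -> float:
    """Toy rerank score: shared words divided by the memory's word count."""
    words = text.lower().split()
    return word_overlap(query, text) / max(len(words), 1)
```

Only the K candidates ever reach the expensive scorer, which is why K controls reranking cost while R controls how much text the model finally sees.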

Memory-Doc helps tasks needing broad document context; Memory-Only helps precision tasks.

Numbers: Qualitative comparison in Figure 3 (document retrieval improves context-heavy tasks)
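The two modes can be sketched as a switch on what goes into the prompt context. This is an interpretation, not the paper's definition: here Memory-Only passes the retrieved memory texts themselves, while Memory-Doc uses the retrieved memories only to locate and fetch their source documents. `build_context` and the `(doc_id, text)` pair layout are hypothetical.

```python
from typing import Dict, List, Tuple


def build_context(
    retrieved: List[Tuple[str, str]],  # (doc_id, memory_text) pairs, best first
    documents: Dict[str, str],         # doc_id -> full document text
    mode: str = "memory_only",
) -> List[str]:
    """Return the context passages for the answering prompt."""
    if mode == "memory_only":
        # Precise, compact context: just the retrieved memory strings.
        return [text for _, text in retrieved]
    if mode == "memory_doc":
        # Broad context: each source document once, in first-hit order.
        seen: List[str] = []
        for doc_id, _ in retrieved:
            if doc_id not in seen:
                seen.append(doc_id)
        return [documents[doc_id] for doc_id in seen]
    raise ValueError(f"unknown mode: {mode}")
```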

Results

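F1, HotPotQA (mixed memory, iterative retrieval): 82.11%

F1, 2WikiMultihopQA (mixed memory, iterative retrieval): 68.15%

F1, HotPotQA (atomic facts, iterative retrieval): 81.29%

Accuracy, QuALITY (chunks, reranking): 78.5%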

Who Should Care

What To Try In 7 Days

Implement mixed memory (chunks+triples+atomic+summary) for a key QA pipeline and compare F1 vs current store.

Swap single-step retriever for a small iterative loop (2–3 turns, T≈50) and measure accuracy vs latency.

Tune retrieved K to 50–100 and rerank a small top-R (≈10) rather than reranking huge candidate sets.

Agent Features

Memory

  • structural_memory
  • mixed_memory
  • chunks
  • knowledge_triples
  • atomic_facts
  • summaries

Tool Use

  • retriever
  • LLM reranker
  • document fetch (Memory-Doc)

Frameworks

  • LangChain

Is Agentic

true

Architectures

  • LLM-based agent

Optimization Features

Token Efficiency

  • summary compression
  • chunking to limit token window
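A simple illustration of keeping retrieved memories inside a token budget, assuming the memories arrive ranked best first. Whitespace word count is used as a crude stand-in for a real tokenizer, and `pack_within_budget` is a hypothetical helper rather than anything from the paper.

```python
from typing import Callable, List


def pack_within_budget(
    ranked_memories: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily keep memories in rank order while they fit the budget."""
    selected: List[str] = []
    used = 0
    for memory in ranked_memories:
        cost = count_tokens(memory)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; try later, smaller items
        selected.append(memory)
        used += cost
    return selected
```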

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Experiments cover only multi-hop QA, single-hop QA, dialogue understanding, and reading comprehension.
  • Noise robustness tests use random noise documents only, not adversarial or contradictory noise.
  • Hyperparameter ranges (K, R, T, N) were limited by compute resources.

When Not To Use

  • In domains not tested here (self-evolving agents, social simulations), where the findings may not generalize.
  • When noise is adversarial or specifically contradictory (not evaluated).
  • If you lack the compute budget for multiple iterative retrieval rounds and for large retriever or reranker calls.

Failure Modes

  • Retrieving too many candidates (very large K/R/T) can add irrelevant text and drop accuracy.
  • Iterative retrieval gives diminishing returns after ~3 iterations while increasing cost.
  • Mixed memories add storage and indexing complexity; poor prompt design can produce low-quality triples/facts.

Core Entities

Models

  • GPT-4o-mini-128k
  • text-embedding-3-small

Metrics

  • Exact Match
  • F1
  • Accuracy
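For reference, Exact Match and token-level F1 for QA are commonly computed as in the sketch below. The normalization (lowercasing, dropping punctuation and English articles) follows the widely used SQuAD-style convention; whether the paper uses exactly these steps is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit for answers that overlap the reference, which is why it is usually reported alongside the stricter Exact Match.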

Datasets

  • HotPotQA
  • 2WikiMultihopQA
  • MuSiQue
  • NarrativeQA
  • LoCoMo
  • QuALITY

Benchmarks

  • long-context QA
  • multi-hop QA
  • reading comprehension
  • dialogue understanding

Context Entities

Benchmarks

  • Retrieval-Augmented Generation (RAG)