Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
Choosing the right memory format and retriever raises agent accuracy and robustness for long documents; mixed memories plus iterative retrieval improve multi-hop and noisy scenarios while tuning retrieval size controls cost.
Summary TLDR
This paper tests four structured memory formats (chunks, knowledge triples, atomic facts, summaries) plus a mixed combination across three retrieval methods (single-step, reranking, iterative) on six long-context QA and dialogue datasets. Main takeaways: mixed memory is the most balanced and noise-robust; iterative retrieval usually gives the biggest accuracy gains; chunks and summaries work best for long-context tasks, while triples and atomic facts give better relational precision. Mixed memory with iterative retrieval reached F1 = 82.11% on HotPotQA and 68.15% on 2WikiMultihopQA. Code and data are on GitHub.
Problem Statement
Different ways of structuring and retrieving memory for LLM agents are widely used, but we lack a systematic comparison showing which memory formats and retrievers work best for specific long-context tasks and how robust they are to noise.
Main Contribution
A controlled empirical study comparing four structural memory types and a mixed memory, across three retrieval methods and six datasets.
Practical findings: mixed memory yields the most balanced performance and noise resilience; iterative retrieval usually outperforms single-step and reranking.
Analysis of retrieval hyperparameters (K, R, T, N) and a short guideline on when to use Memory-Only vs Memory-Doc.
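A minimal sketch of the retrieve-then-rerank pattern, reading K as the number of memories retrieved and R as the number kept after reranking. Both scoring functions are hypothetical stand-ins: `overlap_score` replaces embedding similarity and `normalized_overlap` replaces an LLM reranker. Neither is the paper's implementation.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (illustrative only)."""
    return text.lower().split()


def overlap_score(query, memory):
    """Toy relevance: number of shared tokens (stand-in for embedding similarity)."""
    q, m = Counter(tokenize(query)), Counter(tokenize(memory))
    return sum((q & m).values())


def normalized_overlap(query, memory):
    """Toy reranker score: overlap divided by memory length (stand-in for an LLM reranker)."""
    return overlap_score(query, memory) / max(len(tokenize(memory)), 1)


def retrieve_top_k(query, memories, k):
    """Single-step retrieval: keep the K highest-scoring memories."""
    ranked = sorted(memories, key=lambda m: overlap_score(query, m), reverse=True)
    return ranked[:k]


def rerank_top_r(query, candidates, r, scorer=normalized_overlap):
    """Rerank the K candidates with a second scorer and keep the top R."""
    ranked = sorted(candidates, key=lambda m: scorer(query, m), reverse=True)
    return ranked[:r]
```

Keeping R much smaller than K means the more expensive second scorer only sees a short candidate list.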
Key Findings
Mixed memory (chunks + triples + atomic facts + summaries) gives the most balanced performance across tasks.
Iterative retrieval outperforms single-step retrieval and reranking on most evaluated datasets.
Chunks and summaries work better on tasks with very long contexts (reading comprehension, dialogue).
Knowledge triples and atomic facts excel at relational precision and multi-hop reasoning.
Mixed memory is more robust to added random noise documents than single memory types.
Retrieval hyperparameters have clear sweet spots: moderate values of K, R, and T, plus about 2–3 iterative turns, capture most of the gains.
Memory-Doc helps tasks needing broad document context; Memory-Only helps precision tasks.
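One way to picture the findings above is a flat pool that holds all four memory formats, with each item pointing back to its source document. The sketch below is an assumption-laden illustration: `MemoryItem`, `build_mixed_memory`, and both context builders are hypothetical names. It also assumes Memory-Only passes the retrieved memories themselves to the model, while Memory-Doc uses them to fetch their source documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    kind: str    # "chunk", "triple", "atomic_fact", or "summary"
    text: str    # the memory content used for retrieval
    doc_id: str  # id of the source document (hypothetical field)


def build_mixed_memory(chunks, triples, atomic_facts, summaries):
    """Merge four lists of (doc_id, text) pairs into one retrieval pool."""
    pool = []
    groups = (
        ("chunk", chunks),
        ("triple", triples),
        ("atomic_fact", atomic_facts),
        ("summary", summaries),
    )
    for kind, items in groups:
        pool.extend(MemoryItem(kind, text, doc_id) for doc_id, text in items)
    return pool


def memory_only_context(retrieved):
    """Memory-Only (assumed reading): answer from the retrieved memories directly."""
    return [m.text for m in retrieved]


def memory_doc_context(retrieved, documents):
    """Memory-Doc (assumed reading): answer from the source documents of the hits."""
    seen, context = set(), []
    for m in retrieved:
        if m.doc_id not in seen:  # fetch each source document once
            seen.add(m.doc_id)
            context.append(documents[m.doc_id])
    return context
```

Tagging each item with its `kind` keeps the option of filtering or weighting formats per task, for example favoring triples for relational questions.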
Results
Metrics shown: F1, Accuracy
Who Should Care
What To Try In 7 Days
Implement mixed memory (chunks+triples+atomic+summary) for a key QA pipeline and compare F1 vs current store.
Swap single-step retriever for a small iterative loop (2–3 turns, T≈50) and measure accuracy vs latency.
Tune retrieved K to 50–100 and rerank a small top-R (≈10) rather than reranking huge candidate sets.
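The iterative loop suggested above can be sketched as follows. This is a hedged illustration, not the paper's code: `n_turns` plays the role of the number of iterations, `top_t` is assumed to be the number of memories kept per turn, and `token_overlap` and `refine_by_concat` are hypothetical stand-ins for an embedding retriever and an LLM query rewriter. Skipping memories that were already collected is a design choice of this sketch.

```python
def token_overlap(query, memory):
    """Toy relevance: count of distinct shared tokens (stand-in for embeddings)."""
    return len(set(query.lower().split()) & set(memory.lower().split()))


def refine_by_concat(query, hits):
    """Toy query refinement: append the retrieved text (stand-in for an LLM rewrite)."""
    return " ".join([query, *hits])


def iterative_retrieve(query, memories, score=token_overlap,
                       refine=refine_by_concat, n_turns=3, top_t=1):
    """Retrieve top_t new memories per turn, refining the query between turns."""
    collected = []
    current = query
    for _ in range(n_turns):
        # Only consider memories not collected in earlier turns.
        candidates = [m for m in memories if m not in collected]
        if not candidates:
            break
        ranked = sorted(candidates, key=lambda m: score(current, m), reverse=True)
        hits = ranked[:top_t]
        collected.extend(hits)
        current = refine(current, hits)
    return collected


memories = [
    "Inception was directed by Christopher Nolan",
    "Christopher Nolan was born in London",
    "Bananas are yellow",
    "London is in England",
]
query = "Who directed Inception and where was that person born"
two_hop = iterative_retrieve(query, memories, n_turns=2, top_t=1)
```

Here the first turn finds the director and the refined query then surfaces the birthplace. Each extra turn adds another retrieval pass, which is where the latency cost comes from.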
Agent Features
Memory
- structural_memory
- mixed_memory
- chunks
- knowledge_triples
- atomic_facts
- summaries
Tool Use
- retriever
- LLM reranker
- document fetch (Memory-Doc)
Frameworks
- LangChain
Is Agentic
true
Architectures
- LLM-based agent
Optimization Features
Token Efficiency
- summary compression
- chunking to limit token window
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Experiments cover only multi-hop QA, single-hop QA, dialogue understanding, and reading comprehension.
- Noise robustness tests use random noise documents only, not adversarial or contradictory noise.
- Hyperparameter ranges (K, R, T, N) were limited by compute resources.
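For reference, a random-noise robustness test of the kind described above can be set up by mixing unrelated documents into the corpus before building memories. The function below is a hypothetical sketch of that setup. The name, the seed handling, and the shuffling step are assumptions, not the paper's procedure.

```python
import random


def inject_random_noise(documents, noise_pool, n_noise, seed=0):
    """Return the corpus plus n_noise randomly sampled unrelated documents.

    n_noise must not exceed len(noise_pool); random.Random.sample raises
    ValueError otherwise.
    """
    rng = random.Random(seed)      # local RNG so results are reproducible
    noise = rng.sample(noise_pool, n_noise)
    mixed = list(documents) + noise
    rng.shuffle(mixed)             # avoid noise always sitting at the end
    return mixed
```

Adversarial or contradictory noise would need a different generator, since random sampling rarely produces documents that conflict with the gold evidence.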
When Not To Use
- In domains not tested here (self-evolving agents, social simulations), since the findings may not generalize.
- When noise is adversarial or specifically contradictory (not evaluated).
- If you cannot afford the compute for iterative retrieval rounds or the cost of large retrievers and rerankers.
Failure Modes
- Retrieving too many candidates (very large K/R/T) can add irrelevant text and reduce accuracy.
- Iterative retrieval gives diminishing returns after ~3 iterations while increasing cost.
- Mixed memories add storage and indexing complexity; poor prompt design can produce low-quality triples/facts.
Core Entities
Models
- GPT-4o-mini-128k
- text-embedding-3-small
Metrics
- Exact Match
- F1
- Accuracy
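For readers comparing their own pipeline against these metrics, Exact Match and token-level F1 are commonly computed as in the sketch below, which follows the widely used SQuAD-style normalization (lowercase, strip punctuation and English articles). Whether the paper uses exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "Christopher Nolan" against the reference "Nolan" gives precision 0.5 and recall 1.0, so F1 is 2/3 while Exact Match is 0.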
Datasets
- HotPotQA
- 2WikiMultihopQA
- MuSiQue
- NarrativeQA
- LoCoMo
- QuALITY
Benchmarks
- long-context QA
- multi-hop QA
- reading comprehension
- dialogue understanding
Context Entities
Benchmarks
- Retrieval-Augmented Generation (RAG)

