Overview
The paper runs controlled experiments on six datasets with a real LLM and reports clear numeric gains, but the results are limited to selected QA and dialogue benchmarks and to constrained hyperparameter ranges.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
Choosing the right memory format and retriever raises agent accuracy and robustness on long documents. Mixed memory plus iterative retrieval helps in multi-hop and noisy scenarios, and tuning the retrieval size keeps cost under control.
Summary TLDR
This paper tests four structured memory formats (chunks, knowledge triples, atomic facts, summaries) plus a mixed combination of all four, across three retrieval methods (single-step, reranking, iterative), on six long-context QA and dialogue datasets. Main takeaways: mixed memory is the most balanced and noise-robust; iterative retrieval usually gives the biggest accuracy gains; chunks and summaries work best for long-context tasks, while triples and atomic facts give better relational precision. Mixed memory with iterative retrieval reached F1 = 82.11% on HotPotQA and 68.15% on 2WikiMultihopQA. Code and data are on GitHub.
Problem Statement
Different ways of structuring and retrieving memory for LLM agents are widely used, but we lack a systematic comparison showing which memory formats and retrievers work best for specific long-context tasks and how robust they are to noise.
Main Contribution
A controlled empirical study comparing four structural memory types and a mixed memory, across three retrieval methods and six datasets.
Practical findings: mixed memory yields the most balanced performance and noise resilience; iterative retrieval usually outperforms single-step and reranking.
Key Findings
Mixed memory (chunks + triples + atomic facts + summaries) gives the most balanced performance across tasks.
Iterative retrieval outperforms single-step retrieval and reranking on most evaluated datasets.
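As a rough illustration of what a mixed memory could look like in code, the sketch below pools the four memory formats into one searchable list. The names `MemoryItem`, `MEMORY_KINDS`, and `build_mixed_memory` are hypothetical, and the sketch assumes the triples, atomic facts, and summaries have already been extracted (in practice, typically by an LLM); it is not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

# The four structural memory formats named in the summary (labels are assumptions).
MEMORY_KINDS = ("chunk", "triple", "atomic_fact", "summary")


@dataclass(frozen=True)
class MemoryItem:
    kind: str  # one of MEMORY_KINDS
    text: str  # the memory content rendered as plain text


def build_mixed_memory(by_kind: Dict[str, List[str]]) -> List[MemoryItem]:
    """Pool pre-extracted memories of every kind into a single list."""
    unknown = set(by_kind) - set(MEMORY_KINDS)
    if unknown:
        raise ValueError(f"unknown memory kinds: {sorted(unknown)}")
    return [
        MemoryItem(kind, text)
        for kind in MEMORY_KINDS
        for text in by_kind.get(kind, [])
    ]


mixed = build_mixed_memory({
    "chunk": ["Alice joined Acme in 2019 as an engineer."],
    "triple": ["(Alice, works_at, Acme)"],
    "summary": ["Alice is an Acme engineer."],
})
```

A retriever can then score every `MemoryItem.text` uniformly, which is one simple way to let coarse formats (chunks, summaries) and fine-grained ones (triples, atomic facts) compete for the same retrieval budget.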
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| F1 | 82.11% | — | — | HotPotQA (mixed memory + iterative retrieval) | Mixed memory + iterative retrieval achieved F1=82.11% | Table 1 |
| F1 | 68.15% | — | — | 2WikiMultihopQA (mixed memory + iterative retrieval) | Mixed memory + iterative retrieval achieved F1=68.15% | Table 1 |
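For readers reproducing the F1 numbers above, the sketch below shows a token-overlap F1 of the kind commonly used for extractive QA. It assumes whitespace tokenization and lowercasing only; the paper's exact normalization (punctuation and article stripping, for instance) is not stated in this summary, so treat `token_f1` as a hypothetical helper rather than the official scorer.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the eiffel tower", "eiffel tower")  # precision 2/3, recall 1
```

Dataset-level F1 is then usually the mean of the per-question scores.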
What To Try In 7 Days
Implement mixed memory (chunks + triples + atomic facts + summaries) for a key QA pipeline and compare F1 against your current memory store.
Swap the single-step retriever for a small iterative loop (2–3 turns, T≈50) and measure accuracy against latency.
Tune the number of retrieved memories K to 50–100 and rerank only a small top-R (≈10) rather than reranking huge candidate sets.
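The retrieval experiments above can be prototyped with a sketch like the following. It uses a toy token-overlap scorer as a stand-in for an embedding retriever, and a naive query refinement (appending retrieved text to the query) as a stand-in for LLM-driven query rewriting. All function names are hypothetical, and the parameters `k`, `r`, and `turns` only loosely mirror the K, R, and turn count mentioned above.

```python
from typing import Callable, List, Optional


def overlap_score(query: str, text: str) -> int:
    """Toy relevance: number of distinct query tokens that also occur in text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, memory: List[str], k: int) -> List[str]:
    """Single-step retrieval: the k highest-scoring memories (ties keep input order)."""
    return sorted(memory, key=lambda m: overlap_score(query, m), reverse=True)[:k]


def rerank(query: str, candidates: List[str], r: int,
           scorer: Callable[[str, str], float] = overlap_score) -> List[str]:
    """Re-score a small candidate set with a (possibly costlier) scorer; keep top r."""
    return sorted(candidates, key=lambda m: scorer(query, m), reverse=True)[:r]


def iterative_retrieve(query: str, memory: List[str], k: int, turns: int,
                       refine: Optional[Callable[[str, List[str]], str]] = None) -> List[str]:
    """Retrieve for several turns, refining the query with what was found so far."""
    # Assumed default refinement: append the retrieved text to the original query.
    refine = refine or (lambda q, hits: q + " " + " ".join(hits))
    collected: List[str] = []
    current = query
    for _ in range(turns):
        remaining = [m for m in memory if m not in collected]
        collected.extend(retrieve(current, remaining, k))
        current = refine(query, collected)
    return collected


memory = ["alice works at acme", "acme is based in berlin", "bob likes tea"]
question = "where does alice work"
one_shot = retrieve(question, memory, k=1)
two_hop = iterative_retrieve(question, memory, k=1, turns=2)
```

In this toy example the second turn surfaces the Acme location fact only because the first hit added "acme" to the query, which is the multi-hop behavior an iterative loop is meant to exploit. Note that the sketch always returns `k` items per turn even when nothing matches, so a real pipeline would likely add a score threshold.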
Agent Features
Memory
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Optimization Features
Token Efficiency
Reproducibility
Risks & Boundaries
Limitations
Experiments cover only multi-hop QA, single-hop QA, dialogue understanding, and reading comprehension.
Noise robustness tests use random noise documents only, not adversarial or contradictory noise.
When Not To Use
In domains not tested here (self-evolving agents, social simulations), where the findings may not generalize.
When noise is adversarial or specifically contradictory (not evaluated).
Failure Modes
Retrieving too many candidates (very large K/R/T) can add irrelevant text and drop accuracy.
Iterative retrieval gives diminishing returns after ~3 iterations while increasing cost.

