Overview
The paper combines a clear taxonomy with targeted experiments showing practical failure modes; recommendations are actionable but need further real-world tests.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 55%
Novelty: 65%
Why It Matters For Business
Memory systems add real operational cost and fragility; adopt external memory only for tasks that actually need it, and measure token and latency overhead before production.
Summary TLDR
This survey organizes memory-augmented generation (MAG) into four structural classes, then measures where current systems fail in practice. The main problems: many benchmarks are already solvable inside large context windows, lexical metrics (e.g., F1) mis-rank abstractive systems, weaker open-weight models corrupt structured writes, and structured memories add substantial latency and token costs. The paper proposes a Context Saturation Gap (Δ) test, recommends LLM-based semantic judging with multi-rubric checks, and calls for backbone-aware decoding plus asynchronous maintenance to make agentic memory practical.
Problem Statement
Memory-augmented agents promise long-term personalization and reasoning, but evaluation and system design lag behind. Benchmarks, metrics, model choice, and maintenance cost are often misaligned with real-world constraints, hiding when external memory actually helps and when it becomes an operational burden.
Main Contribution
A compact, structure-first taxonomy of agentic memory: Lightweight semantic, Entity-centric/personalized, Episodic/reflective, Structured/hierarchical.
Systematic empirical analysis of four failure axes: benchmark saturation, metric misalignment, backbone sensitivity, and maintenance/latency costs.
Key Findings
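Lexical metrics (F1) can reverse practical rankings versus semantic judgment: on LoCoMo, Nemori ranks first by F1 (0.502 vs 0.467 for MAGMA), while MAGMA ranks first under the semantic judge (0.670).
Open-weight backbones produce more structured-format errors than API models: Qwen-2.5-3B vs gpt-4o-mini is 4.82% vs 1.20% on SimpleMem and 30.38% vs 17.91% on Nemori.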
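The paraphrase penalty behind the first finding can be illustrated with a minimal sketch of token-overlap F1. This is a simplified, SQuAD-style formulation and may differ from the exact metric used in the paper; the example sentences are hypothetical and are not drawn from LoCoMo.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Simplified token-overlap F1 (lowercased, whitespace-tokenized)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: both answers state the same fact.
reference = "She moved to Berlin in 2021"
extractive = "She moved to Berlin in 2021"
abstractive = "Her relocation to Berlin took place in 2021"

f1_extractive = token_f1(extractive, reference)
f1_abstractive = token_f1(abstractive, reference)
print(f1_extractive, round(f1_abstractive, 3))
```

The abstractive answer is equally correct but shares only four tokens with the reference, so its lexical score falls well below the verbatim answer. A semantic judge would be expected to score both as correct.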
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Lexical F1 vs Semantic Judge (ranking mismatch) | F1: Nemori 0.502 (rank 1) vs MAGMA 0.467; Semantic judge: MAGMA 0.670 (rank 1) | F1-only evaluation | Ranking order changed | LoCoMo (Section 4.3) | Table 3 shows F1 and semantic judge scores across systems | Table 3 |
| Format error rate by backbone | gpt-4o-mini: SimpleMem 1.20%, Nemori 17.91%; Qwen-2.5-3B: SimpleMem 4.82%, Nemori 30.38% | API model (gpt-4o-mini) | +3.62 (SimpleMem) to +12.47 (Nemori) percentage points on the open-weight model | Evaluation suite (Section 4.4) | Table 4 reports format error rates and answer scores | Table 4 |
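The backbone gap in the format-error row can be checked with simple arithmetic. The sketch below uses only the four percentages reported in that row; the dictionary layout and the helper name `backbone_delta` are illustrative choices, not part of the paper.

```python
# Format error rates (%) copied from the Table 4 row above.
format_error_pct = {
    "SimpleMem": {"gpt-4o-mini": 1.20, "Qwen-2.5-3B": 4.82},
    "Nemori": {"gpt-4o-mini": 17.91, "Qwen-2.5-3B": 30.38},
}

def backbone_delta(rates: dict, baseline: str, target: str) -> float:
    """Percentage-point difference in format error rate (target minus baseline)."""
    return round(rates[target] - rates[baseline], 2)

deltas = {
    system: backbone_delta(rates, "gpt-4o-mini", "Qwen-2.5-3B")
    for system, rates in format_error_pct.items()
}
print(deltas)
```

The differences are in percentage points, not relative change: the SimpleMem rate roughly quadruples even though its absolute increase is the smaller of the two.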
What To Try In 7 Days
Run a Context Saturation Gap (Δ) on your tasks: compare MAG vs full-context baseline.
Replace F1 with an LLM-based semantic judge and test ranking stability across 2–3 rubrics.
Measure format error rates on your target backbone; add JSON/schema validators or constrained decoding if high errors appear.
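The first step above can be sketched as follows. This summary does not give the paper's exact formula for Δ, so the code assumes Δ is the mean score of the memory-augmented system minus the mean score of a full-context baseline on the same questions. The scores and the `tolerance` threshold are hypothetical.

```python
from statistics import mean

def context_saturation_gap(mag_scores, full_context_scores):
    """Assumed definition: mean(MAG score) - mean(full-context score), paired by question."""
    if len(mag_scores) != len(full_context_scores):
        raise ValueError("scores must be paired per question")
    return mean(mag_scores) - mean(full_context_scores)

def memory_is_justified(gap: float, tolerance: float = 0.02) -> bool:
    """Hypothetical decision rule: memory helps only if the gap clearly exceeds zero."""
    return gap > tolerance

# Hypothetical per-question correctness (1 = correct, 0 = incorrect).
saturated_gap = context_saturation_gap([1, 1, 0, 1], [1, 1, 0, 1])
helpful_gap = context_saturation_gap([1, 1, 1, 0], [1, 0, 0, 0])
print(saturated_gap, helpful_gap)
```

A gap near zero suggests the task is already solvable inside the context window, so the added memory cost buys nothing on that workload.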
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Survey-level analysis uses selected representative systems; not an exhaustive empirical sweep.
LLM-as-a-judge depends on judge choice and still requires multi-rubric calibration.
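One way to probe judge sensitivity is to check whether system rankings stay the same across rubrics. The sketch below is a minimal illustration of that check; the rubric names, system names, and scores are all hypothetical, and the paper's own calibration procedure may differ.

```python
def ranking(scores: dict) -> list:
    """Systems ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def rankings_agree(per_rubric: dict) -> bool:
    """True when every rubric yields the same system ordering."""
    orders = {tuple(ranking(scores)) for scores in per_rubric.values()}
    return len(orders) == 1

# Hypothetical judge scores for three systems under three rubrics.
stable = {
    "correctness": {"system_a": 0.70, "system_b": 0.62, "system_c": 0.55},
    "completeness": {"system_a": 0.66, "system_b": 0.60, "system_c": 0.41},
    "faithfulness": {"system_a": 0.81, "system_b": 0.74, "system_c": 0.69},
}
unstable = {
    "correctness": {"system_a": 0.70, "system_b": 0.62, "system_c": 0.55},
    "completeness": {"system_a": 0.58, "system_b": 0.64, "system_c": 0.41},
}
print(rankings_agree(stable), rankings_agree(unstable))
```

When the orderings disagree, the ranking depends on the rubric, and a conclusion drawn from a single judge prompt should be treated with caution.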
When Not To Use
Tasks where all relevant info fits inside the model's context window (Δ ≈ 0).
Low-entity-diversity tasks with minimal temporal state to preserve.
Failure Modes
Silent write corruption: malformed JSON or hallucinated keys during memory updates.
Paraphrase penalty: correct abstractive answers receive low lexical scores.
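Silent write corruption can be turned into a loud failure by validating each memory update before it is stored. The sketch below uses only the standard-library `json` module; the three-field schema in `ALLOWED_KEYS` and the function name `parse_memory_write` are hypothetical and do not correspond to any system in the paper.

```python
import json

# Hypothetical schema for a memory write; real systems define their own fields.
ALLOWED_KEYS = {"entity", "attribute", "value"}

def parse_memory_write(raw: str) -> dict:
    """Parse a model-generated memory update, rejecting malformed or off-schema output."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc.msg}") from exc
    if not isinstance(record, dict):
        raise ValueError("memory write must be a JSON object")
    keys = set(record)
    missing = ALLOWED_KEYS - keys
    unexpected = keys - ALLOWED_KEYS  # keys the model invented
    if missing or unexpected:
        raise ValueError(
            f"schema violation: missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )
    return record

def is_valid_write(raw: str) -> bool:
    """Convenience wrapper for counting format errors across a batch of writes."""
    try:
        parse_memory_write(raw)
        return True
    except ValueError:
        return False

print(is_valid_write('{"entity": "Ana", "attribute": "city", "value": "Lisbon"}'))
print(is_valid_write('{"entity": "Ana", "attribute": "city", "value": "Lisbon"'))
```

Counting the share of writes that fail `is_valid_write` gives a format error rate for a chosen backbone, which can then inform whether constrained decoding is worth adding.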

