Overview
This survey integrates prior work and practical evaluations; use it as an operational map, but validate tool choices with your own latency, ingestion, and multi-session tests.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/2
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Memory systems let AI keep facts current, personalize across sessions, and reduce costly retraining; pick simple vector RAG for factual QA and reserve heavy platforms for offline analytics or research.
Summary TLDR
This is a broad, engineering-focused survey of memory for large language models and multimodal agents. It organizes prior work into three families: implicit memory (knowledge inside weights), explicit memory (external stores and retrieval/RAG), and agentic memory (persistent short- and long-term memory for agents). It reviews architectures, training recipes, benchmarks, tools, and open problems (scaling laws, hallucination, retrieval contamination, and latency vs. accuracy trade-offs). The paper includes concrete comparisons on LongMemEval showing that a simple vector DB delivers a large accuracy gain over no memory, while full-featured memory platforms can add very large ingestion costs.
Problem Statement
Modern LLMs need memory to adapt, stay up-to-date, and act over long interactions. The field lacks a unified map of what memory means, how it is represented (weights, vectors, graphs), how to efficiently train and edit it, how to evaluate agentic memory, and how to balance accuracy, latency, and contamination risk.
Main Contribution
A unified taxonomy of memory in (M)LLMs: implicit (weights), explicit (external stores/RAG), and agentic (persistent agent memory).
A survey of mechanisms to analyze, edit, and unlearn implicit (parametric) memory, including ROME, MEMIT, and related methods.
Key Findings
Memory in LLMs is usefully grouped into three families: implicit (weights), explicit (external stores), and agentic (persistent agent memory).
Parametric models have limited factual capacity: theory and experiments estimate roughly 2 bits of knowledge per parameter, and fact memorization scales poorly.
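To make the "2 bits per parameter" estimate concrete, the sketch below turns a parameter count into a rough upper bound on stored knowledge. The 2-bit figure is the estimate quoted above; the 50 bits-per-fact value and the function names are illustrative assumptions, not numbers from the survey.

```python
# Back-of-envelope capacity estimate for parametric (implicit) memory.
# BITS_PER_PARAM is the rough estimate quoted in the findings above.
# bits_per_fact is a hypothetical value chosen only for illustration.
BITS_PER_PARAM = 2.0


def capacity_bits(num_params: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Approximate knowledge capacity of a model, in bits."""
    return num_params * bits_per_param


def max_facts(num_params: float, bits_per_fact: float,
              bits_per_param: float = BITS_PER_PARAM) -> int:
    """Upper bound on distinct facts, assuming a fixed cost per fact."""
    return int(capacity_bits(num_params, bits_per_param) // bits_per_fact)


if __name__ == "__main__":
    params = 8e9  # an 8B-parameter model
    gigabytes = capacity_bits(params) / 8 / 1e9
    print(f"capacity ~ {gigabytes:.1f} GB of knowledge")
    print(f"facts at 50 bits each ~ {max_facts(params, 50):,}")
```

Under these assumptions an 8B-parameter model holds on the order of 2 GB of knowledge, which is small next to a typical document store and is one reason to move fast-changing facts into explicit memory.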
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | ChromaDB 0.600 overall, 6.41s avg | No Memory 0.010 overall, 1.16s avg | ≈+0.590 accuracy | LongMemEval (longmemeval_s_cleaned) | Table 5: GPT-4o-mini results across frameworks | §4.4 Table 5 |
| Accuracy | Mem0 overall 0.602 at ≈2106s avg | Haystack overall 0.630 at ≈1.00s avg | ≈-0.028 accuracy, +2105s latency | LongMemEval | Table 5: ingestion and inference dominated Mem0 runtime | §4.4 Table 5 |
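The deltas in the table can be recomputed from the reported values. The sketch below does this and also derives a latency ratio, which the table does not list; the `Run` class and `compare` function are hypothetical helpers, and the Mem0 latency is the approximate figure from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One framework result: overall accuracy and average latency in seconds."""
    name: str
    accuracy: float
    latency_s: float


def compare(candidate: Run, baseline: Run) -> dict:
    """Accuracy and latency differences of candidate relative to baseline."""
    return {
        "accuracy_delta": round(candidate.accuracy - baseline.accuracy, 3),
        "latency_delta_s": round(candidate.latency_s - baseline.latency_s, 2),
        "latency_ratio": round(candidate.latency_s / baseline.latency_s, 1),
    }


# Values copied from the Results table (LongMemEval, GPT-4o-mini).
no_memory = Run("No Memory", 0.010, 1.16)
chroma = Run("ChromaDB", 0.600, 6.41)
haystack = Run("Haystack", 0.630, 1.00)
mem0 = Run("Mem0", 0.602, 2106.0)  # latency is approximate in the source

if __name__ == "__main__":
    print(compare(chroma, no_memory))
    print(compare(mem0, haystack))
```

Read this way, ChromaDB buys about +0.59 accuracy for roughly 5 extra seconds per query, while Mem0 trails Haystack slightly on accuracy at about 2,000 times the latency.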
What To Try In 7 Days
Add a vector DB (Chroma/FAISS) and test RAG on a critical QA flow
Measure end-to-end ingestion latency for any chosen memory platform
Run a 'needle-in-haystack' test for your data to estimate recall under noise
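A minimal sketch of the ingestion-latency and needle-in-haystack checks above, using only the standard library. `ToyStore`, `embed`, and `needle_recall` are hypothetical stand-ins: the bag-of-words embedding replaces a real embedding model, and the in-memory list replaces Chroma or FAISS. The synthetic distractors share no vocabulary with the queries, so this only exercises the harness; swap in your own documents and store to get a meaningful number.

```python
import math
import random
import time
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class ToyStore:
    """In-memory stand-in for a vector DB; not the Chroma or FAISS API."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def ingest(self, docs: list[str]) -> float:
        """Embed and store docs; return wall-clock ingestion time in seconds."""
        start = time.perf_counter()
        for doc in docs:
            self.items.append((doc, embed(doc)))
        return time.perf_counter() - start

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored docs most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]


def needle_recall(needles, n_distractors: int, k: int = 3, seed: int = 0):
    """Hide needle facts among random distractors; return (recall@k, ingest_s)."""
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(200)]
    docs = [" ".join(rng.choices(vocab, k=12)) for _ in range(n_distractors)]
    docs += [fact for _, fact in needles]
    rng.shuffle(docs)
    store = ToyStore()
    ingest_s = store.ingest(docs)
    hits = sum(fact in store.search(query, k) for query, fact in needles)
    return hits / len(needles), ingest_s


if __name__ == "__main__":
    needles = [
        ("what is the deploy region", "the deploy region is eu-west-1"),
        ("who owns the billing service", "the billing service is owned by team atlas"),
    ]
    recall, ingest_s = needle_recall(needles, n_distractors=1000)
    print(f"recall@3={recall:.2f} ingest={ingest_s:.4f}s")
```

Tracking recall and ingestion time together as the number of distractors grows gives an early read on whether a store degrades under noise or becomes too slow to ingest.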
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Survey synthesizes many papers but does not deliver a single unified evaluation protocol for all memory types.
Reported benchmark numbers depend on specific engines (Llama3-8B-IT, GPT-4o-mini) and dataset sampling; results may vary with other LLMs.
When Not To Use
Don't rely solely on parametric memory for keeping facts up-to-date.
Avoid heavyweight memory platforms if your latency or compute budget cannot absorb large ingestion overhead.
Failure Modes
Memory contamination: storing irrelevant or incorrect records that cause hallucination.
Distribution shift between cached memory representations and updated model parameters, causing retrieval mismatch.
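One possible guard against both failure modes, sketched under assumptions: tag each stored record with the version of the embedder that produced it and a confidence score, then triage before retrieval. The `MemoryRecord` fields, the `triage` function, and the 0.5 threshold are hypothetical and do not come from the survey or any specific memory platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    """A stored memory with hypothetical provenance metadata."""
    text: str
    embedder_version: str  # version of the model that produced the embedding
    confidence: float      # caller-assigned trust score in [0, 1]


def triage(records, current_version: str, min_confidence: float = 0.5) -> dict:
    """Split records into keep / reembed / review buckets.

    - review: low confidence, possible contamination
    - reembed: embedded by an older model, possible retrieval mismatch
    - keep: current embedder and sufficient confidence
    """
    buckets = {"keep": [], "reembed": [], "review": []}
    for record in records:
        if record.confidence < min_confidence:
            buckets["review"].append(record)
        elif record.embedder_version != current_version:
            buckets["reembed"].append(record)
        else:
            buckets["keep"].append(record)
    return buckets


if __name__ == "__main__":
    records = [
        MemoryRecord("user prefers dark mode", "v2", 0.9),
        MemoryRecord("user lives in Oslo", "v1", 0.8),
        MemoryRecord("user may like jazz", "v2", 0.2),
    ]
    for name, items in triage(records, current_version="v2").items():
        print(name, [r.text for r in items])
```

Only the `keep` bucket would be searched; `reembed` entries are recomputed with the current model, and `review` entries are held back until verified.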

