Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Memory systems let AI keep facts current, personalize across sessions, and reduce costly retraining; pick simple vector RAG for factual QA and reserve heavy platforms for offline analytics or research.
Summary TLDR
This is a broad, engineering-focused survey of memory for large language models and multimodal agents. It organizes prior work into three families: implicit memory (knowledge stored in model weights), explicit memory (external stores and retrieval/RAG), and agentic memory (persistent short- and long-term memory for agents). It reviews architectures, training recipes, benchmarks, tools, and open problems (scaling laws, hallucination, retrieval contamination, and latency vs. accuracy trade-offs). The paper includes concrete comparisons on LongMemEval showing that simple vector DBs deliver large factual gains, while full-featured memory platforms often add expensive ingestion costs.
Problem Statement
Modern LLMs need memory to adapt, stay up-to-date, and act over long interactions. The field lacks a unified map of what memory means, how it is represented (weights, vectors, graphs), how to efficiently train and edit it, how to evaluate agentic memory, and how to balance accuracy, latency, and contamination risk.
Main Contribution
A unified taxonomy of memory in (M)LLMs: implicit (weights), explicit (external stores/RAG), and agentic (persistent agent memory).
A survey of mechanisms to analyze, edit, and unlearn implicit (parametric) memory, including ROME, MEMIT, and related methods.
A review of explicit-memory engineering: document/vector/graph representations, training pipelines (pretrain/finetune), and retrieval-augmented methods (RAG, RETRO, Memory3).
A practical evaluation summary and toolkit comparison on LongMemEval showing accuracy vs. latency trade-offs across memory frameworks (ChromaDB, Haystack, LlamaIndex, Mem0, Zep).
A focused section on multimodal and robotics memory needs and open problems for long, time-series, and embodied contexts.
Key Findings
Memory in LLMs is usefully grouped into three families: implicit (weights), explicit (external stores), and agentic (persistent agent memory).
Parametric models have limited factual capacity; theory and experiments estimate roughly '2 bits of knowledge per parameter' and fact memorization scales poorly.
Retrieval-augmented setups give large factual accuracy gains on long-context QA versus no memory.
Full-featured memory platforms can dramatically increase ingestion and end-to-end latency without proportional accuracy gains.
Multi-session (cross-session) reasoning is the hardest memory task across frameworks.
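As a back-of-the-envelope reading of the "2 bits of knowledge per parameter" estimate in the findings above, the sketch below converts a parameter count into an approximate ceiling on stored knowledge. The function name and the 8-billion-parameter example are illustrative assumptions, not figures reported by the survey.

```python
BITS_PER_PARAM = 2  # estimate quoted in the key findings


def knowledge_capacity_gb(num_params: float,
                          bits_per_param: float = BITS_PER_PARAM) -> float:
    """Approximate knowledge capacity in gigabytes (1 GB = 1e9 bytes)."""
    total_bits = num_params * bits_per_param
    return total_bits / 8 / 1e9


# Hypothetical 8-billion-parameter model: 16e9 bits, i.e. about 2 GB.
print(knowledge_capacity_gb(8e9))  # 2.0
```

Under this estimate, even a multi-billion-parameter model holds only a few gigabytes of facts, which is one way to see why external stores are attractive for large or fast-changing corpora.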
Results
Accuracy
Who Should Care
What To Try In 7 Days
Add a vector DB (Chroma/FAISS) and test RAG on a critical QA flow
Measure end-to-end ingestion latency for any chosen memory platform
Run a 'needle-in-haystack' test for your data to estimate recall under noise
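The last two items above can be prototyped before committing to a platform. The following is a minimal sketch, assuming a toy bag-of-words "embedding" and an in-memory store in place of a real embedding model and vector DB. `ToyVectorStore`, `embed`, and `needle_recall` are hypothetical names introduced here, not APIs of Chroma, FAISS, or any framework named in the survey.

```python
import math
import random
import time
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self):
        self.docs = []  # list of (text, vector) pairs

    def ingest(self, texts):
        """Index the texts and return elapsed ingestion time in seconds."""
        start = time.perf_counter()
        for t in texts:
            self.docs.append((t, embed(t)))
        return time.perf_counter() - start

    def search(self, query, k=3):
        q = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def needle_recall(needle, query, distractors, k=3, trials=20, seed=0):
    """Fraction of trials in which the needle appears in the top-k results."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        haystack = distractors + [needle]
        rng.shuffle(haystack)  # vary the needle's position each trial
        store = ToyVectorStore()
        store.ingest(haystack)
        hits += needle in store.search(query, k)
    return hits / trials


distractors = [f"invoice {i} was paid on time by customer {i}" for i in range(200)]
needle = "the refund policy allows returns within 45 days"
query = "how many days does the refund policy allow for returns"

print("recall@3:", needle_recall(needle, query, distractors))
print("ingestion seconds:", ToyVectorStore().ingest(distractors + [needle]))
```

With real data, swap in your own documents as distractors and use distractors that share vocabulary with the query; this toy set has none, so recall here is trivially perfect. The same `ingest` timing wrapper can be placed around any platform's ingestion call to compare end-to-end latency.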
Agent Features
Memory
- short-term context window (STM)
- long-term external memory (LTM)
- episodic trajectories and feedback stores
Planning
- retrieve-then-plan
- reflection loops (Self-RAG/RAM)
- retrieval-augmented planning (RAP)
Tool Use
- external retriever
- vector DB API
- knowledge graph queries
- tool invocation in ReAct
Frameworks
- LangChain
- LlamaIndex
- Haystack
- Mem0
- Zep
Is Agentic
true
Architectures
- parametric (weights)
- non-parametric vector DB
- graph-based memory
- key-value FFN memory (implicit)
Collaboration
- shared vector pools
- hierarchical team graphs (A-MEM style)
- message-based memory exchange (IoA)
Optimization Features
Token Efficiency
- chunking and summarization
- selective top-k retrieval to reduce context tokens
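The two token-efficiency techniques above can be sketched as follows. This is an illustrative sketch only: word counts stand in for tokens (a real tokenizer would give different numbers), and `chunk_words` and `select_within_budget` are hypothetical helper names, not functions from any library in the survey.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10):
    """Split text into overlapping word-window chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def select_within_budget(scored_chunks, k: int = 3, max_tokens: int = 120):
    """Keep up to k highest-scoring chunks whose total word count fits the budget.

    scored_chunks is an iterable of (score, chunk_text) pairs; the score would
    normally come from a retriever (assumed here, not computed).
    """
    selected, used = [], 0
    for _, chunk in sorted(scored_chunks, key=lambda sc: sc[0], reverse=True):
        if len(selected) == k:
            break
        n = len(chunk.split())
        if used + n <= max_tokens:  # skip chunks that would overflow the budget
            selected.append(chunk)
            used += n
    return selected


text = " ".join(f"w{i}" for i in range(100))
chunks = chunk_words(text, chunk_size=50, overlap=10)
print(len(chunks))  # 3 chunks, starting at words 0, 40, and 80
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk, while the budgeted top-k caps how many context tokens the retrieved passages can consume.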
Infra Optimization
- use of FAISS/ANN libraries for fast retrieval
- cloud-managed vector DBs for persistence (Pinecone/Milvus)
Model Optimization
- LoRA
- targeted layer edits (ROME/MEMIT)
System Optimization
- hierarchical storage (recent in RAM, archive in vector DB)
- index persistence and incremental ingestion
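A minimal sketch of the hierarchical-storage idea above, assuming a bounded in-RAM buffer for recent items and a plain list standing in for the archived vector DB collection. `TieredMemory` is a hypothetical class, and keyword matching is used here only as a placeholder for similarity search.

```python
from collections import deque


class TieredMemory:
    """Recent items stay in RAM; older items spill to an archive tier."""

    def __init__(self, recent_capacity: int = 3):
        self.recent = deque()
        self.recent_capacity = recent_capacity
        self.archive = []  # stand-in for a persistent vector DB collection

    def add(self, item: str) -> None:
        self.recent.append(item)
        while len(self.recent) > self.recent_capacity:
            # Incremental ingestion: only the evicted item is written to the
            # archive, so the existing index is never rebuilt from scratch.
            self.archive.append(self.recent.popleft())

    def lookup(self, keyword: str):
        """Search the recent tier first, then fall back to the archive."""
        kw = keyword.lower()
        hits = [m for m in self.recent if kw in m.lower()]
        if hits:
            return "recent", hits
        return "archive", [m for m in self.archive if kw in m.lower()]


mem = TieredMemory(recent_capacity=2)
for turn in ["user likes tea", "user lives in Oslo", "user asked about refunds"]:
    mem.add(turn)

print(mem.lookup("refunds"))  # served from the recent tier
print(mem.lookup("tea"))      # served from the archive tier
```

Checking the small recent tier first keeps the common case cheap, and the slower archive lookup is paid only when recent context has no match.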
Training Optimization
- retrieval-augmented pretraining (RETRO, InstructRetro)
- joint retriever-generator pretraining (Atlas-style)
Inference Optimization
- chunked retrieval + top-k filtering
- Fusion-in-Decoder (FiD) to parallelize passage encoding
Reproducibility
Data Urls
- LongMemEval (referenced dataset)
- https://github.com/gkamradt/LLMTest_NeedleInAHaystack
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey synthesizes many papers but does not deliver a single unified evaluation protocol for all memory types.
- Reported benchmark numbers depend on specific engines (Llama3-8B-IT, GPT-4o-mini) and dataset sampling; results may vary with other LLMs.
- Practical trade-offs (latency, cost, ingestion) are dataset- and infra-dependent; treat numbers as illustrative.
When Not To Use
- Don't rely solely on parametric memory for keeping facts up-to-date.
- Avoid heavyweight memory platforms if your latency or compute budget cannot absorb slow ingestion.
- Do not assume single-session solutions generalize to multi-session reasoning without tailored indexing and summarization.
Failure Modes
- Memory contamination: storing irrelevant or incorrect records that cause hallucination.
- Distribution shift between cached memory representations and updated model parameters, causing retrieval mismatch.
- Latency spikes during ingestion that break real-time SLAs.
Core Entities
Models
- RETRO
- RETRO++
- RAG
- Memory3
- MemTRM
- Unlimiformer
- Llama3-8B-IT
- GPT-4o-mini
- ROME
- MEMIT
Metrics
- Accuracy
- Latency (s)
- Recall / Precision (memory IE)
- Multi-Session Reasoning (MS)
- Temporal Reasoning (TR)
- Knowledge Update (KU)
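For the recall and precision metrics listed above, the standard set-based definitions can be sketched as below. How the survey's benchmarks compute them exactly is not stated here, so treat this as the generic formulation; the memory IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall over memory record IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 records retrieved, 3 actually relevant, 2 in common.
p, r = precision_recall(["m1", "m2", "m3", "m4"], ["m1", "m3", "m7"])
print(p, r)  # precision 0.5, recall 2/3
```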
Datasets
- LongMemEval
- Longmemeval_s_cleaned
- NarrativeQA
- QuALITY
- LooGLE
- RetrievalQA
Benchmarks
- LongMemEval
- KnowEdit
- MQuAKE
- Eva-KELLM
- NeedleInAHaystack

