A practical survey of memory in LLMs: implicit weights, external retrieval, and agent memory

January 14, 2026 · 9 min

Overview

Decision Snapshot: Ready For Pilot

This survey integrates prior work and practical evaluations; use it as an operational map but validate tool choices on your own latency, ingestion, and multi-session tests.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/2

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu

Why It Matters For Business

Memory systems let AI keep facts current, personalize across sessions, and reduce costly retraining; pick simple vector RAG for factual QA and reserve heavy platforms for offline analytics or research.

Summary TLDR

This is a broad, engineering-focused survey of memory for large language models and multimodal agents. It organizes prior work into three families: implicit memory (knowledge inside weights), explicit memory (external stores and retrieval/RAG), and agentic memory (persistent short- and long-term memory for agents). It reviews architectures, training recipes, benchmarks, tools, and open problems (scaling laws, hallucination, retrieval contamination, and latency vs. accuracy trade-offs). On LongMemEval, the paper's comparisons show a simple vector DB (ChromaDB) raising overall accuracy from 0.010 with no memory to 0.600, while a full-featured memory platform (Mem0) reaches similar accuracy (0.602) at far higher ingestion and inference cost.

Problem Statement

Modern LLMs need memory to adapt, stay up-to-date, and act over long interactions. The field lacks a unified map of what memory means, how it is represented (weights, vectors, graphs), how to efficiently train and edit it, how to evaluate agentic memory, and how to balance accuracy, latency, and contamination risk.

Main Contribution

A unified taxonomy of memory in (M)LLMs: implicit (weights), explicit (external stores/RAG), and agentic (persistent agent memory).

A survey of mechanisms to analyze, edit, and unlearn implicit (parametric) memory, including ROME, MEMIT, and related methods.

Key Findings

Memory in LLMs is usefully grouped into three families: implicit (weights), explicit (external stores), and agentic (persistent agent memory).

Practical Use: Design systems by combining these families: keep stable world knowledge in parameters, use RAG/vector stores for up-to-date facts, and add agentic memory for session continuity and personalization.

Evidence Ref: Survey taxonomy (§1–§4, Figure 1)

Parametric models have limited factual capacity; theory and experiments estimate roughly '2 bits of knowledge per parameter' and fact memorization scales poorly.

Numbers: 2 bits per parameter (Allen-Zhu & Li 2024a)

Practical Use: Do not rely on weights alone to store all public facts; add explicit retrieval or editing pipelines for coverage and fresh facts.

Evidence Ref: §2.1 Scaling Law of Knowledge Memorization
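As a back-of-the-envelope illustration of the 2-bits-per-parameter estimate, the sketch below converts a parameter count into a rough knowledge budget. The function names and the 100-bits-per-fact figure are hypothetical choices made for illustration only; they are not taken from the survey.

```python
# Rough capacity arithmetic for the "2 bits of knowledge per parameter" estimate.
BITS_PER_PARAM = 2.0  # estimate cited in the survey (Allen-Zhu & Li 2024a)


def knowledge_capacity_bits(num_params: float,
                            bits_per_param: float = BITS_PER_PARAM) -> float:
    """Upper-bound knowledge budget, in bits, for a model of the given size."""
    return num_params * bits_per_param


def max_facts(num_params: float, bits_per_fact: float) -> int:
    """Facts that fit in the budget, given an ASSUMED per-fact cost in bits."""
    return int(knowledge_capacity_bits(num_params) // bits_per_fact)


# An 8B-parameter model (the size class of Llama3-8B-IT).
capacity = knowledge_capacity_bits(8e9)
print(capacity)            # 16000000000.0 bits
print(capacity / 8 / 1e9)  # 2.0 (gigabytes of stored knowledge)

# Hypothetical: if one fact costs about 100 bits, the ceiling is 160 million facts.
print(max_facts(8e9, bits_per_fact=100))  # 160000000
```

Under these assumptions the ceiling is small next to the volume of public facts, which is the argument for pairing weights with explicit retrieval.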

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | ChromaDB: 0.600 overall, 6.41 s avg | No Memory: 0.010 overall, 1.16 s avg | ≈ +0.590 accuracy | LongMemEval (longmemeval_s_cleaned) | Table 5: GPT-4o-mini results across frameworks | §4.4, Table 5
Accuracy | Mem0: 0.602 overall at ≈ 2106 s avg | Haystack: 0.630 overall at ≈ 1.00 s avg | ≈ -0.028 accuracy, +2105 s latency | LongMemEval | Table 5: ingestion and inference dominated Mem0 runtime | §4.4, Table 5
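The deltas in these rows can be re-derived from the reported values. The sketch below does that arithmetic; the `Run` record and `delta` helper are hypothetical names introduced here, and the Mem0 and Haystack latencies are the approximate figures given in the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One reported LongMemEval result (values copied from the rows above)."""
    name: str
    accuracy: float   # overall accuracy
    latency_s: float  # average latency in seconds


def delta(system: Run, baseline: Run) -> tuple[float, float]:
    """Return (accuracy delta, latency delta in seconds) of system vs. baseline."""
    return (round(system.accuracy - baseline.accuracy, 3),
            round(system.latency_s - baseline.latency_s, 2))


no_memory = Run("No Memory", 0.010, 1.16)
chroma = Run("ChromaDB", 0.600, 6.41)
mem0 = Run("Mem0", 0.602, 2106.0)      # latency reported as approximate
haystack = Run("Haystack", 0.630, 1.00)  # latency reported as approximate

print(delta(chroma, no_memory))  # (0.59, 5.25)
print(delta(mem0, haystack))     # (-0.028, 2105.0)
```

The first row's accuracy gain comes with roughly 5 s of added average latency, which the Delta column does not list.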

What To Try In 7 Days

Add a vector DB (Chroma/FAISS) and test RAG on a critical QA flow

Measure end-to-end ingestion latency for any chosen memory platform

Run a 'needle-in-haystack' test for your data to estimate recall under noise
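The three steps above can be prototyped before committing to a tool. The following is a minimal, standard-library-only sketch: `ToyVectorStore`, `embed`, and `needle_recall` are hypothetical stand-ins, and the bag-of-words embedding replaces the learned embeddings a real Chroma or FAISS pilot would use. It hides a few known facts among distractors, times ingestion, and reports recall@k.

```python
import math
import random
import re
import time
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (ASSUMPTION: stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector DB; brute-force search, no ANN index."""

    def __init__(self) -> None:
        self.docs: list[str] = []
        self.vecs: list[Counter] = []

    def add(self, docs: list[str]) -> None:
        for doc in docs:
            self.docs.append(doc)
            self.vecs.append(embed(doc))

    def query(self, question: str, k: int = 3) -> list[str]:
        qv = embed(question)
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: cosine(qv, self.vecs[i]), reverse=True)
        return [self.docs[i] for i in ranked[:k]]


def needle_recall(needles, n_distractors: int, k: int = 3, seed: int = 0):
    """Return (recall@k, ingestion seconds) for (question, fact) pairs hidden in noise."""
    rng = random.Random(seed)
    filler = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "elit"]
    distractors = [" ".join(rng.choices(filler, k=8)) for _ in range(n_distractors)]
    store = ToyVectorStore()
    start = time.perf_counter()
    store.add(distractors + [fact for _, fact in needles])
    ingest_s = time.perf_counter() - start
    hits = sum(fact in store.query(question, k) for question, fact in needles)
    return hits / len(needles), ingest_s


NEEDLES = [
    ("What is the refund window?", "The refund window is 30 days."),
    ("What is the support email?", "The support email is help@example.com."),
]
recall, ingest_s = needle_recall(NEEDLES, n_distractors=200, k=1)
print(f"recall@1={recall:.2f}, ingestion={ingest_s:.4f}s")
```

Recall is perfect here only because the filler text shares no vocabulary with the needles. For a meaningful estimate, replace the distractors with your own documents and the toy store with the platform under evaluation, keeping the same timing and recall measurements.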

Agent Features

Memory: short-term context window (STM); long-term external memory (LTM); episodic trajectories and feedback stores
Planning: retrieve-then-plan; reflection loops (self-RAG/RAM); retrieval-augmented planning (RAP)
Tool Use: external retriever; vector DB API; knowledge graph queries; tool invocation in ReAct
Frameworks: LangChain; LlamaIndex; Haystack; Mem0; Zep
Is Agentic: Yes
Architectures: parametric (weights); non-parametric vector DB; graph-based memory; key-value FFN memory (implicit)
Collaboration: shared vector pool; hierarchical team graphs (A-MEM style); message-based memory exchange (IoA)

Optimization Features

Token Efficiency: chunking and summarization; selective top-k retrieval to reduce context tokens
Infra Optimization: use of FAISS/ANN libraries for fast retrieval; cloud-managed vector DBs for persistence (Pinecone/Milvus)
Model Optimization: LoRA; targeted layer edits (ROME/MEMIT)
System Optimization: hierarchical storage (recent in RAM, archive in vector DB); index persistence and incremental ingestion
Training Optimization: retrieval-augmented pretraining (RETRO, InstructRetro); joint retriever-generator pretraining (Atlas-style)
Inference Optimization: chunked retrieval + top-k filtering; Fusion-in-Decoder (FiD) to parallelize passage encoding
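
The hierarchical-storage and incremental-ingestion ideas listed above can be sketched as a two-tier store. This is a minimal illustration under stated assumptions: `TieredMemory` and `recent_capacity` are hypothetical names, the archive is a plain list standing in for a vector DB, and keyword overlap stands in for embedding similarity.

```python
from collections import deque


class TieredMemory:
    """Recent records stay in a small RAM tier; older ones move to an archive."""

    def __init__(self, recent_capacity: int = 3) -> None:
        self.recent_capacity = recent_capacity
        self.recent: deque[str] = deque()  # hot tier (RAM)
        self.archive: list[str] = []       # ASSUMPTION: stand-in for a vector DB

    def add(self, record: str) -> None:
        """Insert a record; evict the oldest hot records into the archive."""
        self.recent.append(record)
        while len(self.recent) > self.recent_capacity:
            # Incremental ingestion: one record at a time, no full re-index.
            self.archive.append(self.recent.popleft())

    def recall(self, query: str, k: int = 2) -> list[str]:
        """Check the hot tier first; fall back to the archive only if needed."""
        terms = set(query.lower().split())

        def overlap(record: str) -> int:
            return len(terms & set(record.lower().split()))

        hits = [r for r in self.recent if overlap(r) > 0]
        if len(hits) < k:
            archived = sorted((r for r in self.archive if overlap(r) > 0),
                              key=overlap, reverse=True)
            hits.extend(archived[: k - len(hits)])
        return hits[:k]


mem = TieredMemory(recent_capacity=2)
for turn in ["user likes green tea", "user lives in Oslo",
             "user asked about invoices", "user prefers email contact"]:
    mem.add(turn)

print(list(mem.recent))               # the two newest records
print(mem.archive)                    # the two oldest records
print(mem.recall("green tea", k=1))   # ['user likes green tea'], served from the archive
```

Serving the hot tier first keeps common lookups off the slower store, so archive latency is paid only on a miss.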

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Survey synthesizes many papers but does not deliver a single unified evaluation protocol for all memory types.

Reported benchmark numbers depend on specific engines (Llama3-8B-IT, GPT-4o-mini) and dataset sampling; results may vary with other LLMs.

When Not To Use

Don't rely solely on parametric memory for keeping facts up-to-date.

Avoid heavyweight memory platforms if your CPU/GPU budget cannot tolerate large ingestion latency.

Failure Modes

Memory contamination: storing irrelevant or incorrect records that cause hallucination.

Distribution shift: cached memory representations fall out of sync with updated model parameters, causing retrieval mismatch.

Core Entities

Models

RETRO, RETRO++, RAG, Memory3, MemTRM, Unlimiformer, LongMemEval, Llama3-8B-IT, GPT-4o-mini, ROME, MEMIT

Metrics

Accuracy, Latency (s), Recall / Precision (memory IE), Multi-Session Reasoning (MS), Temporal Reasoning (TR), Knowledge Update (KU)

Datasets

LongMemEval, longmemeval_s_cleaned, NarrativeQA, QuALITY, LooGLE, RetrievalQA

Benchmarks

LongMemEval, KnowEdit, MQuAKE, Eva-KELLM, NeedleInAHaystack