A practical survey of memory in LLMs: implicit weights, external retrieval, and agent memory

Overview

Decision Snapshot: Ready For Pilot

This survey integrates prior work and practical evaluations; use it as an operational map but validate tool choices on your own latency, ingestion, and multi-session tests.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/2

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu

Why It Matters For Business

Memory systems let AI keep facts current, personalize across sessions, and reduce costly retraining; pick simple vector RAG for factual QA and reserve heavy platforms for offline analytics or research.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

This is a broad, engineering-focused survey of memory for large language models and multimodal agents. It organizes prior work into three families: implicit memory (knowledge inside weights), explicit memory (external stores and retrieval/RAG), and agentic memory (persistent short- and long-term memory for agents). It reviews architectures, training recipes, benchmarks, tools, and open problems (scaling laws, hallucination, retrieval contamination, and latency vs. accuracy trade-offs). The paper includes concrete comparisons on LongMemEval: a simple vector DB (ChromaDB) raises overall accuracy from 0.010 to 0.600, while a full-featured memory platform (Mem0) reaches similar accuracy only at a much higher ingestion and inference cost (≈2106s avg versus ≈1.00s for Haystack).

Problem Statement

Modern LLMs need memory to adapt, stay up-to-date, and act over long interactions. The field lacks a unified map of what memory means, how it is represented (weights, vectors, graphs), how to efficiently train and edit it, how to evaluate agentic memory, and how to balance accuracy, latency, and contamination risk.

Main Contribution

A unified taxonomy of memory in (M)LLMs: implicit (weights), explicit (external stores/RAG), and agentic (persistent agent memory).

A survey of mechanisms to analyze, edit, and unlearn implicit (parametric) memory, including ROME, MEMIT, and related methods.

Key Findings

Memory in LLMs is usefully grouped into three families: implicit (weights), explicit (external stores), and agentic (persistent agent memory).

Practical Use: Design systems by combining these families: keep stable world knowledge in parameters, use RAG/vector stores for up-to-date facts, and add agentic memory for session continuity and personalization.

Evidence Ref: Survey taxonomy (§1–§4, Figure 1)

Parametric models have limited factual capacity; theory and experiments estimate roughly '2 bits of knowledge per parameter' and fact memorization scales poorly.

Numbers: 2 bits per parameter (Allen‑Zhu & Li 2024a)

Practical Use: Do not rely on weights alone to store all public facts; add explicit retrieval or editing pipelines for coverage and fresh facts.

Evidence Ref: §2.1 Scaling Law of Knowledge Memorization
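
As a back-of-envelope illustration of the estimate above, the sketch below turns a parameter count into a rough knowledge-capacity ceiling. Only the 2 bits/parameter figure comes from the survey; the 7B model size and the 100 bits-per-fact encoding cost are illustrative assumptions.

```python
def knowledge_capacity_bits(n_params: float, bits_per_param: float = 2.0) -> float:
    """Upper bound on stored knowledge, using the ~2 bits/parameter estimate."""
    return n_params * bits_per_param


def max_facts(n_params: float, bits_per_fact: float, bits_per_param: float = 2.0) -> int:
    """How many facts fit if each fact costs `bits_per_fact` bits (assumed value)."""
    return int(knowledge_capacity_bits(n_params, bits_per_param) // bits_per_fact)


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model
    bits = knowledge_capacity_bits(n_params)
    print(f"capacity: {bits:.2e} bits = {bits / 8 / 1e9:.2f} GB")
    print(f"facts at 100 bits each: {max_facts(n_params, 100):,}")
```

Under these assumptions a 7B model tops out at roughly 1.75 GB of knowledge, which is why a corpus larger than that has to live in an external store rather than in the weights.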

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	ChromaDB 0.600 overall, 6.41s avg	No Memory 0.010 overall, 1.16s avg	≈+0.590 accuracy	LongMemEval (longmemeval_s_cleaned)	Table 5: GPT-4o-mini results across frameworks	§4.4 Table 5
Accuracy	Mem0 overall 0.602 at ≈2106s avg	Haystack overall 0.630 at ≈1.00s avg	≈-0.028 accuracy, +2105s latency	LongMemEval	Table 5: ingestion and inference dominated Mem0 runtime	§4.4 Table 5
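
The deltas in the table can be recomputed with a few lines of Python. The accuracy and latency values below are copied from the two rows above (the Mem0 and Haystack latencies are approximate in the source); the `Run` class and `delta` helper are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    accuracy: float   # overall accuracy on LongMemEval
    latency_s: float  # average latency in seconds, as reported


def delta(system: Run, baseline: Run) -> tuple[float, float]:
    """Return (accuracy delta, latency delta in seconds) of system vs. baseline."""
    return (
        round(system.accuracy - baseline.accuracy, 3),
        round(system.latency_s - baseline.latency_s, 2),
    )


chroma = Run("ChromaDB", 0.600, 6.41)
no_memory = Run("No Memory", 0.010, 1.16)
mem0 = Run("Mem0", 0.602, 2106.0)      # latency is approximate in the source
haystack = Run("Haystack", 0.630, 1.00)

print(delta(chroma, no_memory))  # (0.59, 5.25)
print(delta(mem0, haystack))     # (-0.028, 2105.0)
```

Reading both numbers together makes the trade-off explicit: ChromaDB buys +0.59 accuracy for about 5 extra seconds, while Mem0 trails Haystack slightly in accuracy at about 2105 extra seconds.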

What To Try In 7 Days

Add a vector DB (Chroma/FAISS) and test RAG on a critical QA flow

Measure end-to-end ingestion latency for any chosen memory platform

Run a 'needle-in-haystack' test for your data to estimate recall under noise
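
The three experiments above can be prototyped before committing to a vector DB. The sketch below is a toy, dependency-free stand-in: `TinyVectorStore` and its bag-of-words `embed` are hypothetical placeholders for Chroma/FAISS and a real embedding model, and the distractors are synthetic tokens rather than your own data. It times ingestion and reports how often each planted "needle" fact appears in the top-k results.

```python
import math
import random
import time
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """In-memory stand-in for a vector DB (brute-force search, no persistence)."""

    def __init__(self) -> None:
        self.docs: list[tuple[str, Counter]] = []

    def add(self, doc_id: str, text: str) -> None:
        self.docs.append((doc_id, embed(text)))

    def query(self, text: str, k: int = 3) -> list[str]:
        q = embed(text)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]


def needle_recall(needles: dict, n_distractors: int, k: int = 3, seed: int = 0):
    """Return (recall@k over the needles, ingestion time in seconds)."""
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(200)]  # synthetic noise vocabulary
    store = TinyVectorStore()
    start = time.perf_counter()
    for i in range(n_distractors):
        store.add(f"noise-{i}", " ".join(rng.choices(vocab, k=20)))
    for doc_id, (fact, _question) in needles.items():
        store.add(doc_id, fact)
    ingest_s = time.perf_counter() - start
    hits = sum(doc_id in store.query(q, k) for doc_id, (_fact, q) in needles.items())
    return hits / len(needles), ingest_s


needles = {
    "n1": ("the staging database password rotates every 90 days",
           "how often does the staging database password rotate"),
    "n2": ("invoice exports are generated at 02:00 utc",
           "when are invoice exports generated"),
}
recall, ingest_s = needle_recall(needles, n_distractors=500)
print(f"recall@3={recall:.2f}, ingestion={ingest_s:.4f}s")
```

Because the synthetic noise shares no words with the questions, this toy run is an easy case; for a meaningful estimate, swap in distractors sampled from your own corpus and the retriever you actually plan to deploy.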

Agent Features

Memory

short-term context window (STM)

long-term external memory (LTM)

episodic trajectories and feedback stores

Planning

retrieve-then-plan

reflection loops (self-RAG/RAM)

retrieval-augmented planning (RAP)

Tool Use

external retriever

vector DB API

knowledge graph queries

tool invocation in ReAct

Frameworks

LangChain, LlamaIndex, Haystack, Mem0, Zep

Is Agentic

Yes

Architectures

parametric (weights)

non-parametric vector DB

graph-based memory

key-value FFN memory (implicit)

Collaboration

shared vector pool

hierarchical team graphs (A‑MEM style)

message-based memory exchange (IoA)

Optimization Features

Token Efficiency

chunking and summarization

selective top-k retrieval to reduce context tokens
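
A minimal sketch of the two token-efficiency ideas above, chunking and top-k selection. The chunk size, overlap, and the lexical-overlap score are illustrative assumptions; a production system would typically score chunks with embedding similarity instead.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def top_k_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Keep only the k chunks sharing the most distinct words with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_terms & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Synthetic 120-word document: t0 t1 ... t119
document = " ".join(f"t{i}" for i in range(120))
chunks = chunk_words(document, size=50, overlap=10)
selected = top_k_chunks("t100 t101", chunks, k=1)
print(len(chunks), "chunks;", len(selected[0].split()), "words sent to the model")
```

Here only one 40-word chunk is placed in the prompt instead of the full 120-word document, which is the context-token saving that selective retrieval aims for.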

Infra Optimization

use of FAISS/ANN libraries for fast retrieval

cloud-managed vector DBs for persistence (Pinecone/Milvus)

Model Optimization

LoRA

targeted layer edits (ROME/MEMIT)

System Optimization

hierarchical storage (recent in RAM, archive in vector DB)

index persistence and incremental ingestion
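
One way to picture hierarchical storage with incremental ingestion is a two-tier memory: a small in-RAM buffer for recent records that spills its oldest entries into an archive. In this sketch the `TieredMemory` class is hypothetical, the archive is a plain list standing in for a vector DB, and matching uses word overlap rather than embeddings.

```python
from collections import deque


class TieredMemory:
    """Recent records stay in RAM; older ones are moved to an archive tier."""

    def __init__(self, recent_capacity: int = 3) -> None:
        self.capacity = recent_capacity
        self.recent: deque[str] = deque()
        self.archive: list[str] = []  # stand-in for a persistent vector DB

    def write(self, record: str) -> None:
        self.recent.append(record)
        # Incremental ingestion: evict one record at a time, oldest first.
        while len(self.recent) > self.capacity:
            self.archive.append(self.recent.popleft())

    def read(self, query: str, k: int = 2) -> list[str]:
        terms = set(query.lower().split())

        def score(record: str) -> int:
            return len(terms & set(record.lower().split()))

        # Hot tier is listed first; archive hits are ranked by overlap.
        hot = [r for r in self.recent if score(r) > 0]
        cold = sorted((r for r in self.archive if score(r) > 0), key=score, reverse=True)
        return (hot + cold)[:k]


memory = TieredMemory(recent_capacity=2)
for note in ["user prefers dark mode", "user lives in Oslo", "meeting moved to Friday"]:
    memory.write(note)

print(list(memory.recent))       # two newest notes
print(memory.archive)            # oldest note, evicted to the archive
print(memory.read("dark mode"))  # served from the archive tier
```

The design choice to evict one record at a time keeps each archive write small, which is the point of incremental ingestion compared with rebuilding the whole index.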

Training Optimization

retrieval-augmented pretraining (RETRO, InstructRetro)

joint retriever-generator pretraining (Atlas-style)

Inference Optimization

chunked retrieval + top-k filtering

Fusion-in-Decoder (FiD) to parallelize passage encoding

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/bigai-nlco/LLM-Memory-Survey

Data URLs

LongMemEval (referenced dataset)

https://github.com/gkamradt/LLMTest_NeedleInAHaystack

Risks & Boundaries

Limitations

Survey synthesizes many papers but does not deliver a single unified evaluation protocol for all memory types.

Reported benchmark numbers depend on specific engines (Llama3-8B-IT, GPT-4o-mini) and dataset sampling; results may vary with other LLMs.

When Not To Use

Don't rely solely on parametric memory for keeping facts up-to-date.

Avoid heavyweight memory platforms if your latency or compute budget cannot absorb long ingestion times.

Failure Modes

Memory contamination: storing irrelevant or incorrect records that cause hallucination.

Distribution shift between cached memory representations and updated model parameters causing retrieval mismatch.
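
The second failure mode can be made concrete with a small guard. The sketch below is one possible mitigation, not something the survey prescribes: each stored record carries the version of the encoder that produced its vector, and records from an older encoder are set aside for re-embedding instead of being searched. `MemoryRecord`, `split_by_encoder`, and the version strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    text: str
    vector: list[float]
    encoder_version: str  # version of the model that produced `vector`


def split_by_encoder(records: list[MemoryRecord], current_version: str):
    """Separate records that are safe to search from those needing re-embedding."""
    fresh = [r for r in records if r.encoder_version == current_version]
    stale = [r for r in records if r.encoder_version != current_version]
    return fresh, stale


records = [
    MemoryRecord("user prefers dark mode", [0.1, 0.9], "enc-v1"),
    MemoryRecord("user lives in Oslo", [0.7, 0.2], "enc-v2"),
]
fresh, stale = split_by_encoder(records, current_version="enc-v2")
print([r.text for r in fresh])  # searchable with the current encoder
print([r.text for r in stale])  # queue these for re-embedding
```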

Core Entities

Models

RETRO, RETRO++, RAG, Memory3, MemTRM, Unlimiformer, LongMemEval, Llama3-8B-IT, GPT-4o-mini, ROME, MEMIT

Metrics

Accuracy, Latency (s), Recall / Precision (memory IE), Multi-Session Reasoning (MS), Temporal Reasoning (TR), Knowledge Update (KU)

Datasets

LongMemEval, Longmemeval_s_cleaned, NarrativeQA, QuALITY, Loogle, RetrievalQA

Benchmarks

LongMemEval, KnowEdit, MQuAKE, Eva-KELLM, NeedleInAHaystack
