A practical survey of memory in LLMs: implicit weights, external retrieval, and agent memory

January 14, 2026 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu

Why It Matters For Business

Memory systems let AI keep facts current, personalize across sessions, and reduce costly retraining; pick simple vector RAG for factual QA and reserve heavy platforms for offline analytics or research.

Summary TLDR

This is a broad, engineering-focused survey of memory for large language models and multimodal agents. It organizes prior work into three families: implicit memory (knowledge inside weights), explicit memory (external stores and retrieval/RAG), and agentic memory (persistent short- and long-term memory for agents). It then reviews architectures, training recipes, benchmarks, tools, and open problems (scaling laws, hallucination, retrieval contamination, and latency vs. accuracy trade-offs). The paper includes concrete comparisons on LongMemEval showing that simple vector DBs deliver large factual gains, while full-featured memory platforms often add expensive ingestion costs.

Problem Statement

Modern LLMs need memory to adapt, stay up-to-date, and act over long interactions. The field lacks a unified map of what memory means, how it is represented (weights, vectors, graphs), how to efficiently train and edit it, how to evaluate agentic memory, and how to balance accuracy, latency, and contamination risk.

Main Contribution

A unified taxonomy of memory in (M)LLMs: implicit (weights), explicit (external stores/RAG), and agentic (persistent agent memory).

A survey of mechanisms to analyze, edit, and unlearn implicit (parametric) memory, including ROME, MEMIT, and related methods.

A review of explicit-memory engineering: document/vector/graph representations, training pipelines (pretrain/finetune), and retrieval-augmented methods (RAG, RETRO, Memory3).

A practical evaluation summary and toolkit comparison (LongMemEval) showing accuracy vs latency trade-offs across memory frameworks (ChromaDB, Haystack, LlamaIndex, Mem0, Zep).

A focused section on multimodal and robotics memory needs and open problems for long, time-series, and embodied contexts.

Key Findings

Memory in LLMs is usefully grouped into three families: implicit (weights), explicit (external stores), and agentic (persistent agent memory).

Parametric models have limited factual capacity; theory and experiments estimate roughly '2 bits of knowledge per parameter' and fact memorization scales poorly.

Numbers: 2 bits per parameter (Allen‑Zhu & Li 2024a)
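To make the 2 bits per parameter figure concrete, the arithmetic below converts a parameter count into a rough knowledge capacity. This is a back-of-the-envelope sketch: the 8B parameter count is an illustrative assumption, and treating the estimate as a simple capacity ceiling is a simplification, not a claim from the survey.

```python
def knowledge_capacity_bits(num_params: int, bits_per_param: float = 2.0) -> float:
    """Rough factual capacity, using the ~2 bits/parameter estimate quoted above."""
    return num_params * bits_per_param


def bits_to_gigabytes(bits: float) -> float:
    """Convert bits to decimal gigabytes (1 GB = 1e9 bytes)."""
    return bits / 8 / 1e9


# Assumption: an 8-billion-parameter model, chosen only for illustration.
params_8b = 8_000_000_000
capacity_bits = knowledge_capacity_bits(params_8b)
capacity_gb = bits_to_gigabytes(capacity_bits)

print(f"{capacity_bits:.3g} bits, about {capacity_gb:.1f} GB of stored knowledge")
```

Under these assumptions an 8B model tops out at about 2 GB of factual content, which is small next to a typical enterprise document store and is one reason external memory is attractive.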

Retrieval-augmented setups give large factual accuracy gains on long-context QA versus no memory.

Numbers: GPT-4o-mini: ChromaDB overall 0.600 vs No Memory 0.010 on LongMemEval

Full-featured memory platforms can dramatically increase ingestion and end-to-end latency without proportional accuracy gains.

Numbers: Mem0 overall 0.602 but avg time ≈2106s vs Haystack 0.630 at ≈1s (GPT-4o-mini)

Multi-session (cross-session) reasoning is the hardest memory task across frameworks.

Numbers: Multi-session scores low (e.g., ChromaDB MS 0.222, Haystack MS 0.148 with GPT-4o-mini)

Results

Accuracy

Value: ChromaDB 0.600 overall, 6.41s avg

Baseline: No Memory 0.010 overall, 1.16s avg

Accuracy

Value: Mem0 overall 0.602 at ≈2106s avg

Baseline: Haystack overall 0.630 at ≈1.00s avg
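One way to read these accuracy and latency pairs together is to check which frameworks are dominated, meaning another framework is both at least as accurate and at least as fast. The sketch below uses only the four GPT-4o-mini figures quoted in this summary; the `Run` class and helper names are hypothetical, introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    accuracy: float     # overall LongMemEval score as quoted in this summary
    avg_seconds: float  # average time as quoted in this summary (approximate)


RUNS = [
    Run("No Memory", 0.010, 1.16),
    Run("ChromaDB", 0.600, 6.41),
    Run("Haystack", 0.630, 1.00),
    Run("Mem0", 0.602, 2106.0),
]


def dominates(a: Run, b: Run) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.avg_seconds <= b.avg_seconds
    better = a.accuracy > b.accuracy or a.avg_seconds < b.avg_seconds
    return no_worse and better


def pareto_front(runs: list[Run]) -> list[str]:
    """Names of runs that no other run dominates."""
    return [r.name for r in runs if not any(dominates(o, r) for o in runs)]


def slowdown(a: Run, b: Run) -> float:
    """How many times slower a is than b."""
    return a.avg_seconds / b.avg_seconds


by_name = {r.name: r for r in RUNS}
print(pareto_front(RUNS))
print(round(slowdown(by_name["Mem0"], by_name["Haystack"])))
```

With these four data points only Haystack is undominated, and Mem0 is roughly 2106 times slower than Haystack for a slightly lower overall score. The comparison is only as reliable as the quoted numbers, which the survey itself treats as engine- and dataset-dependent.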

Who Should Care

What To Try In 7 Days

Add a vector DB (Chroma/FAISS) and test RAG on a critical QA flow

Measure end-to-end ingestion latency for any chosen memory platform

Run a 'needle-in-haystack' test for your data to estimate recall under noise
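The needle-in-haystack check in the last item can be prototyped before choosing a vector DB. The sketch below is a minimal, assumption-heavy version: it uses a toy bag-of-words vector in place of a learned embedding model, an in-memory list in place of Chroma or FAISS, and made-up example facts. It hides each known fact among distractors and reports how often top-k retrieval finds it.

```python
import math
import random
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real setup would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, docs: list[str], k: int = 3) -> list[int]:
    """Indices of the k documents most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


def needle_recall(needles, distractors, k: int = 3, seed: int = 0) -> float:
    """Fraction of (question, fact) pairs whose fact appears in the top-k results."""
    rng = random.Random(seed)
    hits = 0
    for question, fact in needles:
        docs = distractors + [fact]   # hide the needle among the noise
        rng.shuffle(docs)
        retrieved = top_k(question, docs, k)
        hits += any(docs[i] == fact for i in retrieved)
    return hits / len(needles)


# Hypothetical example data; replace with facts and noise from your own corpus.
needles = [
    ("What is the staging database password rotation interval?",
     "The staging database password rotation interval is 45 days."),
    ("Which port does the billing service listen on?",
     "The billing service listens on port 8443."),
]
distractors = [
    "The office coffee machine is cleaned on Fridays.",
    "Quarterly planning starts in the second week.",
    "The billing team moved to the third floor.",
    "Database backups are stored for one year.",
]

print(needle_recall(needles, distractors, k=1))
```

Growing the distractor list and lowering `k` gives a rough picture of how recall degrades under noise. Swapping `embed` for a real encoder is the natural next step once the harness works.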

Agent Features

Memory

  • short-term context window (STM)
  • long-term external memory (LTM)
  • episodic trajectories and feedback stores

Planning

  • retrieve-then-plan
  • reflection loops (Self-RAG/RAM)
  • retrieval-augmented planning (RAP)

Tool Use

  • external retriever
  • vector DB API
  • knowledge graph queries
  • tool invocation in ReAct

Frameworks

  • LangChain
  • LlamaIndex
  • Haystack
  • Mem0
  • Zep

Is Agentic

true

Architectures

  • parametric (weights)
  • non-parametric vector DB
  • graph-based memory
  • key-value FFN memory (implicit)

Collaboration

  • shared vector pools
  • hierarchical team graphs (A‑MEM style)
  • message-based memory exchange (IoA)

Optimization Features

Token Efficiency

  • chunking and summarization
  • selective top-k retrieval to reduce context tokens
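The two token-efficiency items above can be sketched as overlapping chunking followed by budgeted selection. This is a minimal illustration under stated assumptions: whitespace-separated words stand in for model tokens, the chunks are assumed to arrive already ranked by a retriever, and the function names are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word chunks of chunk_size, sharing `overlap` words between neighbors."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def select_within_budget(ranked_chunks: list[str], max_words: int) -> list[str]:
    """Keep chunks in ranked order, skipping any that would exceed the word budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        size = len(chunk.split())
        if used + size > max_words:
            continue  # too large for the remaining budget; a smaller one may still fit
        selected.append(chunk)
        used += size
    return selected


sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, chunk_size=5, overlap=2)
context = select_within_budget(chunks, max_words=8)
print(len(chunks), context)
```

The overlap keeps sentences that straddle a boundary retrievable from either side, at the cost of some duplicated tokens. A production version would count tokens with the target model's tokenizer rather than by words.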

Infra Optimization

  • use of FAISS/ANN libraries for fast retrieval
  • cloud-managed vector DBs for persistence (Pinecone/Milvus)

Model Optimization

  • LoRA
  • targeted layer edits (ROME/MEMIT)

System Optimization

  • hierarchical storage (recent in RAM, archive in vector DB)
  • index persistence and incremental ingestion
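The hierarchical storage idea above (recent items in RAM, older items in an archive) can be sketched as a two-tier store with incremental ingestion. The class below is a hypothetical illustration: a plain Python list stands in for the persistent vector DB, and substring matching stands in for similarity search.

```python
from collections import deque


class TieredMemory:
    """Two-tier memory: a small hot tier in RAM and a larger cold archive."""

    def __init__(self, recent_capacity: int = 3):
        self.recent_capacity = recent_capacity
        self.recent: deque[str] = deque()
        self.archive: list[str] = []  # stand-in for a persistent vector DB

    def add(self, record: str) -> None:
        """Ingest one record; spill the oldest hot records into the archive."""
        self.recent.append(record)
        while len(self.recent) > self.recent_capacity:
            self.archive.append(self.recent.popleft())

    def search(self, term: str) -> list[str]:
        """Return matches newest first, hot tier before cold tier."""
        term = term.lower()
        hot = [r for r in reversed(self.recent) if term in r.lower()]
        cold = [r for r in reversed(self.archive) if term in r.lower()]
        return hot + cold


memory = TieredMemory(recent_capacity=2)
for note in ["user likes tea", "meeting at noon", "user moved to Oslo"]:
    memory.add(note)

print(memory.search("user"))
```

Because records are added one at a time, new information is searchable immediately and nothing needs to be re-indexed in bulk. In a real system the spill step would write to the archive index asynchronously so that ingestion does not block the agent.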

Training Optimization

  • retrieval-augmented pretraining (RETRO, InstructRetro)
  • joint retriever-generator pretraining (Atlas-style)

Inference Optimization

  • chunked retrieval + top-k filtering
  • Fusion-in-Decoder (FiD) to parallelize passage encoding

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey synthesizes many papers but does not deliver a single unified evaluation protocol for all memory types.
  • Reported benchmark numbers depend on specific engines (Llama3-8B-IT, GPT-4o-mini) and dataset sampling; results may vary with other LLMs.
  • Practical trade-offs (latency, cost, ingestion) are dataset- and infra-dependent; treat numbers as illustrative.

When Not To Use

  • Don't rely solely on parametric memory for keeping facts up-to-date.
  • Avoid heavyweight memory platforms if your latency or compute budget cannot tolerate slow ingestion.
  • Do not assume single-session solutions generalize to multi-session reasoning without tailored indexing and summarization.

Failure Modes

  • Memory contamination: storing irrelevant or incorrect records that cause hallucination.
  • Distribution shift between cached memory representations and updated model parameters causing retrieval mismatch.
  • Latency spikes during ingestion that break real-time SLAs.
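Ingestion latency spikes are the easiest of these failure modes to catch early, because they can be measured directly. The helper below is a minimal sketch: `ingest` is any callable that stores one record (a hypothetical stand-in for a memory platform's write call), and `sla_seconds` is a threshold you choose yourself.

```python
import statistics
import time
from typing import Any, Callable, Iterable


def measure_ingestion(
    ingest: Callable[[Any], Any],
    records: Iterable[Any],
    sla_seconds: float,
) -> dict:
    """Time each ingest call and count how many exceed the SLA threshold."""
    latencies = []
    for record in records:
        start = time.perf_counter()
        ingest(record)
        latencies.append(time.perf_counter() - start)
    if not latencies:
        raise ValueError("no records to ingest")
    return {
        "count": len(latencies),
        "mean": statistics.mean(latencies),
        "max": max(latencies),
        "violations": sum(lat > sla_seconds for lat in latencies),
    }


# Stand-in store: appending to a list; replace with your platform's write call.
store: list[str] = []
report = measure_ingestion(store.append, ["a", "b", "c"], sla_seconds=1.0)
print(report["count"], report["violations"])
```

Tracking the maximum alongside the mean matters here, since a platform can look fine on average while individual writes still break a real-time SLA.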

Core Entities

Models

  • RETRO
  • RETRO++
  • RAG
  • Memory3
  • MemTRM
  • Unlimiformer
  • LongMemEval
  • Llama3-8B-IT
  • GPT-4o-mini
  • ROME
  • MEMIT

Metrics

  • Accuracy
  • Latency (s)
  • Recall / Precision (memory IE)
  • Multi-Session Reasoning (MS)
  • Temporal Reasoning (TR)
  • Knowledge Update (KU)

Datasets

  • LongMemEval
  • Longmemeval_s_cleaned
  • NarrativeQA
  • QuALITY
  • LooGLE
  • RetrievalQA

Benchmarks

  • LongMemEval
  • KnowEdit
  • MQuAKE
  • Eva-KELLM
  • NeedleInAHaystack