A-MEM: LLM agents that build and evolve a Zettelkasten-style linked memory

February 17, 20258 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

6

Authors

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang

Links

Abstract / PDF

Why It Matters For Business

A-MEM cuts token and inference cost by ~85–93% per memory while improving multi-session reasoning, making long-term conversational agents materially cheaper and more capable to run at scale.

Summary TLDR

A-MEM is a memory layer for LLM agents that creates structured "notes" for each interaction (content, LLM-generated keywords, tags, contextual description and embedding), uses embedding-based nearest neighbors to shortlist candidates, and then prompts an LLM to decide links and to update existing notes. On long multi-session QA datasets (LoCoMo, DialSim) A-MEM improves multi-hop reasoning scores and reduces token cost dramatically by using selective top-k retrieval and LLM-driven link/evolution steps. Code and production repo are published.

Problem Statement

Existing memory systems for LLM agents use fixed schemas and preset write/retrieve rules, so they struggle to form new organizational patterns or evolve knowledge over long, open-ended interactions. The paper proposes a dynamic, agent-driven memory that both links new items and updates old ones automatically.

Main Contribution

A-MEM: an agentic memory system that constructs structured notes (content, keywords, tags, contextual description, embedding) and autonomously links and evolves memories.

Two core modules: Link Generation (use embeddings + LLM to decide connections) and Memory Evolution (update contexts/tags of existing notes when new related memories arrive).

Large empirical evaluation on long-term conversation datasets (LoCoMo, DialSim) across six foundation models, showing improved multi-hop reasoning and major token/cost savings.

Open-source code for benchmark evaluation and a production-ready system (links provided).

Key Findings

A-MEM improves DialSim QA accuracy over baselines.

NumbersDialSim F1: A-MEM 3.45 vs LoCoMo 2.55 (+35%) vs MemGPT 1.18 (+192%)

A-MEM greatly boosts multi-hop reasoning for GPT-based models.

NumbersGPT-4o-mini Multi-Hop ROUGE-L: A-MEM 44.27 vs LoCoMo 18.09 (>2×)

A-MEM reduces token usage per memory operation by an order of magnitude.

NumbersA-MEM ~1,200 tokens vs baselines ~16,900 tokens (85–93% reduction); cost < $0.0003 per op

Both Link Generation (LG) and Memory Evolution (ME) are necessary for best performance.

NumbersAblation (GPT-4o-mini Multi-Hop F1): full A-MEM 27.02 → w/oME 21.35 → w/oLG&ME 9.65

Results

DialSim F1

ValueA-MEM 3.45

BaselineLoCoMo 2.55; MemGPT 1.18

Multi-Hop ROUGE-L (GPT-4o-mini)

ValueA-MEM 44.27

BaselineLoCoMo 18.09

Token usage per memory operation

ValueA-MEM ~1,200 tokens

BaselineLoCoMo/MemGPT ~16,900 tokens

Ablation (Multi-Hop F1, GPT-4o-mini)

ValueFull A-MEM 27.02

Baselinew/o ME 21.35; w/oLG&ME 9.65

Retrieval time scaling (A-MEM)

Value0.31 µs → 3.70 µs (1k → 1M memories)

BaselineMemoryBank similar space scaling; MemoryBank slightly faster retrieval

Who Should Care

What To Try In 7 Days

Add a structured note layer: store content + LLM-generated keywords, tags, context, and embeddings.

Implement top-k dense retrieval (start k=10) and prompt an LLM to decide which retrieved items to link.

Run an ablation: compare current memory layer vs A-MEM on a held-out multi-session QA set to measure multi-hop gains and token savings.

Agent Features

Memory

  • Note construction (content, timestamp, keywords, tags, context, embedding)
  • Link generation (LLM judgment over top-k neighbors)
  • Memory evolution (update neighbor contexts/tags)

Planning

  • Dynamic link generation to shape memory graph

Tool Use

  • LLMs to generate keywords/tags/context
  • Dense embedding encoder for similarity search

Frameworks

  • Zettelkasten method (atomic notes + flexible linking)

Is Agentic

true

Architectures

  • Zettelkasten-inspired note graph (atomic notes + boxes)
  • Embedding-based index + LLM decision layer

Collaboration

  • Memory boxes allow a note to belong to multiple linked groups

Optimization Features

Token Efficiency

  • Selective top-k retrieval reduces tokens to ~1.2k per operation
  • Tuned k per task balances context richness and noise

System Optimization

  • Local hosting options (Ollama + LiteLLM) for faster, cheaper runs

Reproducibility

Data Urls

  • LoCoMo (see arXiv:2402.17753)
  • DialSim (see arXiv:2406.13144)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance depends on the underlying LLM quality; different LLMs produce different contexts/links.
  • Current implementation is text-only; multimodal memories (images/audio) are not supported yet.
  • No statistical error bars reported; experiments rely on single runs with API calls.
  • Automatic linking risks creating incorrect or spurious connections (hallucinated associations).

When Not To Use

  • For one-off or very short interactions where long-term structure gives no benefit.
  • When strict privacy or compliance prevents storing or enriching user interactions without additional safeguards.
  • When you need multimodal memory (images/audio) out of the box; A-MEM is text-focused.

Failure Modes

  • Incorrect links: LLM may propose spurious connections that mislead downstream reasoning.
  • Drift: evolving contexts may accumulate noise or conflate distinct facts over time.
  • Retrieval overload: very large k adds noise and harms downstream processing.
  • LLM bias/hallucination can propagate into memory tags and future retrievals.

Core Entities

Models

  • GPT-4o-mini
  • GPT-4o
  • DeepSeek-R1-32B
  • Claude 3.0 Haiku
  • Claude 3.5 Haiku
  • Qwen2.5 (1.5b, 3b)
  • Llama 3.2 (1b, 3b)

Metrics

  • F1
  • BLEU-1
  • ROUGE-L
  • ROUGE-2
  • METEOR
  • SBERT Similarity
  • token usage
  • retrieval time

Datasets

  • LoCoMo
  • DialSim

Benchmarks

  • Long-term conversational QA (LoCoMo, DialSim)