Zep: temporal knowledge-graph memory for agents — faster retrieval and better long-term accuracy

January 20, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The system is production-minded: it balances retrieval precision, temporal correctness, and latency. Evidence comes from benchmark wins and engineering choices, but broader real-world validation and open-source artifacts are partial.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 70%

Authors

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Zep returns smaller, temporally-correct context to LLMs, so agents answer complex multi-session and time-sensitive questions more accurately while cutting latency and token costs.

Summary TLDR

Zep is a production memory layer that uses Graphiti, a temporally-aware knowledge graph, to store episodes, semantic entities, and community summaries. It combines vector/BM25/graph search with rerankers and temporal edge invalidation to deliver more accurate and much faster memory retrieval for multi-session agents. On Deep Memory Retrieval (DMR) Zep slightly outperforms MemGPT (94.8% vs 93.4%). On the harder LongMemEval benchmark Zep reports up to +18.5% accuracy and ~90% reduced latency versus full-context baselines. Benchmarks have limitations; real-world gains are largest for cross-session and temporal reasoning tasks.
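The summary mentions merging vector, BM25, and graph search results before reranking. One common way to merge ranked lists is reciprocal rank fusion (RRF). The sketch below illustrates that general technique only; it is not taken from Zep's code, and the function name, the constant `k=60`, and the fact ids are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of item ids into one ranked list.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest score first; ties are broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from three retrievers (ids are made up).
vector_hits = ["fact_a", "fact_b", "fact_c"]
bm25_hits = ["fact_b", "fact_a", "fact_d"]
graph_hits = ["fact_b", "fact_e"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits, graph_hits])
```

Items that several retrievers agree on rise to the top, which is the property a hybrid search step needs before a more expensive reranker runs on the shortlist.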

Problem Statement

Current RAG systems mostly index static documents and cannot represent evolving conversational facts or cross-session enterprise data. Agents need a searchable, temporal memory that preserves history, handles updates, and returns compact, relevant context to LLMs at low latency.

Main Contribution

Graphiti: a temporally-aware knowledge graph with three tiers—episodes, semantic entities, communities

Bi-temporal modeling and edge invalidation to track when facts become valid/invalid
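A minimal sketch of what bi-temporal edges with invalidation could look like. It tracks two timelines per fact: when the fact held in the world, and when the system recorded or retired it. The class, field names, and the contradiction rule (same subject and predicate, different object) are simplifying assumptions for illustration, not Graphiti's actual schema or logic.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class FactEdge:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime                     # when the fact became true in the world
    invalid_at: Optional[datetime] = None  # when it stopped being true
    created_at: Optional[datetime] = None  # when the system ingested it
    expired_at: Optional[datetime] = None  # when the system marked it superseded


def add_fact(edges: List[FactEdge], new: FactEdge, now: datetime) -> None:
    """Append `new` and invalidate older edges it contradicts (assumed rule)."""
    new.created_at = now
    for old in edges:
        same_slot = (old.subject, old.predicate) == (new.subject, new.predicate)
        still_open = old.invalid_at is None and old.valid_at <= new.valid_at
        if same_slot and old.obj != new.obj and still_open:
            old.invalid_at = new.valid_at  # history is kept, not deleted
            old.expired_at = now
    edges.append(new)


def facts_valid_at(edges: List[FactEdge], when: datetime) -> List[FactEdge]:
    """Return edges whose world-time validity interval contains `when`."""
    return [
        e for e in edges
        if e.valid_at <= when and (e.invalid_at is None or when < e.invalid_at)
    ]


edges: List[FactEdge] = []
add_fact(edges, FactEdge("Alice", "works_at", "Acme", datetime(2023, 1, 1)),
         now=datetime(2023, 1, 5))
add_fact(edges, FactEdge("Alice", "works_at", "Globex", datetime(2024, 6, 1)),
         now=datetime(2024, 6, 3))

employer_2023 = [e.obj for e in facts_valid_at(edges, datetime(2023, 7, 1))]
employer_2024 = [e.obj for e in facts_valid_at(edges, datetime(2024, 7, 1))]
```

Because the superseded edge is closed rather than removed, a question about mid-2023 still resolves to the older employer while a question about mid-2024 resolves to the newer one.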

Key Findings

Zep edges out MemGPT on DMR with gpt-4-turbo

Numbers: 94.8% vs 93.4% (DMR, gpt-4-turbo)

Practical Use: If you already use MemGPT-style memory for small multi-session chats, switching to Zep can give marginal accuracy gains on DMR-style fact retrieval.

Evidence Ref: Table 1; Sec.4.2

Zep gave large accuracy boosts on a long, realistic benchmark

Numbers: +18.5% accuracy (LongMemEval, gpt-4o)

Practical Use: For enterprise scenarios with long histories and temporal questions, Graphiti-style temporal graphs can substantially improve correctness.

Evidence Ref: Sec.4.3.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 94.8% (Zep, gpt-4-turbo) | MemGPT 93.4% (gpt-4-turbo) | +1.4 pp | DMR (500 conversations) | Table 1; Sec.4.2 | Table 1
Accuracy | 98.2% (Zep, gpt-4o-mini) | Full-conversation 98.0% (gpt-4o-mini) | +0.2 pp | DMR (500 conversations) | Table 1; Sec.4.1-4.2 | Table 1
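The Delta column is expressed in percentage points (pp), which is a plain difference of the two accuracies rather than a relative improvement. The short sketch below recomputes the deltas from the reported values; the helper names are illustrative.

```python
def percentage_point_delta(value_pct, baseline_pct):
    """Absolute difference between two percentages, in percentage points."""
    return round(value_pct - baseline_pct, 1)


def relative_change_pct(value_pct, baseline_pct):
    """Change relative to the baseline, as a percentage of the baseline."""
    return round(100.0 * (value_pct - baseline_pct) / baseline_pct, 1)


# Values as reported for DMR in the results above.
dmr_turbo_pp = percentage_point_delta(94.8, 93.4)   # Zep vs MemGPT, gpt-4-turbo
dmr_mini_pp = percentage_point_delta(98.2, 98.0)    # Zep vs full conversation, gpt-4o-mini
dmr_turbo_rel = relative_change_pct(94.8, 93.4)     # same comparison, relative terms
```

Keeping the two measures apart matters when comparing these small DMR gaps with gains reported on other benchmarks, since the two can differ noticeably for the same pair of numbers.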

What To Try In 7 Days

Index a week of multi-session chat logs into Graphiti and compare answer accuracy vs full-context prompts

Run LongMemEval or task-focused temporal questions on your data to measure real gains

Replace full-conversation context with top-N Graphiti facts and measure API token and latency savings
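The comparisons above can be sketched as a small harness that runs two context-building strategies over the same questions and records context size and build time. Everything here is a placeholder: `full_context` and `top_n_facts` stand in for your own transcript loader and Graphiti retrieval call, the keyword match is not how Graphiti retrieves facts, and the whitespace token count is only a rough proxy for a real tokenizer. It also times context construction only, not the LLM call.

```python
import time
from statistics import mean


def approx_tokens(text):
    # Crude whitespace count; a real tokenizer will give different numbers.
    return len(text.split())


def measure(build_context, questions):
    """Return mean context size and mean build latency for one strategy."""
    sizes, latencies = [], []
    for question in questions:
        start = time.perf_counter()
        context = build_context(question)
        latencies.append(time.perf_counter() - start)
        sizes.append(approx_tokens(context))
    return {"avg_context_tokens": mean(sizes), "avg_latency_s": mean(latencies)}


# Hypothetical chat history: two distinct messages repeated to simulate length.
HISTORY = ["user: I moved to Berlin in March", "user: my dog is named Rex"] * 50


def full_context(question):
    # Baseline: send the entire history regardless of the question.
    return "\n".join(HISTORY)


def top_n_facts(question, n=2):
    # Placeholder retrieval: keep distinct lines sharing a word with the question.
    words = set(question.lower().split())
    hits = [line for line in dict.fromkeys(HISTORY)
            if words & set(line.lower().split())]
    return "\n".join(hits[:n])


questions = ["where did I move", "what is my dog named"]
baseline = measure(full_context, questions)
retrieved = measure(top_n_facts, questions)
```

Pairing these numbers with answer accuracy on the same questions gives the three measurements the list asks for: accuracy, tokens, and latency.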

Agent Features

Memory
episodic memory (raw messages); semantic memory (entities and facts); community summaries (high-level clusters)
Tool Use
RAG-style retrieval; cross-encoder rerankers; graph traversal (BFS)
Frameworks
Graphiti; Zep
Is Agentic

Yes

Architectures
temporal knowledge graph; hierarchical subgraphs (episode/entity/community)
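Graph traversal (BFS) is listed among the retrieval tools. As a rough illustration of the idea, the sketch below expands outward from seed entities in an in-memory adjacency list and stops after a fixed number of hops. The graph contents, the function name, and the `max_hops` parameter are assumptions; a production system would run the traversal inside the graph database instead.

```python
from collections import deque


def bfs_neighbors(adjacency, seeds, max_hops=2):
    """Return nodes reachable from `seeds` within `max_hops`, with hop counts."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical entity graph (undirected, stored as adjacency lists).
graph = {
    "Alice": ["Acme", "Berlin"],
    "Acme": ["Alice", "Bob"],
    "Bob": ["Acme", "Paris"],
    "Berlin": ["Alice"],
    "Paris": ["Bob"],
}

nearby = bfs_neighbors(graph, ["Alice"], max_hops=2)
```

Limiting the hop count keeps the candidate set small, so that facts near the entities mentioned in a query are surfaced without pulling in the whole graph.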

Optimization Features

Token Efficiency
reduces context tokens from ~115k to ~1.6k when retrieving targeted facts
Infra Optimization
hybrid Lucene + vector index via Neo4j; ability to run rerankers selectively to balance cost
System Optimization
incremental community updates to avoid full recompute; Cypher-based ingestion for consistent schema
Inference Optimization
smaller context prompts via retrieved facts; use of rerankers to reduce LLM calls
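To put the token-efficiency figure in proportion, the arithmetic below works out what going from roughly 115k to roughly 1.6k context tokens means as a percentage reduction and as a compression ratio. The inputs are the approximate values quoted above; the helper name is illustrative.

```python
def reduction_pct(before, after):
    """Percentage reduction from `before` to `after`."""
    return round(100.0 * (before - after) / before, 1)


# Approximate context sizes quoted in the Token Efficiency entry.
full_context_tokens = 115_000
retrieved_context_tokens = 1_600

token_reduction = reduction_pct(full_context_tokens, retrieved_context_tokens)
compression_ratio = round(full_context_tokens / retrieved_context_tokens, 1)
```

Since input tokens are typically billed per token, prompt cost for these queries would shrink by about the same proportion, before accounting for the cost of running retrieval and rerankers.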

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

DMR is small and its conversations often fit in model context windows, so strong DMR results may understate the difficulty of real workloads (Sec.4.2)

LongMemEval experiments were run from a laptop at a residential location against an AWS-hosted service, which adds network latency variability (Sec.4.3)

When Not To Use

When conversation history always fits in your LLM context window

When you cannot host a graph DB or accept extra infra complexity

Failure Modes

Entity resolution mistakes leading to merged or duplicated entities

Incorrect temporal extraction or edge invalidation producing stale facts

Core Entities

Models

gpt-4-turbo; gpt-4o-mini-2024-07-18; gpt-4o-2024-11-20; gpt-4o; BGE-m3

Metrics

accuracy; latency; avg_context_tokens

Datasets

Deep Memory Retrieval (DMR); LongMemEval_s; Multi-Session Chat subset

Benchmarks

DMR; LongMemEval