Zep: temporal knowledge-graph memory for agents — faster retrieval and better long-term accuracy

Overview

Decision Snapshot: Needs Validation

The system is production-minded: it balances retrieval precision, temporal correctness, and latency. Evidence comes from benchmark wins and engineering choices, but broader real-world validation and open-source artifacts are partial.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 70%

Authors

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef


Why It Matters For Business

Zep returns smaller, temporally-correct context to LLMs, so agents answer complex multi-session and time-sensitive questions more accurately while cutting latency and token costs.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

Zep is a production memory layer that uses Graphiti, a temporally-aware knowledge graph, to store episodes, semantic entities, and community summaries. It combines vector, BM25, and graph search with rerankers and temporal edge invalidation to deliver more accurate and much faster memory retrieval for multi-session agents. On Deep Memory Retrieval (DMR), Zep slightly outperforms MemGPT (94.8% vs 93.4%). On the harder LongMemEval benchmark, Zep reports accuracy gains of up to 18.5% and roughly 90% lower latency than full-context baselines. Both benchmarks have limitations; real-world gains are largest for cross-session and temporal reasoning tasks.
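The hybrid retrieval described above can be pictured as merging several ranked lists into one. The following is a minimal sketch, assuming reciprocal rank fusion (RRF) as the merge step. This summary does not state which fusion or reranking method Zep uses, and the function name, the constant k=60, and the sample fact IDs are illustrative rather than part of any Zep or Graphiti API.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of fact IDs into one ranking.

    Each list contributes 1 / (k + rank) per item, so items ranked
    highly by several searchers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, fact_id in enumerate(ranking, start=1):
            scores[fact_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical outputs of three searchers for one query (assumed data).
vector_hits = ["f2", "f1", "f3"]
bm25_hits = ["f1", "f4", "f2"]
graph_hits = ["f1", "f2", "f5"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits, graph_hits])
print(fused[:3])
```

Rank-based fusion sidesteps the problem that vector similarities, BM25 scores, and graph distances live on different scales; only positions in each list matter.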

Problem Statement

Current RAG systems mostly index static documents and cannot represent evolving conversational facts or cross-session enterprise data. Agents need a searchable, temporal memory that preserves history, handles updates, and returns compact, relevant context to LLMs at low latency.

Main Contribution

Graphiti: a temporally-aware knowledge graph with three tiers—episodes, semantic entities, communities

Bi-temporal modeling and edge invalidation to track when facts become valid/invalid
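To make the bi-temporal idea concrete, here is a minimal sketch, assuming each fact edge carries an event timeline (when the fact was true in the world) and a system timeline (when the store learned or retired it). The class, field names, and the simplified contradiction rule are hypothetical illustrations, not Graphiti's actual schema or logic.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FactEdge:
    """A fact between two entities with two timelines (field names assumed)."""
    subject: str
    predicate: str
    obj: str
    valid_at: datetime                      # when the fact became true in the world
    invalid_at: Optional[datetime] = None   # when it stopped being true
    created_at: Optional[datetime] = None   # when the system ingested it
    expired_at: Optional[datetime] = None   # when the system marked it superseded


def add_fact(edges, new, now):
    """Insert a fact and invalidate older facts it contradicts.

    Contradiction is simplified here to "same subject and predicate,
    different object"; a real system would need a richer check.
    """
    new.created_at = now
    for old in edges:
        contradicts = (
            old.subject == new.subject
            and old.predicate == new.predicate
            and old.obj != new.obj
            and old.invalid_at is None
            and old.valid_at <= new.valid_at
        )
        if contradicts:
            old.invalid_at = new.valid_at   # event time: no longer true
            old.expired_at = now            # system time: record superseded
    edges.append(new)


def facts_valid_at(edges, when):
    """Return facts that were true in the world at time `when`."""
    return [
        e for e in edges
        if e.valid_at <= when and (e.invalid_at is None or when < e.invalid_at)
    ]


edges = []
add_fact(edges, FactEdge("alice", "works_at", "Acme", datetime(2022, 1, 1)), datetime(2022, 1, 5))
add_fact(edges, FactEdge("alice", "works_at", "Globex", datetime(2024, 3, 1)), datetime(2024, 3, 2))

print([e.obj for e in facts_valid_at(edges, datetime(2023, 6, 1))])  # ['Acme']
print([e.obj for e in facts_valid_at(edges, datetime(2024, 6, 1))])  # ['Globex']
```

The older edge is never deleted, only closed off, so a question about mid-2023 still resolves to the fact that held at that time.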

Key Findings

Zep edges out MemGPT on DMR with gpt-4-turbo

Numbers: 94.8% vs 93.4% (DMR, gpt-4-turbo)

Practical Use: If you already use MemGPT-style memory for small multi-session chats, switching to Zep can give marginal accuracy gains on DMR-style fact retrieval.

Evidence Ref: Table 1; Sec. 4.2

Zep gave large accuracy boosts on a long, realistic benchmark

Numbers: +18.5% accuracy (LongMemEval, gpt-4o)

Practical Use: For enterprise scenarios with long histories and temporal questions, Graphiti-style temporal graphs can substantially improve correctness.

Evidence Ref: Sec. 4.3.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	94.8% (Zep, gpt-4-turbo)	MemGPT 93.4% (gpt-4-turbo)	+1.4pp	DMR (500 conversations)	Table 1; Sec.4.2	Table 1
Accuracy	98.2% (Zep, gpt-4o-mini)	Full-conversation 98.0% (gpt-4o-mini)	+0.2pp	DMR (500 conversations)	Table 1; Sec.4.1-4.2	Table 1

What To Try In 7 Days

Index a week of multi-session chat logs into Graphiti and compare answer accuracy vs full-context prompts

Run LongMemEval or task-focused temporal questions on your data to measure real gains

Replace full-conversation context with top-N Graphiti facts and measure API token and latency savings

Agent Features

Memory

episodic memory (raw messages)

semantic memory (entities and facts)

community summaries (high-level clusters)

Tool Use

RAG-style retrieval

cross-encoder rerankers

graph traversal (BFS)
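The breadth-first traversal mentioned above can be sketched as follows. This is a minimal illustration on a plain adjacency dictionary, assuming retrieval starts from a few seed nodes and expands a bounded number of hops; the function name, the hop limit, and the toy graph are hypothetical and not taken from Graphiti's API.

```python
from collections import deque


def bfs_neighbors(graph, seeds, max_hops=2):
    """Collect nodes within `max_hops` edges of any seed node.

    `graph` is an adjacency dict {node: [neighbor, ...]}.
    Returns {node: hop_distance}, seeds included at distance 0.
    """
    distance = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Toy entity graph (assumed data).
graph = {
    "alice": ["acme", "bob"],
    "bob": ["alice", "globex"],
    "acme": ["alice"],
    "globex": ["bob", "carol"],
    "carol": ["globex"],
}
print(bfs_neighbors(graph, ["alice"], max_hops=2))
```

Bounding the hop count keeps the candidate set small, which matters when the results are passed on to a comparatively expensive reranker.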

Frameworks

Graphiti

Zep

Is Agentic

Yes

Architectures

temporal knowledge graph

hierarchical subgraphs (episode/entity/community)

Optimization Features

Token Efficiency

reduces context tokens from ~115k to ~1.6k when retrieving targeted facts
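As a quick worked check of what that reduction means per request, the snippet below computes the saving from the approximate figures quoted above (~115k and ~1.6k tokens). The helper name is illustrative, and the inputs are rounded, so the result is indicative only.

```python
def token_reduction(full_tokens, retrieved_tokens):
    """Return (tokens saved, fractional reduction) per request."""
    saved = full_tokens - retrieved_tokens
    return saved, saved / full_tokens


# Approximate figures from the summary above.
saved, fraction = token_reduction(115_000, 1_600)
print(f"{saved} tokens saved per request ({fraction:.1%} reduction)")
# prints: 113400 tokens saved per request (98.6% reduction)
```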

Infra Optimization

hybrid Lucene + vector index via Neo4j

ability to run rerankers selectively to balance cost

System Optimization

incremental community updates to avoid full recompute

Cypher-based ingestion for consistent schema
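One way to picture an incremental community update is to place each new node by a vote among its already-assigned neighbors, leaving every existing assignment alone. The sketch below assumes that plurality rule; this summary does not describe the exact procedure Zep uses, and the function name, the fallback community naming, and the sample data are hypothetical.

```python
from collections import Counter


def assign_community(node, neighbors, community_of):
    """Place a new node in the community most common among its neighbors.

    Nodes without any assigned neighbor start a new community named
    after themselves. Existing assignments are left untouched, so no
    full re-clustering is needed.
    """
    votes = Counter(
        community_of[n] for n in neighbors if n in community_of
    )
    if votes:
        # Highest count wins; ties break on community name for determinism.
        best = min(votes, key=lambda c: (-votes[c], c))
    else:
        best = f"community:{node}"
    community_of[node] = best
    return best


# Assumed existing assignments.
community_of = {"alice": "c1", "bob": "c1", "carol": "c2"}
print(assign_community("dave", ["alice", "bob", "carol"], community_of))
print(assign_community("erin", [], community_of))
```

A local rule like this is cheap per insertion, but assignments can drift from what a full clustering pass would produce, so an occasional full recompute may still be worthwhile.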

Inference Optimization

smaller context prompts via retrieved facts

use of rerankers to reduce LLM calls

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/getzep/graphiti

https://www.getzep.com

Data URLs

https://arxiv.org/abs/2501.13956

Risks & Boundaries

Limitations

DMR is small and often fits in model context windows, so its results understate the difficulty of real-world memory tasks (Sec. 4.2)

LongMemEval experiments were run from a laptop at a residential location against an AWS-hosted service, adding network latency variability (Sec. 4.3)

When Not To Use

When conversation history always fits in your LLM context window

When you cannot host a graph DB or accept extra infra complexity

Failure Modes

Entity resolution mistakes leading to merged or duplicated entities

Incorrect temporal extraction or edge invalidation producing stale facts

Core Entities

Models

gpt-4-turbo

gpt-4o-mini-2024-07-18

gpt-4o-2024-11-20

gpt-4o

BGE-m3

Metrics

Accuracy

latency

avg_context_tokens

Datasets

Deep Memory Retrieval (DMR)

LongMemEval_s

Multi-Session Chat subset

Benchmarks

DMR

LongMemEval

