Zep: temporal knowledge-graph memory for agents — faster retrieval and better long-term accuracy

January 20, 2025 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

4

Authors

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef

Why It Matters For Business

Zep returns smaller, temporally-correct context to LLMs, so agents answer complex multi-session and time-sensitive questions more accurately while cutting latency and token costs.

Summary TLDR

Zep is a production memory layer that uses Graphiti, a temporally-aware knowledge graph, to store episodes, semantic entities, and community summaries. It combines vector, BM25, and graph search with rerankers and temporal edge invalidation to deliver more accurate and much faster memory retrieval for multi-session agents. On Deep Memory Retrieval (DMR), Zep slightly outperforms MemGPT (94.8% vs 93.4%). On the harder LongMemEval benchmark, Zep reports accuracy improvements of up to 18.5% and roughly 90% lower latency than full-context baselines. Both benchmarks have limitations; real-world gains are largest for cross-session and temporal reasoning tasks.
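The fusion step mentioned above can be illustrated with Reciprocal Rank Fusion (RRF), one of the rerankers the summary lists. This is a minimal sketch, not Zep's implementation: the function name, the example fact ids, and the constant `k=60` (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each item scores 1 / (k + rank) per list it appears in (rank starts at 1).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from the three search methods.
vector_hits = ["fact_a", "fact_b", "fact_c"]
bm25_hits = ["fact_b", "fact_d", "fact_a"]
graph_hits = ["fact_b", "fact_a", "fact_e"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits, graph_hits])
print(fused)
```

RRF needs only ranks, not comparable scores, which is why it is a convenient way to merge cosine similarity, BM25, and graph-distance results.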

Problem Statement

Current RAG systems mostly index static documents and cannot represent evolving conversational facts or cross-session enterprise data. Agents need a searchable, temporal memory that preserves history, handles updates, and returns compact, relevant context to LLMs at low latency.

Main Contribution

Graphiti: a temporally-aware knowledge graph with three tiers—episodes, semantic entities, communities

Bi-temporal modeling and edge invalidation to track when facts become valid/invalid

Hybrid retrieval: vector (cosine similarity), BM25, and breadth-first graph search, plus multiple rerankers (RRF, MMR, cross-encoders)

Empirical results showing better accuracy and much lower latency on DMR and LongMemEval benchmarks

Practical design choices for production: incremental community updates, Cypher-based ingestion, and embedding-based resolution
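The bi-temporal modeling and edge invalidation listed above can be sketched as follows. This is an illustration under stated assumptions, not Graphiti's code: the field names are hypothetical, and a contradiction is detected here with a simple "same subject and predicate, different object" rule, which stands in for whatever contradiction detection the real system uses.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class FactEdge:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime                      # when the fact became true in the world
    invalid_at: Optional[datetime] = None   # when it stopped being true
    created_at: datetime = field(default_factory=datetime.now)  # ingestion time
    expired_at: Optional[datetime] = None   # when the system retired the edge


def add_fact(edges, new, now=None):
    """Append `new` and invalidate older edges it contradicts (assumed rule)."""
    now = now or datetime.now()
    for old in edges:
        same_slot = (old.subject, old.predicate) == (new.subject, new.predicate)
        if (same_slot and old.obj != new.obj
                and old.invalid_at is None and old.valid_at <= new.valid_at):
            old.invalid_at = new.valid_at  # history is kept, not deleted
            old.expired_at = now
    edges.append(new)


def facts_at(edges, when):
    """Return edges that were true in the world at time `when`."""
    return [e for e in edges
            if e.valid_at <= when and (e.invalid_at is None or when < e.invalid_at)]


edges = []
add_fact(edges, FactEdge("alice", "works_at", "Acme", datetime(2023, 1, 1)))
add_fact(edges, FactEdge("alice", "works_at", "Globex", datetime(2024, 6, 1)))
print([e.obj for e in facts_at(edges, datetime(2023, 6, 1))])
print([e.obj for e in facts_at(edges, datetime(2024, 7, 1))])
```

Keeping the invalidated edge, rather than overwriting it, is what lets an agent answer both "where does Alice work?" and "where did Alice work in 2023?".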

Key Findings

Zep edges out MemGPT on DMR with gpt-4-turbo

Numbers: 94.8% vs 93.4% (DMR, gpt-4-turbo)

Zep gave large accuracy boosts on a long, realistic benchmark

Numbers: +18.5% accuracy (LongMemEval, gpt-4o)

Zep reduced response latency substantially by returning smaller contexts

Numbers: ~90% latency reduction (experiment claim)

Performance declined on one question type

Numbers: -17.7% (single-session-assistant, gpt-4o)

DMR benchmark is limited for enterprise memory evaluation

Numbers: DMR conversations are 60 messages each (fit in context)

Results

Accuracy

Value: 94.8% (Zep, gpt-4-turbo)

Baseline: MemGPT 93.4% (gpt-4-turbo)

Accuracy

Value: 98.2% (Zep, gpt-4o-mini)

Baseline: Full-conversation 98.0% (gpt-4o-mini)

Accuracy

Value: +18.5% (Zep, gpt-4o)

Baseline: Full-context baseline (per paper)

Latency reduction

Value: ~90% latency reduction reported

Baseline: Full-context implementations with large contexts (~115k tokens)

Who Should Care

What To Try In 7 Days

Index a week of multi-session chat logs into Graphiti and compare answer accuracy vs full-context prompts

Run LongMemEval or task-focused temporal questions on your data to measure real gains

Replace full-conversation context with top-N Graphiti facts and measure API token and latency savings

Agent Features

Memory

  • episodic memory (raw messages)
  • semantic memory (entities and facts)
  • community summaries (high-level clusters)
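The three memory tiers above can be pictured as linked record types. The sketch below is a hypothetical data model, assuming each entity records which episodes it was extracted from and each community lists its member entities; the class and field names are illustrative and are not taken from Graphiti.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    id: str
    text: str  # raw message, kept verbatim


@dataclass
class Entity:
    id: str
    name: str
    source_episodes: list = field(default_factory=list)  # episode ids


@dataclass
class Community:
    id: str
    summary: str
    member_ids: list = field(default_factory=list)  # entity ids


def episodes_for_community(community, entities, episodes):
    """Walk community -> entities -> episodes to recover the raw evidence."""
    ordered_ids = []
    for entity_id in community.member_ids:
        for episode_id in entities[entity_id].source_episodes:
            if episode_id not in ordered_ids:
                ordered_ids.append(episode_id)
    return [episodes[i] for i in ordered_ids]


episodes = {"ep1": Episode("ep1", "I just joined Acme."),
            "ep2": Episode("ep2", "Acme ships widgets.")}
entities = {"alice": Entity("alice", "Alice", ["ep1"]),
            "acme": Entity("acme", "Acme", ["ep1", "ep2"])}
work = Community("c1", "Alice's employment at Acme", ["alice", "acme"])

evidence = episodes_for_community(work, entities, episodes)
print([e.id for e in evidence])
```

Linking tiers this way means a high-level summary can always be traced back to the messages that support it.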

Tool Use

  • RAG-style retrieval
  • cross-encoder rerankers
  • graph traversal (BFS)
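The graph traversal item can be illustrated with a hop-limited breadth-first search over an adjacency list. This is a generic BFS sketch, not Graphiti's query code; the graph contents, the function name, and the `max_hops` parameter are assumptions for illustration.

```python
from collections import deque


def bfs_neighbors(graph, seeds, max_hops=2):
    """Return nodes within `max_hops` of any seed, in visit order, seeds excluded."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append((neighbor, depth + 1))
    return found


# Hypothetical entity graph as an adjacency list.
graph = {
    "alice": ["acme", "bob"],
    "acme": ["alice", "widget"],
    "bob": ["alice", "carol"],
    "widget": ["acme", "factory"],
    "carol": ["bob"],
}

nearby = bfs_neighbors(graph, ["alice"], max_hops=2)
print(nearby)
```

Seeding the search from entities found by vector or BM25 search lets retrieval pull in facts that are related by structure even when they share no words with the query.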

Frameworks

  • Graphiti
  • Zep

Is Agentic

true

Architectures

  • temporal knowledge graph
  • hierarchical subgraphs (episode/entity/community)

Optimization Features

Token Efficiency

  • reduces context tokens from ~115k to ~1.6k when retrieving targeted facts
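As a worked example of what that reduction means, the arithmetic below uses the two token counts quoted above. The price per million input tokens is a hypothetical placeholder, not a quoted rate; substitute your provider's actual pricing.

```python
def reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return 1.0 - after / before


full_context_tokens = 115_000   # approximate figure from the summary
retrieved_tokens = 1_600        # approximate figure from the summary

token_cut = reduction(full_context_tokens, retrieved_tokens)
print(f"Context tokens cut by {token_cut:.1%}")

# Hypothetical input price, for illustration only.
price_per_million_tokens = 2.50
cost_full = full_context_tokens / 1_000_000 * price_per_million_tokens
cost_retrieved = retrieved_tokens / 1_000_000 * price_per_million_tokens
print(f"Input cost per query: {cost_full:.4f} vs {cost_retrieved:.4f}")
```

Input cost scales linearly with prompt tokens, so the same fractional cut applies to per-query input spend whatever the actual price is.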

Infra Optimization

  • hybrid Lucene + vector index via Neo4j
  • ability to run rerankers selectively to balance cost

System Optimization

  • incremental community updates to avoid full recompute
  • Cypher-based ingestion for consistent schema
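One way to picture an incremental community update is to place each new node in the community most common among its neighbors instead of re-clustering the whole graph. The rule below is an assumed simplification for illustration; the function name and the tie-breaking choice are hypothetical.

```python
from collections import Counter


def assign_community(node, neighbors, community_of):
    """Assign `node` to the most common community among its neighbors.

    Ties are broken by the smallest label so results are deterministic.
    A node with no labeled neighbors starts its own community.
    """
    labels = [community_of[n] for n in neighbors if n in community_of]
    if not labels:
        community_of[node] = f"community:{node}"
    else:
        counts = Counter(labels)
        best = max(counts.values())
        community_of[node] = min(l for l, c in counts.items() if c == best)
    return community_of[node]


community_of = {"a": "c1", "b": "c1", "c": "c2"}
first = assign_community("d", ["a", "b", "c"], community_of)
second = assign_community("e", [], community_of)
print(first, second)
```

Because each assignment only looks at local neighbors, small errors can accumulate over many updates, which is consistent with the need for periodic full refreshes noted under Limitations.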

Inference Optimization

  • smaller context prompts via retrieved facts
  • use of rerankers to reduce LLM calls

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • DMR is small and its conversations often fit in model context windows, so strong results on it may not reflect the difficulty of real enterprise memory tasks (Sec.4.2)
  • LongMemEval experiments ran from a laptop on a residential network against an AWS-hosted service, adding network latency variability (Sec.4.3)
  • Dynamic community updates are approximate and require periodic full refreshes
  • Some question types (single-session-assistant) showed notable performance drops

When Not To Use

  • When conversation history always fits in your LLM context window
  • When you cannot host a graph DB or accept extra infra complexity
  • When you need the lowest possible per-query cost and cannot afford reranker or cross-encoder compute

Failure Modes

  • Entity resolution mistakes leading to merged or duplicated entities
  • Incorrect temporal extraction or edge invalidation producing stale facts
  • High-cost cross-encoder reranking harming latency and budget
  • Community divergence over long incremental updates without refresh
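The entity resolution failure mode above can be made concrete with a threshold-based matcher over embeddings. This is a minimal sketch assuming cosine similarity against stored entity vectors; the toy two-dimensional vectors, the function names, and the 0.9 threshold are illustrative assumptions, not values from Zep.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def resolve(new_vec, known, threshold=0.9):
    """Return the id of the best-matching known entity, or None if none is
    similar enough (the caller would then create a new entity)."""
    best_id, best_sim = None, threshold
    for entity_id, vec in known.items():
        sim = cosine(new_vec, vec)
        if sim >= best_sim:
            best_id, best_sim = entity_id, sim
    return best_id


# Toy embeddings, for illustration only.
known = {"acme_corp": [1.0, 0.0], "alice": [0.0, 1.0]}

print(resolve([0.98, 0.2], known))                 # close to acme_corp
print(resolve([0.7, 0.7], known))                  # ambiguous: no match
print(resolve([0.7, 0.7], known, threshold=0.7))   # loose threshold forces a merge
```

A threshold set too low merges distinct entities, while one set too high creates duplicates, which is the trade-off behind the first failure mode listed.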

Core Entities

Models

  • gpt-4-turbo
  • gpt-4o-mini-2024-07-18
  • gpt-4o-2024-11-20
  • gpt-4o
  • BGE-m3

Metrics

  • Accuracy
  • latency
  • avg_context_tokens

Datasets

  • Deep Memory Retrieval (DMR)
  • LongMemEval_s
  • Multi-Session Chat subset

Benchmarks

  • DMR
  • LongMemEval