Overview
Production Readiness
0.8
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
4
Why It Matters For Business
Zep returns smaller, temporally-correct context to LLMs, so agents answer complex multi-session and time-sensitive questions more accurately while cutting latency and token costs.
Summary TLDR
Zep is a production memory layer that uses Graphiti, a temporally-aware knowledge graph, to store episodes, semantic entities, and community summaries. It combines vector, BM25, and graph search with rerankers and temporal edge invalidation to make memory retrieval for multi-session agents more accurate and much faster. On Deep Memory Retrieval (DMR), Zep slightly outperforms MemGPT (94.8% vs 93.4%). On the harder LongMemEval benchmark, Zep reports up to 18.5% higher accuracy and roughly 90% lower latency than full-context baselines. Both benchmarks have limitations; real-world gains are largest for cross-session and temporal reasoning tasks.
Problem Statement
Current RAG systems mostly index static documents and cannot represent evolving conversational facts or cross-session enterprise data. Agents need a searchable, temporal memory that preserves history, handles updates, and returns compact, relevant context to LLMs at low latency.
Main Contribution
Graphiti: a temporally-aware knowledge graph with three tiers: episodes, semantic entities, and communities
Bi-temporal modeling and edge invalidation to track when facts become valid/invalid
Hybrid retrieval: vector (cosine similarity), BM25, and breadth-first graph search, plus multiple rerankers (RRF, MMR, cross-encoders)
Empirical results showing better accuracy and much lower latency on DMR and LongMemEval benchmarks
Practical design choices for production: incremental community updates, Cypher-based ingestion, and embedding-based resolution
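The hybrid retrieval step produces several ranked candidate lists (vector, BM25, graph search) that need to be merged. The sketch below shows reciprocal rank fusion (RRF), one of the rerankers listed above, in its generic textbook form. It is not Zep's or Graphiti's code: the function name, the fact ids, and the smoothing constant k=60 are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each item scores 1 / (k + rank) per list it appears in (rank starts
    at 1). k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical fact ids returned by three independent searches.
vector_hits = ["f3", "f1", "f7"]
bm25_hits = ["f1", "f9", "f3"]
graph_hits = ["f1", "f3", "f5"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits, graph_hits])
print(fused)
```

Facts that rank well in several searches rise to the top, while a fact found by only one search still survives lower in the list.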
Key Findings
Zep edges out MemGPT on DMR with gpt-4-turbo (94.8% vs 93.4%)
Zep improved accuracy by up to 18.5% on LongMemEval, a longer and more realistic benchmark
Zep cut response latency by roughly 90% by returning smaller contexts
Performance declined on one question type (single-session-assistant)
DMR benchmark is limited for enterprise memory evaluation
Results
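Accuracy (DMR, gpt-4-turbo): 94.8% for Zep vs 93.4% for MemGPT
Accuracy (LongMemEval): up to +18.5% versus full-context baselines
Latency reduction (LongMemEval): ~90% versus full-context baselines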
Who Should Care
What To Try In 7 Days
Index a week of multi-session chat logs into Graphiti and compare answer accuracy vs full-context prompts
Run LongMemEval or task-focused temporal questions on your data to measure real gains
Replace full-conversation context with top-N Graphiti facts and measure API token and latency savings
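To put a number on the savings from the third experiment, a small helper like the one below may be enough. It is a minimal sketch: the function name is hypothetical, and the example input is the average context size reported in this summary (~115k tokens with full context, ~1.6k with retrieved facts). Substitute your own measured token counts or latencies.

```python
def reduction_pct(baseline, treated):
    """Percentage by which `treated` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - treated) / baseline


# Average context tokens per query: full conversation vs retrieved facts.
token_saving = reduction_pct(115_000, 1_600)
print(f"context tokens reduced by {token_saving:.1f}%")
```

The same function works for latency if you pass median response times for the two setups.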
Agent Features
Memory
- episodic memory (raw messages)
- semantic memory (entities and facts)
- community summaries (high-level clusters)
Tool Use
- RAG-style retrieval
- cross-encoder rerankers
- graph traversal (BFS)
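The graph traversal item can be illustrated with a hop-limited breadth-first search that expands from seed entities to nearby nodes. This is a generic sketch over an in-memory adjacency dict, not Graphiti's Neo4j-backed search. The function name, the toy graph, and the hop limit are all assumptions.

```python
from collections import deque


def bfs_within_hops(adjacency, seeds, max_hops):
    """Return {node: hop distance} for every node within max_hops of a seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Toy entity graph (hypothetical data).
graph = {
    "Alice": ["Acme"],
    "Acme": ["Alice", "Berlin"],
    "Berlin": ["Acme", "Germany"],
    "Germany": ["Berlin"],
}

nearby = bfs_within_hops(graph, ["Alice"], max_hops=2)
print(nearby)
```

Nodes beyond the hop limit ("Germany" here) are never visited, which keeps the candidate set small before reranking.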
Frameworks
- Graphiti
- Zep
Is Agentic
true
Architectures
- temporal knowledge graph
- hierarchical subgraphs (episode/entity/community)
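A bi-temporal fact edge with invalidation, as described under Main Contribution, might be modeled roughly as follows. This is a sketch under stated assumptions, not Graphiti's schema. The class and field names are illustrative. The contradiction rule (same subject and predicate, different object) is also a deliberately simple stand-in for whatever contradiction detection the real system performs.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FactEdge:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime                     # when the fact became true in the world
    invalid_at: Optional[datetime] = None  # when it stopped being true
    created_at: Optional[datetime] = None  # when the system ingested it
    expired_at: Optional[datetime] = None  # when the system retired it


def add_fact(edges, new, now):
    """Append `new` and invalidate older open edges that it contradicts."""
    new.created_at = now
    for edge in edges:
        same_slot = (edge.subject, edge.predicate) == (new.subject, new.predicate)
        still_open = edge.invalid_at is None
        if same_slot and still_open and edge.obj != new.obj and edge.valid_at <= new.valid_at:
            edge.invalid_at = new.valid_at  # history is kept, not deleted
            edge.expired_at = now
    edges.append(new)


def facts_valid_at(edges, when):
    """Return the edges that were true in the world at time `when`."""
    return [e for e in edges
            if e.valid_at <= when and (e.invalid_at is None or when < e.invalid_at)]


edges = []
add_fact(edges, FactEdge("Alice", "works_at", "Acme", datetime(2023, 1, 1)),
         now=datetime(2023, 1, 5))
add_fact(edges, FactEdge("Alice", "works_at", "Globex", datetime(2024, 6, 1)),
         now=datetime(2024, 6, 3))

print([e.obj for e in facts_valid_at(edges, datetime(2023, 7, 1))])
print([e.obj for e in facts_valid_at(edges, datetime(2024, 7, 1))])
```

Keeping world time (`valid_at`, `invalid_at`) separate from system time (`created_at`, `expired_at`) lets a query ask what was true on a given date even when the system learned about the change later.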
Optimization Features
Token Efficiency
- reduces context tokens from ~115k to ~1.6k when retrieving targeted facts
Infra Optimization
- hybrid Lucene + vector index via Neo4j
- ability to run rerankers selectively to balance cost
System Optimization
- incremental community updates to avoid full recompute
- Cypher-based ingestion for consistent schema
Inference Optimization
- smaller context prompts via retrieved facts
- use of rerankers to reduce LLM calls
Reproducibility
Data Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- DMR conversations are small and often fit within model context windows, so its results understate the difficulty of real-world memory tasks (Sec.4.2)
- LongMemEval experiments were run from a laptop on a residential connection against an AWS-hosted service, which adds network latency variability (Sec.4.3)
- Dynamic community updates are approximate and require periodic full refreshes
- Some question types (single-session-assistant) showed notable performance drops
When Not To Use
- When conversation history always fits in your LLM context window
- When you cannot host a graph DB or accept extra infra complexity
- When you need the lowest possible per-query cost and cannot afford reranker or cross-encoder compute
Failure Modes
- Entity resolution mistakes leading to merged or duplicated entities
- Incorrect temporal extraction or edge invalidation producing stale facts
- High-cost cross-encoder reranking harming latency and budget
- Community divergence over long incremental updates without refresh
Core Entities
Models
- gpt-4-turbo
- gpt-4o-mini-2024-07-18
- gpt-4o-2024-11-20
- gpt-4o
- BGE-m3
Metrics
- Accuracy
- latency
- avg_context_tokens
Datasets
- Deep Memory Retrieval (DMR)
- LongMemEval_s
- Multi-Session Chat subset
Benchmarks
- DMR
- LongMemEval

