Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
MemWeaver cuts inference input tokens by over 95% while improving time-sensitive and multi-hop accuracy, so you can run long-lived personalized agents without large prompt costs or loss of traceability.
Summary TLDR
MemWeaver is a production-minded memory system for language-model agents that combines a temporally grounded knowledge graph, clustered experience summaries, and passage-level evidence. It uses a dual-channel retriever that returns compact, traceable contexts to the LM. On the LoCoMo long-horizon QA benchmark, MemWeaver keeps inference inputs near 1k tokens (versus roughly 22k for long-context prompting), improves temporal and multi-hop accuracy, and preserves supporting passages for traceability. Code and data are publicly linked.
Problem Statement
LLM agents in multi-session settings need memories that keep facts time-consistent, composable across sessions, and traceable to source text. Existing flat retrieval or coarse summaries are either brittle for time-sensitive queries or weakly grounded, leading to errors and poor explainability.
Main Contribution
A tri-layer memory design: Graph Memory (time-normalized KG), Experience Memory (clustered reusable items), and Passage Memory (raw evidence).
A dual-channel retrieval pipeline that fetches structured triples plus supporting passages and experience items to build compact inference contexts.
An end-to-end consolidation flow with LLM-based extraction and session-level review to maintain temporal consistency and traceability.
Comprehensive evaluation on the LoCoMo benchmark showing improved multi-hop and temporal reasoning with much shorter input lengths.
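The tri-layer design can be pictured as three linked stores. The following is a minimal sketch, not the authors' implementation: the class names, field names, and the rule that every triple must point at stored passages are assumptions made here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Passage:
    """Passage Memory entry: raw evidence text (hypothetical schema)."""
    passage_id: str
    session_id: str
    text: str


@dataclass(frozen=True)
class Triple:
    """Graph Memory entry: a time-normalized fact with links to its evidence."""
    subject: str
    relation: str
    obj: str
    valid_from: Optional[str]  # ISO date after temporal normalization, if known
    source_passage_ids: Tuple[str, ...]


@dataclass(frozen=True)
class ExperienceItem:
    """Experience Memory entry: a reusable summary induced from a cluster."""
    cluster_id: str
    summary: str
    source_passage_ids: Tuple[str, ...]


@dataclass
class TriLayerMemory:
    graph: List[Triple] = field(default_factory=list)
    experiences: List[ExperienceItem] = field(default_factory=list)
    passages: Dict[str, Passage] = field(default_factory=dict)

    def add_passage(self, passage: Passage) -> None:
        self.passages[passage.passage_id] = passage

    def add_triple(self, triple: Triple) -> None:
        # Assumed invariant: a fact is only stored if its evidence is stored too.
        missing = [p for p in triple.source_passage_ids if p not in self.passages]
        if missing:
            raise KeyError(f"unknown passage ids: {missing}")
        self.graph.append(triple)

    def evidence_for(self, triple: Triple) -> List[Passage]:
        """Return the supporting passages, which is what makes a fact traceable."""
        return [self.passages[p] for p in triple.source_passage_ids]
```

Keeping passage ids on every triple and experience item is one simple way to get the traceability the summary describes; the actual linking scheme may differ.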
Key Findings
MemWeaver reduces inference input length by over 95% compared to long-context prompting.
MemWeaver improves Temporal reasoning F1 substantially.
MemWeaver yields large gains on adversarial tests.
The system accepts modest retrieval overhead in exchange for compact inputs.
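The "over 95%" figure follows directly from the two context sizes quoted in the summary. A quick check, assuming the rounded values of about 22k and about 1k tokens (the exact per-example counts are not given here):

```python
def input_reduction(baseline_tokens: int, compact_tokens: int) -> float:
    """Fraction of input tokens saved relative to the baseline prompt."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - compact_tokens / baseline_tokens


# Rounded figures from the summary: ~22k tokens full history, ~1k retrieved context.
reduction = input_reduction(22_000, 1_000)
print(f"{reduction:.1%}")  # prints 95.5%
```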
Results
Inference context length
Temporal F1 (example)
Multi-Hop F1 (example)
Adversarial F1
Memory & Retrieval
Who Should Care
What To Try In 7 Days
Build an offline tri-layer memory from your chat logs (KG + clustered experiences + passages).
Replace full-history prompting with dual-channel retrieval and cap contexts near 1k tokens to measure token and latency savings.
Add a simple temporal normalization step and session-level KG review to fix obvious time contradictions.
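For the temporal normalization step, a rule-based starting point might look like the sketch below. It is an assumption-laden stand-in: the function name, the handful of supported phrases, and the choice to anchor relative expressions to the session date are illustrative, and the system described here uses LLM-based extraction rather than regular expressions.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Relative day words mapped to an offset from the session date.
_RELATIVE_DAYS = {"today": 0, "yesterday": -1, "tomorrow": 1}


def normalize_time(expression: str, session_date: date) -> Optional[str]:
    """Resolve a time expression to an ISO date, or None if unrecognized."""
    text = expression.strip().lower()
    if text in _RELATIVE_DAYS:
        return (session_date + timedelta(days=_RELATIVE_DAYS[text])).isoformat()
    match = re.fullmatch(r"(\d+) (day|week)s? ago", text)
    if match:
        count = int(match.group(1))
        days = count * (7 if match.group(2) == "week" else 1)
        return (session_date - timedelta(days=days)).isoformat()
    try:
        # Already an absolute ISO date such as "2023-05-07".
        return date.fromisoformat(text).isoformat()
    except ValueError:
        return None  # leave unresolved rather than guess


print(normalize_time("yesterday", date(2023, 5, 8)))  # prints 2023-05-07
```

Returning None for anything unrecognized keeps guessed dates out of the graph, which matters because wrong dates are listed among the failure modes.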
Agent Features
Memory
- Graph Memory (time-normalized triples)
- Experience Memory (clustered reusable items)
- Passage Memory (raw evidence index)
Planning
- session-level KG review (add/update/deny)
- experience induction to capture recurring patterns
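The add/update/deny review can be illustrated with a deterministic rule. This is only a sketch: the system performs the review with an LLM, and the decision rule below (one value per subject and relation, newer facts replace older ones, everything else is denied) is an assumption introduced here, as are the names `Fact` and `review`.

```python
from typing import Dict, NamedTuple, Tuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str
    valid_from: str  # ISO date, so string comparison matches date order


def review(kg: Dict[Tuple[str, str], Fact], candidate: Fact) -> str:
    """Apply one candidate fact to the graph and return the decision taken."""
    key = (candidate.subject, candidate.relation)
    current = kg.get(key)
    if current is None:
        kg[key] = candidate
        return "add"
    if candidate.obj != current.obj and candidate.valid_from > current.valid_from:
        kg[key] = candidate  # newer, conflicting value supersedes the old one
        return "update"
    return "deny"  # duplicate or stale candidate; keep the graph unchanged


kg: Dict[Tuple[str, str], Fact] = {}
print(review(kg, Fact("alice", "lives_in", "paris", "2022-01-10")))   # prints add
print(review(kg, Fact("alice", "lives_in", "berlin", "2023-05-07")))  # prints update
print(review(kg, Fact("alice", "lives_in", "paris", "2021-03-01")))   # prints deny
```

A real reviewer would also need to handle multi-valued relations and keep superseded facts with an end date instead of overwriting them.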
Tool Use
- LLM-based entity/relation extraction
- LLM-based cluster coherence and routing
- offline builder (DeepSeek-V3.2) for memory construction
Frameworks
- dual-channel retrieval
- KG triple index + dense text retriever
Is Agentic
true
Architectures
- tri-layer memory (Graph/Experience/Passage)
- temporally grounded knowledge graph
Collaboration
- structured links let the LLM combine graph facts and passages during inference
Optimization Features
Token Efficiency
- compact context ~1k tokens vs ~22k long-context prompting
Infra Optimization
- accepts added retrieval latency in exchange for lower LM compute during generation
System Optimization
- offline memory construction to avoid repeated LLM writes at inference
- buffered cluster updates to amortize LLM extraction cost
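Buffered cluster updates can be sketched as follows. The class name, the `flush_size` threshold, and the `summarize` callable (a placeholder for an LLM call) are hypothetical; the point is only that one model call covers several new items instead of one call per item.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class BufferedClusterUpdater:
    """Collect new items per cluster and summarize them in batches."""

    def __init__(self, summarize: Callable[[List[str]], str], flush_size: int = 4):
        self.summarize = summarize  # stand-in for an LLM extraction call
        self.flush_size = flush_size
        self.buffers: Dict[str, List[str]] = defaultdict(list)
        self.summaries: Dict[str, List[str]] = defaultdict(list)
        self.calls = 0  # number of summarize invocations, to show the amortization

    def add(self, cluster_id: str, text: str) -> bool:
        """Buffer one item; return True if this triggered a flush."""
        self.buffers[cluster_id].append(text)
        if len(self.buffers[cluster_id]) >= self.flush_size:
            self.flush(cluster_id)
            return True
        return False

    def flush(self, cluster_id: str) -> None:
        """Summarize and clear the buffer for one cluster, if it has items."""
        items = self.buffers.pop(cluster_id, [])
        if not items:
            return
        self.summaries[cluster_id].append(self.summarize(items))
        self.calls += 1
```

With `flush_size=4`, eight items for one cluster cost two summarizer calls instead of eight. Calling `flush` for every cluster at the end of a session picks up any leftovers.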
Inference Optimization
- drastically reduced LM input tokens (↓ >95%)
- selective retrieval budgets (kr=kp=ke=6 default)
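One way to combine per-channel budgets with a hard cap on context size is sketched below. Several things are assumptions: that kr, kp, and ke refer to triples, passages, and experience items respectively, that word overlap can stand in for the triple index and dense retriever, and that whitespace-separated words approximate tokens. The function names are hypothetical.

```python
from typing import List, Sequence, Set


def _words(text: str) -> Set[str]:
    return set(text.lower().split())


def top_k(query: str, items: Sequence[str], k: int) -> List[str]:
    """Rank items by word overlap with the query (stand-in for real retrievers)."""
    query_words = _words(query)
    scored = [(len(query_words & _words(item)), i, item) for i, item in enumerate(items)]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))  # best score first, stable
    return [item for score, _, item in ranked[:k] if score > 0]


def build_context(query: str, triples: Sequence[str], passages: Sequence[str],
                  experiences: Sequence[str], kr: int = 6, kp: int = 6,
                  ke: int = 6, max_tokens: int = 1000) -> str:
    """Retrieve from each store under its own budget, then pack under a token cap."""
    sections = [
        ("Facts", top_k(query, triples, kr)),
        ("Evidence", top_k(query, passages, kp)),
        ("Experience", top_k(query, experiences, ke)),
    ]
    lines: List[str] = []
    used = 0
    for label, items in sections:
        for item in items:
            cost = len(item.split())  # crude token estimate
            if used + cost > max_tokens:
                continue  # skip items that would exceed the cap
            lines.append(f"[{label}] {item}")
            used += cost
    return "\n".join(lines)
```

Labeling each line with its source store keeps the compact context traceable, and logging the `used` value per query gives the token-savings measurement suggested in the 7-day plan.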
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Memory construction is done offline using a stronger LLM; online construction with small local LMs is unreliable.
- Depends on underlying LLM quality for entity/relation extraction and experience induction.
- Current design targets text-only interactions; multimodal support is unaddressed.
- Retrieval adds latency and a modest memory overhead compared to flat retrieval.
When Not To Use
- When you require millisecond-scale end-to-end latency and cannot tolerate retrieval overhead.
- When there is no meaningful multi-session or long-term history to consolidate.
- When your data is multimodal (images/audio) and you need immediate multimodal memory support.
Failure Modes
- Incorrect or noisy LLM extractions produce wrong triples that persist until reviewed.
- Cluster incoherence leads to spurious experience items and wrong generalizations.
- Errors in temporal normalization can cause misordered events and wrong answers.
- Over-reliance on offline construction may miss real-time updates or corrections.
Core Entities
Models
- GPT-4o-mini
- Llama3.2-3B
- Llama3.2-1B
- Qwen2.5-1.5B
- DeepSeek-V3.2
Metrics
- token-level F1
- BLEU-1
- ROUGE-2
- ROUGE-L
- Exact Match (EM)
- SBERT similarity
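Token-level F1, the first metric listed, is the harmonic mean of precision and recall over the tokens shared by a predicted and a reference answer. The sketch below is a simplified version that assumes lowercasing and whitespace tokenization; the benchmark's exact answer normalization is not specified here and may differ.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers match; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Counter intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("May 2023", "7 May 2023"), 2))  # prints 0.8
```

In the printed example the prediction has precision 1.0 and recall 2/3, giving an F1 of 0.8. Partial credit of this kind is why token-level F1 is reported alongside Exact Match.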
Datasets
- LoCoMo
Benchmarks
- LoCoMo long-horizon QA

