Overview
MemWeaver is practically useful: it externalizes temporal and relational facts into a knowledge graph (KG) and links them to raw evidence, yielding robust, traceable answers from much smaller LM inputs.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
MemWeaver cuts inference input tokens by more than 95% while improving time-sensitive and multi-hop accuracy, so you can support long-running personalized agents without large prompt costs or loss of traceability.
Summary TLDR
MemWeaver is a production-minded memory system for language-model agents that combines a temporally grounded knowledge graph, clustered experience summaries, and passage-level evidence. It uses a dual-channel retriever that returns compact, traceable contexts to the LM. On the LoCoMo long-horizon QA benchmark, MemWeaver keeps inference inputs near 1k tokens (vs. ~22k for long-context prompting), improves temporal and multi-hop accuracy, and preserves supporting passages for traceability. Code and data are publicly available.
Problem Statement
LLM agents in multi-session settings need memories that keep facts time-consistent, composable across sessions, and traceable to source text. Existing flat retrieval or coarse summaries are either brittle for time-sensitive queries or weakly grounded, leading to errors and poor explainability.
Main Contribution
A tri-layer memory design: Graph Memory (time-normalized KG), Experience Memory (clustered reusable items), and Passage Memory (raw evidence).
A dual-channel retrieval pipeline that fetches structured triples plus supporting passages and experience items to build compact inference contexts.
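To make the two contributions concrete, here is a minimal sketch of how the three layers and the retriever could fit together. It is an illustration, not MemWeaver's implementation: the names `Triple`, `TriLayerMemory`, `overlap`, and `retrieve` are hypothetical, word overlap stands in for whatever retriever the system actually uses, and treating "triples with their linked passages" and "experience items" as the two channels is one possible reading of the description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    valid_from: str   # time-normalized ISO date (assumed field)
    passage_id: str   # link back to the raw evidence passage


@dataclass
class TriLayerMemory:
    graph: list[Triple] = field(default_factory=list)        # Graph Memory
    experiences: list[str] = field(default_factory=list)     # Experience Memory
    passages: dict[str, str] = field(default_factory=dict)   # Passage Memory


def overlap(query: str, text: str) -> int:
    """Count shared lowercase words; a crude stand-in for a real retriever."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(memory: TriLayerMemory, query: str, k: int = 2) -> dict:
    def top(items, as_text):
        ranked = sorted(items, key=lambda item: overlap(query, as_text(item)), reverse=True)
        return [item for item in ranked[:k] if overlap(query, as_text(item)) > 0]

    # Channel 1: structured triples, each pulling in its supporting passage.
    triples = top(memory.graph, lambda t: f"{t.subject} {t.relation} {t.obj}")
    evidence = [memory.passages[t.passage_id] for t in triples if t.passage_id in memory.passages]
    # Channel 2: reusable experience items.
    experiences = top(memory.experiences, lambda e: e)
    return {"triples": triples, "passages": evidence, "experiences": experiences}


memory = TriLayerMemory(
    graph=[
        Triple("Alice", "works_at", "Acme", "2024-03-01", "p1"),
        Triple("Bob", "lives_in", "Paris", "2023-11-15", "p2"),
    ],
    experiences=["Alice prefers morning meetings", "Bob travels often"],
    passages={"p1": "Alice: I started at Acme on March 1.", "p2": "Bob: I moved to Paris."},
)
context = retrieve(memory, "Where does Alice work now")
```

Because each triple carries a `passage_id`, the compact context sent to the LM can still cite the original text, which is the traceability property the design aims for.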
Key Findings
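MemWeaver reduces inference input length by over 95% compared to long-context prompting (about 1,000 vs. about 22,000 tokens on LoCoMo).
MemWeaver improves temporal-reasoning F1 from 38.77 (A-Mem) to 50.83, a gain of 12.06 points, on LoCoMo with a GPT-4o-mini backbone.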
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Inference context length | ~1,000 tokens (MemWeaver) | ~22,000 tokens (LoCoMo long-context) | >95% reduction | LoCoMo | Token Efficiency section; main results table | Section 5.3 |
| Temporal F1 (example) | 50.83 | 38.77 (A-Mem) | +12.06 F1 | LoCoMo, GPT-4o-mini backbone | Table 1, GPT-4o-mini rows | Section 5.3 |
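
The two deltas in the table can be checked with a few lines of arithmetic. The token counts are the approximate values reported above, so the computed reduction is itself approximate; the variable names are illustrative.

```python
memweaver_tokens = 1_000    # approximate, from the table
baseline_tokens = 22_000    # approximate, from the table

reduction = 1 - memweaver_tokens / baseline_tokens
temporal_gain = round(50.83 - 38.77, 2)

print(f"input reduction: {reduction:.1%}")     # 95.5%
print(f"temporal F1 gain: {temporal_gain}")    # 12.06
```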
What To Try In 7 Days
Build an offline tri-layer memory from your chat logs (KG + clustered experiences + passages).
Replace full-history prompting with dual-channel retrieval and cap contexts near 1k tokens to measure token and latency savings.
Add a simple temporal normalization step and session-level KG review to fix obvious time contradictions.
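The second and third steps above can be prototyped with the standard library alone. The sketch below is a starting point under stated assumptions, not MemWeaver's method: `normalize_date`, `find_time_conflicts`, and `cap_context` are hypothetical helpers, only a handful of relative expressions are handled, conflicts are detected only for relations assumed to have a single value per date, and whitespace-separated words are a rough proxy for model tokens.

```python
import re
from datetime import date, timedelta


def normalize_date(expr: str, session_date: date):
    """Resolve a relative or ISO date expression against the session date."""
    text = expr.strip().lower()
    if text == "today":
        return session_date
    if text == "yesterday":
        return session_date - timedelta(days=1)
    match = re.fullmatch(r"(\d+) (day|week)s? ago", text)
    if match:
        days = int(match.group(1)) * (7 if match.group(2) == "week" else 1)
        return session_date - timedelta(days=days)
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None  # unresolved expressions are left for manual review


def find_time_conflicts(triples):
    """Flag (subject, relation, date) keys that map to more than one object.

    Assumes each relation is single-valued on a given date.
    """
    seen = {}
    for subject, relation, obj, when in triples:
        seen.setdefault((subject, relation, when), set()).add(obj)
    return {key: objs for key, objs in seen.items() if len(objs) > 1}


def cap_context(snippets, budget=1000):
    """Keep ranked snippets, in order, until the word budget would be exceeded."""
    kept, used = [], 0
    for snippet in snippets:
        cost = len(snippet.split())
        if used + cost > budget:
            break
        kept.append(snippet)
        used += cost
    return kept
```

For example, with a session dated 2024-05-10, `normalize_date("2 weeks ago", date(2024, 5, 10))` resolves to 2024-04-26, and any triples that then share a subject, relation, and date but disagree on the object are returned by `find_time_conflicts` for review.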
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Memory construction is done offline using a stronger LLM; online construction with small local LMs is unreliable.
Depends on underlying LLM quality for entity/relation extraction and experience induction.
When Not To Use
When you require millisecond-scale end-to-end latency and cannot tolerate retrieval overhead.
When there is no meaningful multi-session or long-term history to consolidate.
Failure Modes
Incorrect or noisy LLM extractions produce wrong triples that persist until reviewed.
Cluster incoherence leads to spurious experience items and wrong generalizations.
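One way to guard against the cluster-incoherence failure mode is to score each cluster before inducing an experience item from it. The sketch below uses mean pairwise Jaccard similarity over words; the source does not say how clusters are validated, so `jaccard`, `cluster_coherence`, and any threshold you pick are assumptions.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def cluster_coherence(items) -> float:
    """Mean pairwise similarity; clusters with fewer than two items count as coherent."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


tight = ["user prefers morning meetings", "user prefers morning calls"]
loose = ["user prefers morning meetings", "flight booked to paris"]
```

Clusters scoring below a chosen threshold could be skipped or routed to review rather than summarized. In practice an embedding-based similarity would likely be a better signal than word overlap.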

