MemWeaver: tri-layer, temporally grounded memory that boosts long-horizon agent reasoning

Overview

Decision Snapshot: Needs Validation

MemWeaver is practically useful: it externalizes temporal and relational facts into a knowledge graph (KG) and links them to raw evidence, giving robust, traceable answers from much smaller LM inputs.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Juexiang Ye, Xue Li, Xinyu Yang, Chengkai Huang, Lanshun Nie, Lina Yao, Dechen Zhan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MemWeaver cuts inference input tokens by more than 95% on LoCoMo (about 22k to about 1k per query) while improving time-sensitive and multi-hop accuracy, so you can support long-running personalized agents without huge prompt costs or loss of traceability.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

MemWeaver is a production-minded memory system for language-model agents that combines a temporally grounded knowledge graph, clustered experience summaries, and passage-level evidence. It uses a dual-channel retriever that returns compact, traceable contexts to the LM. On the LoCoMo long-horizon QA benchmark, MemWeaver keeps inference inputs near 1k tokens (vs ~22k), improves temporal and multi-hop accuracy, and preserves supporting passages for traceability. Code and data are publicly linked.

Problem Statement

LLM agents in multi-session settings need memories that keep facts time-consistent, composable across sessions, and traceable to source text. Existing flat retrieval or coarse summaries are either brittle for time-sensitive queries or weakly grounded, leading to errors and poor explainability.

Main Contribution

A tri-layer memory design: Graph Memory (time-normalized KG), Experience Memory (clustered reusable items), and Passage Memory (raw evidence).

A dual-channel retrieval pipeline that fetches structured triples plus supporting passages and experience items to build compact inference contexts.
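As a rough illustration of the tri-layer layout described above, the sketch below models the three stores as plain Python dataclasses. Every class and field name here (`Passage`, `Triple`, `ExperienceItem`, `TriLayerMemory`, `valid_from`) is hypothetical and not taken from the MemWeaver code; the aim is only to show how a graph fact can carry a normalized date and a pointer back to its source passage.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Passage:
    """Passage Memory: a raw evidence span kept verbatim."""
    passage_id: str
    session_id: str
    text: str


@dataclass(frozen=True)
class Triple:
    """Graph Memory: one time-normalized fact linked to its evidence."""
    subject: str
    relation: str
    obj: str
    valid_from: Optional[date]  # absolute date after normalization, if known
    passage_id: str             # link back to Passage Memory for traceability


@dataclass
class ExperienceItem:
    """Experience Memory: a reusable summary induced from a cluster."""
    summary: str
    passage_ids: List[str] = field(default_factory=list)


@dataclass
class TriLayerMemory:
    """Hypothetical container holding all three layers side by side."""
    passages: Dict[str, Passage] = field(default_factory=dict)
    triples: List[Triple] = field(default_factory=list)
    experiences: List[ExperienceItem] = field(default_factory=list)

    def add_passage(self, passage: Passage) -> None:
        self.passages[passage.passage_id] = passage

    def add_triple(self, triple: Triple) -> None:
        # Refuse facts whose evidence is missing, so every triple stays traceable.
        if triple.passage_id not in self.passages:
            raise KeyError(f"unknown passage: {triple.passage_id}")
        self.triples.append(triple)

    def evidence_for(self, triple: Triple) -> str:
        return self.passages[triple.passage_id].text
```

Storing the passage ID on each triple is one simple way to keep answers traceable: whatever the retriever returns from the graph can be followed back to the exact text that supports it.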

Key Findings

MemWeaver reduces inference input length by over 95% compared to long-context prompting.

Numbers: >95% token reduction (22k → ~1k tokens per query)

Practical Use: Cut inference token cost dramatically; use MemWeaver when long-history prompts are prohibitive.

Evidence Ref: Main results, Token Efficiency section

MemWeaver improves Temporal reasoning F1 substantially.

Numbers: Temporal F1, GPT-4o-mini: 38.77 → 50.83 (+12.06 F1)

Practical Use: Add temporal KG grounding if you need accurate time-aware answers across sessions.

Evidence Ref: Section 5.3, Main Results / Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Inference context length	~1,000 tokens (MemWeaver)	~22,000 tokens (LoCoMo long-context)	>95% reduction	LoCoMo	Token Efficiency section; main results table	Section 5.3
Temporal F1 (example)	50.83	38.77 (A-Mem)	+12.06 F1	LoCoMo, GPT-4o-mini backbone	Table 1, GPT-4o-mini rows	Section 5.3
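
The deltas in the table can be re-derived from the reported values. The snippet below is only an arithmetic check; it uses the rounded figures quoted here (about 22,000 and about 1,000 tokens), so the percentage is approximate. The helper name `percent_reduction` is made up for this example.

```python
def percent_reduction(baseline: float, value: float) -> float:
    """Relative reduction of `value` against `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


token_reduction = percent_reduction(22_000, 1_000)
temporal_f1_delta = round(50.83 - 38.77, 2)

print(f"token reduction: {token_reduction:.1f}%")   # token reduction: 95.5%
print(f"temporal F1 delta: +{temporal_f1_delta}")    # temporal F1 delta: +12.06
```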

What To Try In 7 Days

Build an offline tri-layer memory from your chat logs (KG + clustered experiences + passages).

Replace full-history prompting with dual-channel retrieval and cap contexts near 1k tokens to measure token and latency savings.

Add a simple temporal normalization step and session-level KG review to fix obvious time contradictions.
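
The "simple temporal normalization step" in the last item could start as small as the sketch below, which resolves a handful of relative expressions against the session date. The rule table and the function name `normalize_relative_date` are illustrative assumptions, not MemWeaver's method; the system summarized here relies on LLM-based extraction, and a hand-written table like this covers only the most obvious cases.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical rule table: relative phrase -> offset in days from the session date.
_RELATIVE_OFFSETS = {
    "today": 0,
    "yesterday": -1,
    "tomorrow": 1,
    "last week": -7,
    "next week": 7,
}


def normalize_relative_date(phrase: str, session_date: date) -> Optional[date]:
    """Map a relative time phrase to an absolute date, or None if unrecognized."""
    key = phrase.strip().lower()
    if key in _RELATIVE_OFFSETS:
        return session_date + timedelta(days=_RELATIVE_OFFSETS[key])
    try:
        # Already-absolute ISO dates (YYYY-MM-DD) pass through unchanged.
        return date.fromisoformat(key)
    except ValueError:
        return None


session = date(2023, 5, 8)
print(normalize_relative_date("yesterday", session))   # 2023-05-07
print(normalize_relative_date("Last week", session))   # 2023-05-01
print(normalize_relative_date("someday", session))     # None
```

Returning `None` for unrecognized phrases, rather than guessing, leaves those cases for a later review pass instead of writing a wrong date into the graph.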

Agent Features

Memory

Graph Memory (time-normalized triples)

Experience Memory (clustered reusable items)

Passage Memory (raw evidence index)

Planning

session-level KG review (add/update/deny)

experience induction to capture recurring patterns
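
To make the add/update/deny review concrete, here is a minimal sketch that applies a list of review decisions to a small fact store keyed by (subject, relation). The semantics of each action are assumptions made for illustration: "add" inserts a fact only if the key is free, "update" overwrites the stored object, and "deny" rejects the candidate without touching the graph. In a real pipeline the decisions would presumably come from an LLM reviewer; here they are hard-coded.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]                  # (subject, relation)
Decision = Tuple[str, str, str, str]   # (action, subject, relation, object)


def apply_review(graph: Dict[Key, str], decisions: Iterable[Decision]) -> Dict[Key, str]:
    """Return a new graph after applying add/update/deny decisions in order."""
    reviewed = dict(graph)
    for action, subject, relation, obj in decisions:
        key = (subject, relation)
        if action == "add":
            # Keep an existing value; an "add" never silently overwrites a fact.
            reviewed.setdefault(key, obj)
        elif action == "update":
            reviewed[key] = obj
        elif action == "deny":
            # Assumed meaning: reject the candidate and leave the graph as it is.
            continue
        else:
            raise ValueError(f"unknown action: {action}")
    return reviewed


graph = {("Alice", "lives_in"): "Berlin"}
decisions = [
    ("update", "Alice", "lives_in", "Paris"),
    ("add", "Alice", "works_at", "Acme"),
    ("deny", "Alice", "lives_in", "Mars"),
]
print(apply_review(graph, decisions))
# {('Alice', 'lives_in'): 'Paris', ('Alice', 'works_at'): 'Acme'}
```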

Tool Use

LLM-based entity/relation extraction

LLM-based cluster coherence and routing

offline builder (DeepSeek-V3.2) for memory construction

Frameworks

dual-channel retrieval

KG triple index + dense text retriever

Is Agentic

Yes

Architectures

tri-layer memory (Graph/Experience/Passage)

temporally grounded knowledge graph

Collaboration

structured links let the LLM combine graph facts and passages during inference

Optimization Features

Token Efficiency

compact context ~1k tokens vs ~22k long-context prompting

Infra Optimization

trades retrieval latency for lower LM compute during generation

System Optimization

offline memory construction to avoid repeated LLM writes at inference

buffered cluster updates to amortize LLM extraction cost

Inference Optimization

drastically reduced LM input tokens (↓ >95%)

selective retrieval budgets (kr=kp=ke=6 default)
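
One way to read the retrieval budgets is as per-channel top-k limits combined with an overall context cap. The sketch below assumes that `kr`, `kp`, and `ke` bound the number of graph triples, passages, and experience items respectively; this page does not define the symbols, so that mapping is a guess. The function name `build_context`, the 1,000-token cap, and the whitespace-based token count are likewise illustrative stand-ins, not MemWeaver's implementation.

```python
from typing import List, Sequence


def build_context(
    triples: Sequence[str],
    passages: Sequence[str],
    experiences: Sequence[str],
    kr: int = 6,
    kp: int = 6,
    ke: int = 6,
    max_tokens: int = 1000,
) -> List[str]:
    """Take the top-k items per channel, then pack them under a token cap.

    Inputs are assumed to be ranked best-first. Token counts use a
    whitespace split as a crude stand-in for a real tokenizer.
    """
    candidates = list(triples[:kr]) + list(passages[:kp]) + list(experiences[:ke])
    context: List[str] = []
    used = 0
    for item in candidates:
        cost = len(item.split())
        if used + cost > max_tokens:
            continue  # skip items that do not fit; shorter ones may still fit
        context.append(item)
        used += cost
    return context
```

Swapping the whitespace count for the target model's tokenizer would give a cap that matches billed tokens more closely.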

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/Chengkai-Huang/MemWeaver_code

Data URLs

https://github.com/snap-research/locomo

Risks & Boundaries

Limitations

Memory construction is done offline using a stronger LLM; online construction with small local LMs is unreliable.

Depends on underlying LLM quality for entity/relation extraction and experience induction.

When Not To Use

When you require millisecond-scale end-to-end latency and cannot tolerate retrieval overhead.

When there is no meaningful multi-session or long-term history to consolidate.

Failure Modes

Incorrect or noisy LLM extractions produce wrong triples that persist until reviewed.

Cluster incoherence leads to spurious experience items and wrong generalizations.

Core Entities

Models

GPT-4o-mini

Llama3.2-3B

Llama3.2-1B

Qwen2.5-1.5B

DeepSeek-V3.2

Metrics

token-level F1

BLEU-1

ROUGE-2

ROUGE-L

Exact Match (EM)

SBERT similarity
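
For readers unfamiliar with token-level F1, the sketch below shows the common bag-of-tokens formulation used in QA evaluation: precision and recall over overlapping tokens, combined as their harmonic mean. The exact normalization used in the paper's evaluation is not described here, so the lowercasing and whitespace tokenization are assumptions.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers match; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(f"{token_f1('in May 2023', 'May 2023'):.2f}")  # 0.80
```

In the example, two of the three predicted tokens match (precision 2/3) and both reference tokens are covered (recall 1), which gives an F1 of 0.8.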

Datasets

LoCoMo

Benchmarks

LoCoMo long-horizon QA

