MemWeaver: tri-layer, temporally grounded memory that boosts long-horizon agent reasoning

January 26, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

MemWeaver is practically useful: it externalizes time and relational facts into a KG and links raw evidence, giving robust, traceable answers with much smaller LM input sizes.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Juexiang Ye, Xue Li, Xinyu Yang, Chengkai Huang, Lanshun Nie, Lina Yao, Dechen Zhan

Why It Matters For Business

MemWeaver cuts inference input tokens by >95% (roughly 22k down to 1k per query on LoCoMo) while improving time-sensitive and multi-hop accuracy, so you can support long-running personalized agents without huge prompt costs or loss of traceability.

Summary TLDR

MemWeaver is a production-minded memory system for language-model agents that combines a temporally grounded knowledge graph, clustered experience summaries, and passage-level evidence. A dual-channel retriever returns compact, traceable contexts to the LM. On the LoCoMo long-horizon QA benchmark, MemWeaver keeps inference inputs near 1k tokens (versus ~22k for long-context prompting), improves temporal and multi-hop accuracy, and preserves supporting passages for traceability. Code and data are publicly available.

Problem Statement

LLM agents in multi-session settings need memories that keep facts time-consistent, composable across sessions, and traceable to source text. Existing flat retrieval or coarse summaries are either brittle for time-sensitive queries or weakly grounded, leading to errors and poor explainability.

Main Contribution

A tri-layer memory design: Graph Memory (time-normalized KG), Experience Memory (clustered reusable items), and Passage Memory (raw evidence).

A dual-channel retrieval pipeline that fetches structured triples plus supporting passages and experience items to build compact inference contexts.
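
The three layers and their links can be pictured with a small data model. This is only a sketch under stated assumptions: the class and field names (`Passage`, `Triple`, `ExperienceItem`, `TriLayerMemory`, `valid_on`, `evidence_ids`) are hypothetical and are not taken from the MemWeaver code.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Passage:
    """Passage Memory: a raw evidence snippet from one session."""
    passage_id: str
    session_id: str
    text: str


@dataclass(frozen=True)
class Triple:
    """Graph Memory: a fact with a normalized (absolute) date and evidence links."""
    subject: str
    relation: str
    obj: str
    valid_on: Optional[date]        # normalized date, if known
    evidence_ids: tuple[str, ...]   # passage_ids that support this fact


@dataclass
class ExperienceItem:
    """Experience Memory: a reusable summary induced from a cluster of passages."""
    summary: str
    member_ids: list[str] = field(default_factory=list)


@dataclass
class TriLayerMemory:
    """Hypothetical container tying the three layers together."""
    passages: dict[str, Passage] = field(default_factory=dict)
    triples: list[Triple] = field(default_factory=list)
    experiences: list[ExperienceItem] = field(default_factory=list)

    def add_passage(self, passage: Passage) -> None:
        self.passages[passage.passage_id] = passage

    def add_triple(self, triple: Triple) -> None:
        # Refuse facts that cannot be traced back to stored evidence.
        missing = [pid for pid in triple.evidence_ids if pid not in self.passages]
        if missing:
            raise KeyError(f"unknown evidence passages: {missing}")
        self.triples.append(triple)

    def evidence_for(self, triple: Triple) -> list[Passage]:
        """Return the raw passages that back a graph fact."""
        return [self.passages[pid] for pid in triple.evidence_ids]
```

Keeping `evidence_ids` on every triple is one simple way to get the traceability the design aims for: any graph fact returned at inference time can be followed back to its source text.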

Key Findings

MemWeaver reduces inference input length by over 95% compared to long-context prompting.

Numbers: >95% token reduction (~22k → ~1k tokens per query)

Practical Use: Cut inference token cost dramatically; use MemWeaver when long-history prompts are prohibitive.

Evidence Ref: Main results, Token Efficiency section

MemWeaver improves Temporal reasoning F1 substantially.

Numbers: Temporal F1 with GPT-4o-mini: 38.77 → 50.83 (+12.06 F1)

Practical Use: Add temporal KG grounding if you need accurate time-aware answers across sessions.

Evidence Ref: Section 5.3, Main Results / Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Inference context length | ~1,000 tokens (MemWeaver) | ~22,000 tokens (LoCoMo long-context) | >95% reduction | LoCoMo | Token Efficiency section; main results table | Section 5.3
Temporal F1 (example) | 50.83 | 38.77 (A-Mem) | +12.06 F1 | LoCoMo, GPT-4o-mini backbone | Table 1, GPT-4o-mini rows | Section 5.3
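
The two deltas in the table can be checked with a few lines of arithmetic. The token counts are the rounded figures reported above (~22,000 and ~1,000), so the percentage is approximate; the helper name `percent_reduction` is just illustrative.

```python
def percent_reduction(baseline: float, new: float) -> float:
    """Relative reduction of `new` versus `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


# Approximate per-query context sizes from the table.
token_reduction = percent_reduction(22_000, 1_000)

# Temporal F1 on LoCoMo with the GPT-4o-mini backbone: MemWeaver minus A-Mem.
f1_delta = round(50.83 - 38.77, 2)

print(f"token reduction: {token_reduction:.1f}%")   # token reduction: 95.5%
print(f"temporal F1 delta: +{f1_delta:.2f}")        # temporal F1 delta: +12.06
```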

What To Try In 7 Days

Build an offline tri-layer memory from your chat logs (KG + clustered experiences + passages).

Replace full-history prompting with dual-channel retrieval and cap contexts near 1k tokens to measure token and latency savings.

Add a simple temporal normalization step and session-level KG review to fix obvious time contradictions.
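
For the temporal normalization step, a rule-based resolver is a reasonable first attempt. The sketch below assumes each utterance carries its session date and maps a handful of relative expressions to absolute dates; the function name `normalize_time` and the rule table are hypothetical, not the paper's method, which relies on LLM-based extraction.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Hypothetical rule table: relative phrase -> day offset from the session date.
_FIXED_OFFSETS = {
    "today": 0,
    "yesterday": -1,
    "tomorrow": 1,
    "last week": -7,
    "next week": 7,
}
_N_DAYS_AGO = re.compile(r"(\d+)\s+days?\s+ago")


def normalize_time(expression: str, session_date: date) -> Optional[date]:
    """Resolve a time expression to an absolute date, or None if unsupported."""
    text = expression.strip().lower()
    if text in _FIXED_OFFSETS:
        return session_date + timedelta(days=_FIXED_OFFSETS[text])
    match = _N_DAYS_AGO.fullmatch(text)
    if match:
        return session_date - timedelta(days=int(match.group(1)))
    try:
        return date.fromisoformat(text)   # already absolute, e.g. "2026-01-20"
    except ValueError:
        return None                       # leave unresolved rather than guess


session = date(2026, 1, 26)
print(normalize_time("yesterday", session))    # 2026-01-25
print(normalize_time("3 days ago", session))   # 2026-01-23
```

Returning `None` for anything the rules do not cover keeps unresolved expressions visible for later review instead of writing a guessed date into the graph.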

Agent Features

Memory
Graph Memory (time-normalized triples); Experience Memory (clustered reusable items); Passage Memory (raw evidence index)
Planning
session-level KG review (add/update/deny); experience induction to capture recurring patterns
Tool Use
LLM-based entity/relation extraction; LLM-based cluster coherence and routing; offline builder (DeepSeek-V3.2) for memory construction
Frameworks
dual-channel retrieval; KG triple index + dense text retriever
Is Agentic

Yes

Architectures
tri-layer memory (Graph/Experience/Passage); temporally grounded knowledge graph
Collaboration
structured links let LLM combine graph facts and passages during inference
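
The session-level KG review (add/update/deny) listed under Planning can be pictured as a small set of operations applied to the graph after each session. The summary does not spell out the exact semantics, so the policy below is an assumption: one fact per (subject, relation), updates only accepted when not older than the stored fact, and deny retracting an exact match. `ReviewOp` and `apply_review` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Literal

Fact = tuple[str, str, str, date]   # (subject, relation, object, normalized date)
Graph = dict[tuple[str, str], Fact]


@dataclass(frozen=True)
class ReviewOp:
    action: Literal["add", "update", "deny"]
    fact: Fact


def apply_review(graph: Graph, ops: list[ReviewOp]) -> Graph:
    """Apply one session's review decisions and return the reviewed graph."""
    reviewed = dict(graph)
    for op in ops:
        key = (op.fact[0], op.fact[1])
        if op.action == "add":
            reviewed.setdefault(key, op.fact)   # never overwrite silently
        elif op.action == "update":
            current = reviewed.get(key)
            # Assumed policy: only a fact dated at or after the stored one replaces it.
            if current is None or op.fact[3] >= current[3]:
                reviewed[key] = op.fact
        elif op.action == "deny":
            if reviewed.get(key) == op.fact:
                del reviewed[key]               # retract a rejected extraction
    return reviewed
```

Under these assumptions, a stale "update" from an older session cannot overwrite a newer fact, which is the kind of time contradiction the review step is meant to catch.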

Optimization Features

Token Efficiency
compact context ~1k tokens vs ~22k long-context prompting
Infra Optimization
trades retrieval latency for lower LM compute during generation
System Optimization
offline memory construction to avoid repeated LLM writes at inference; buffered cluster updates to amortize LLM extraction cost
Inference Optimization
drastically reduced LM input tokens (↓ >95%); selective retrieval budgets (k_r = k_p = k_e = 6 by default)
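
A rough sketch of how per-channel retrieval budgets and a context cap might interact. It assumes that the three budgets correspond to graph relations, passages, and experience items respectively (the summary does not define the subscripts), that each candidate arrives with a relevance score, and that a whitespace split is an acceptable stand-in for a real tokenizer. All names are hypothetical.

```python
from dataclasses import dataclass

Scored = list[tuple[float, str]]   # (relevance score, text)


@dataclass(frozen=True)
class Budgets:
    k_r: int = 6   # graph channel (assumed meaning: relations/triples)
    k_p: int = 6   # passage channel (assumed meaning)
    k_e: int = 6   # experience channel (assumed meaning)


def top_k(scored: Scored, k: int) -> list[str]:
    """Return the k highest-scoring texts, best first."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:k]]


def build_context(
    triples: Scored,
    passages: Scored,
    experiences: Scored,
    budgets: Budgets = Budgets(),
    max_tokens: int = 1000,
) -> str:
    """Concatenate per-channel top-k items, stopping before the cap is exceeded."""
    selected = (
        top_k(triples, budgets.k_r)
        + top_k(experiences, budgets.k_e)
        + top_k(passages, budgets.k_p)
    )
    lines: list[str] = []
    used = 0
    for item in selected:
        cost = len(item.split())   # crude whitespace count, not a real tokenizer
        if used + cost > max_tokens:
            break
        lines.append(item)
        used += cost
    return "\n".join(lines)
```

Placing compact triples first means the structured facts survive when the cap is tight, with passages filling whatever budget remains; a different ordering may suit workloads where verbatim evidence matters more.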

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Memory construction is done offline using a stronger LLM; online construction with small local LMs is unreliable.

Memory quality depends on the underlying LLM used for entity/relation extraction and experience induction.

When Not To Use

When you require millisecond-scale end-to-end latency and cannot tolerate retrieval overhead.

When there is no meaningful multi-session or long-term history to consolidate.

Failure Modes

Incorrect or noisy LLM extractions produce wrong triples that persist until reviewed.

Cluster incoherence leads to spurious experience items and wrong generalizations.

Core Entities

Models

GPT-4o-mini, Llama3.2-3B, Llama3.2-1B, Qwen2.5-1.5B, DeepSeek-V3.2

Metrics

token-level F1, BLEU-1, ROUGE-2, ROUGE-L, Exact Match (EM), SBERT similarity

Datasets

LoCoMo

Benchmarks

LoCoMo long-horizon QA