MemWeaver: tri-layer, temporally grounded memory that boosts long-horizon agent reasoning

January 26, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Juexiang Ye, Xue Li, Xinyu Yang, Chengkai Huang, Lanshun Nie, Lina Yao, Dechen Zhan

Links

Abstract / PDF

Why It Matters For Business

MemWeaver cuts inference input tokens by more than 95% (roughly 22k down to about 1k per query on LoCoMo) while improving time-sensitive and multi-hop accuracy, so you can run long-lived personalized agents without large prompt costs or loss of traceability.

Summary TLDR

MemWeaver is a production-minded memory system for language-model agents that combines a temporally grounded knowledge graph, clustered experience summaries, and passage-level evidence. It uses a dual-channel retriever that returns compact, traceable contexts to the LM. On the LoCoMo long-horizon QA benchmark, MemWeaver keeps inference inputs near 1k tokens (vs ~22k), improves temporal and multi-hop accuracy, and preserves supporting passages for traceability. Code and data are publicly linked.

Problem Statement

LLM agents in multi-session settings need memories that keep facts time-consistent, composable across sessions, and traceable to source text. Existing flat retrieval or coarse summaries are either brittle for time-sensitive queries or weakly grounded, leading to errors and poor explainability.

Main Contribution

A tri-layer memory design: Graph Memory (time-normalized KG), Experience Memory (clustered reusable items), and Passage Memory (raw evidence).

A dual-channel retrieval pipeline that fetches structured triples plus supporting passages and experience items to build compact inference contexts.

An end-to-end consolidation flow with LLM-based extraction and session-level review to maintain temporal consistency and traceability.

Comprehensive evaluation on the LoCoMo benchmark showing improved multi-hop and temporal reasoning with much shorter input lengths.
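
To make the tri-layer design concrete, here is a minimal sketch of how the three layers could be held together in memory. All class and field names (`Triple`, `ExperienceItem`, `TriLayerMemory`, `passage_id`) are illustrative assumptions, not the authors' schema; the only property taken from the summary is that graph facts carry a normalized time and stay traceable to raw passages.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Triple:
    """Graph Memory entry: one time-normalized fact (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    timestamp: str   # normalized ISO date, e.g. "2023-05-07"
    passage_id: str  # link back to the supporting raw passage


@dataclass
class ExperienceItem:
    """Experience Memory entry: a reusable summary over a cluster of passages."""
    summary: str
    passage_ids: List[str]


@dataclass
class TriLayerMemory:
    graph: List[Triple] = field(default_factory=list)
    experiences: List[ExperienceItem] = field(default_factory=list)
    passages: Dict[str, str] = field(default_factory=dict)  # Passage Memory

    def add_triple(self, triple: Triple) -> None:
        # Refuse facts that cannot be traced to stored evidence.
        if triple.passage_id not in self.passages:
            raise KeyError(f"unknown passage: {triple.passage_id}")
        self.graph.append(triple)

    def evidence_for(self, triple: Triple) -> str:
        return self.passages[triple.passage_id]


memory = TriLayerMemory()
memory.passages["p1"] = "Alice: I moved to Berlin on 7 May 2023."
fact = Triple("Alice", "moved_to", "Berlin", "2023-05-07", "p1")
memory.add_triple(fact)
memory.experiences.append(ExperienceItem("Alice has relocated to Berlin", ["p1"]))
```

Keeping a `passage_id` on every triple is one simple way to get the traceability the design calls for; a real system would likely use a graph store and a vector index instead of Python lists.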

Key Findings

MemWeaver reduces inference input length by over 95% compared to long-context prompting.

Numbers: >95% token reduction (22k → ~1k tokens per query)
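
The headline reduction follows directly from the two rounded context lengths. A quick check, assuming exactly 22,000 and 1,000 tokens (the summary only gives approximate figures):

```python
def token_reduction(baseline_tokens: int, compact_tokens: int) -> float:
    """Fraction of input tokens saved relative to the baseline prompt."""
    return 1.0 - compact_tokens / baseline_tokens


# Rounded figures from the summary: ~22k long-context vs ~1k MemWeaver.
reduction = token_reduction(22_000, 1_000)
print(f"{reduction:.1%}")  # 95.5%
```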

MemWeaver improves Temporal reasoning F1 substantially.

Numbers: Temporal F1 (GPT-4o-mini): 38.77 → 50.83 (+12.06 F1)

MemWeaver yields large gains on adversarial tests.

Numbers: Adversarial F1 (GPT-4o-mini): 72.2 vs A-Mem 25.14

The system trades modest retrieval overhead for compact inputs.

Numbers: MemWeaver total memory 13.31 MB, retrieval 41.57 ± 12.85 ms; MemoryBank total memory 7.23 MB, retrieval 17.07 ± 3.61 ms

Results

Inference context length

Value: ~1,000 tokens (MemWeaver)

Baseline: ~22,000 tokens (LoCoMo long-context)

Temporal F1 (example)

Value: 50.83

Baseline: 38.77 (A-Mem)

Multi-Hop F1 (example)

Value: 26.00

Baseline: 24.35 (LoCoMo long-context)

Adversarial F1

Value: 72.2

Baseline: 25.14 (A-Mem)

Memory & Retrieval

Value: MemWeaver total 13.31 MB; retrieval 41.57 ± 12.85 ms

Baseline: MemoryBank total 7.23 MB; retrieval 17.07 ± 3.61 ms

What To Try In 7 Days

Build an offline tri-layer memory from your chat logs (KG + clustered experiences + passages).

Replace full-history prompting with dual-channel retrieval and cap contexts near 1k tokens to measure token and latency savings.

Add a simple temporal normalization step and session-level KG review to fix obvious time contradictions.
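
For the temporal normalization step, a first pass can be as simple as resolving relative expressions against the session date. The sketch below is a rule-based stand-in under that assumption; the function name `normalize_time` and the small offset table are hypothetical, and the paper's own pipeline relies on LLM-based extraction rather than a lookup like this.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical table of relative expressions and their day offsets.
RELATIVE_OFFSETS = {
    "today": 0,
    "yesterday": -1,
    "tomorrow": 1,
    "last week": -7,
    "next week": 7,
}


def normalize_time(expression: str, session_date: date) -> Optional[str]:
    """Map a time expression to an absolute ISO date, or None if unknown."""
    key = expression.strip().lower()
    if key in RELATIVE_OFFSETS:
        resolved = session_date + timedelta(days=RELATIVE_OFFSETS[key])
        return resolved.isoformat()
    try:
        # Already-absolute dates in ISO form pass through unchanged.
        return date.fromisoformat(key).isoformat()
    except ValueError:
        return None  # leave unresolved rather than guess


print(normalize_time("yesterday", date(2023, 5, 8)))  # 2023-05-07
```

Returning `None` for anything unrecognized keeps bad guesses out of the graph; those cases can be routed to an LLM or left for session-level review.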

Agent Features

Memory

  • Graph Memory (time-normalized triples)
  • Experience Memory (clustered reusable items)
  • Passage Memory (raw evidence index)

Planning

  • session-level KG review (add/update/deny)
  • experience induction to capture recurring patterns
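
The add/update/deny decision in session-level review can be illustrated with a small rule-based stand-in. This is an assumption-heavy sketch: it treats every relation as single-valued and decides purely by timestamp, whereas the summary describes the review as LLM-based. The function name `review` and the dictionary layout are hypothetical.

```python
from typing import Dict, Tuple

Fact = Tuple[str, str]  # (object, ISO timestamp)


def review(kg: Dict[Tuple[str, str], Fact],
           candidate: Tuple[str, str, str, str]) -> str:
    """Apply one candidate (subject, relation, object, timestamp) to the KG."""
    subject, relation, obj, timestamp = candidate
    key = (subject, relation)
    current = kg.get(key)
    if current is None:
        kg[key] = (obj, timestamp)
        return "add"
    current_obj, current_ts = current
    # ISO dates compare correctly as strings, so newer facts win.
    if timestamp > current_ts and obj != current_obj:
        kg[key] = (obj, timestamp)
        return "update"
    # Older, duplicate, or same-time conflicting candidates are rejected.
    return "deny"


kg: Dict[Tuple[str, str], Fact] = {}
decisions = [
    review(kg, ("Alice", "lives_in", "Paris", "2023-01-10")),
    review(kg, ("Alice", "lives_in", "Berlin", "2023-05-07")),
    review(kg, ("Alice", "lives_in", "Paris", "2022-12-01")),
]
print(decisions)  # ['add', 'update', 'deny']
```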

Tool Use

  • LLM-based entity/relation extraction
  • LLM-based cluster coherence and routing
  • offline builder (DeepSeek-V3.2) for memory construction

Frameworks

  • dual-channel retrieval
  • KG triple index + dense text retriever
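
A toy version of dual-channel retrieval shows how the two channels can be merged. Everything here is a simplification under stated assumptions: the structured channel is a plain entity match over triples, the "dense" channel is a bag-of-words cosine standing in for a real embedding model, and the names `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# (subject, relation, object, timestamp, passage_id)
TripleRow = Tuple[str, str, str, str, str]


def embed(text: str) -> Counter:
    # Toy stand-in for a dense encoder: lowercase word counts.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, triples: List[TripleRow], passages: Dict[str, str],
             k_triples: int = 6, k_passages: int = 6):
    q_tokens = set(re.findall(r"\w+", query.lower()))
    # Channel 1: structured lookup on entities mentioned in the query.
    matched = [t for t in triples
               if t[0].lower() in q_tokens or t[2].lower() in q_tokens]
    matched = matched[:k_triples]
    # Channel 2: similarity ranking over raw passages.
    q_vec = embed(query)
    ranked = sorted(passages,
                    key=lambda pid: cosine(q_vec, embed(passages[pid])),
                    reverse=True)[:k_passages]
    # Evidence linked to matched triples first, then dense hits, deduplicated.
    passage_ids = list(dict.fromkeys([t[4] for t in matched] + ranked))
    return matched, passage_ids
```

Putting triple-linked passages ahead of the similarity hits means every structured fact in the context arrives with its supporting text, which is the traceability property the design emphasizes.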

Is Agentic

true

Architectures

  • tri-layer memory (Graph/Experience/Passage)
  • temporally grounded knowledge graph

Collaboration

  • structured links let the LLM combine graph facts and passages during inference

Optimization Features

Token Efficiency

  • compact context ~1k tokens vs ~22k long-context prompting

Infra Optimization

  • trades retrieval latency for lower LM compute during generation

System Optimization

  • offline memory construction to avoid repeated LLM writes at inference
  • buffered cluster updates to amortize LLM extraction cost
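
Buffered cluster updates can be pictured as a small queue in front of the summarizer: new items accumulate and the expensive LLM call runs once per batch. The sketch below assumes a hypothetical `summarize(previous_summary, items)` callable and a `flush_size` threshold; neither name nor the batch size comes from the paper.

```python
from typing import Callable, List


class BufferedCluster:
    """Accumulate items and refresh the cluster summary once per batch."""

    def __init__(self, summarize: Callable[[str, List[str]], str],
                 flush_size: int = 4):
        self.summarize = summarize    # stand-in for an LLM call
        self.flush_size = flush_size  # hypothetical batch threshold
        self.pending: List[str] = []
        self.summary = ""
        self.llm_calls = 0

    def add(self, item: str) -> None:
        self.pending.append(item)
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        self.summary = self.summarize(self.summary, self.pending)
        self.llm_calls += 1
        self.pending = []


def join_items(previous: str, items: List[str]) -> str:
    # Trivial summarizer used only to make the example runnable.
    return (previous + " " + "; ".join(items)).strip()


cluster = BufferedCluster(join_items, flush_size=4)
for i in range(8):
    cluster.add(f"note {i}")
print(cluster.llm_calls)  # 2
```

Eight items trigger two summarizer calls instead of eight, which is the amortization the bullet describes.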

Inference Optimization

  • drastically reduced LM input tokens (↓ >95%)
  • selective retrieval budgets (kr=kp=ke=6 default)
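
The retrieval budgets and the ~1k-token target can be combined in a simple context builder. This sketch assumes kr, kp, and ke are per-channel caps on graph relations, passages, and experience items (the summary does not spell this out), and it approximates token counts by whitespace splitting rather than a real tokenizer. `build_context` is a hypothetical helper.

```python
from typing import List


def build_context(triples: List[str], passages: List[str],
                  experiences: List[str], k_r: int = 6, k_p: int = 6,
                  k_e: int = 6, max_tokens: int = 1000) -> str:
    """Take the top-k items per channel, then stop at the token cap."""
    candidates = triples[:k_r] + experiences[:k_e] + passages[:k_p]
    context, used = [], 0
    for item in candidates:
        cost = len(item.split())  # crude token estimate (assumption)
        if used + cost > max_tokens:
            break
        context.append(item)
        used += cost
    return "\n".join(context)


facts = [f"fact {i}" for i in range(10)]
notes = [f"experience {i}" for i in range(10)]
texts = [f"passage {i}" for i in range(10)]
context = build_context(facts, texts, notes)
print(len(context.splitlines()))  # 18
```

With ten candidates per channel and budgets of six, eighteen lines survive; tightening `max_tokens` trims from the end of the list, so the ordering of channels encodes which evidence is kept first.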

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Memory construction is done offline using a stronger LLM; online construction with small local LMs is unreliable.
  • Depends on underlying LLM quality for entity/relation extraction and experience induction.
  • Current design targets text-only interactions; multimodal support is unaddressed.
  • Retrieval adds latency and a modest memory overhead compared to flat retrieval.

When Not To Use

  • When you require millisecond-scale end-to-end latency and cannot tolerate retrieval overhead.
  • When there is no meaningful multi-session or long-term history to consolidate.
  • When your data is multimodal (images/audio) and you need immediate multimodal memory support.

Failure Modes

  • Incorrect or noisy LLM extractions produce wrong triples that persist until reviewed.
  • Cluster incoherence leads to spurious experience items and wrong generalizations.
  • Errors in temporal normalization can cause misordered events and wrong answers.
  • Over-reliance on offline construction may miss real-time updates or corrections.

Core Entities

Models

  • GPT-4o-mini
  • Llama3.2-3B
  • Llama3.2-1B
  • Qwen2.5-1.5B
  • DeepSeek-V3.2

Metrics

  • token-level F1
  • BLEU-1
  • ROUGE-2
  • ROUGE-L
  • Exact Match (EM)
  • SBERT similarity
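
Token-level F1, the headline metric above, is usually computed from the token overlap between a predicted and a reference answer. The version below follows the common SQuAD-style recipe with lowercase whitespace tokenization; the exact normalization used in the LoCoMo evaluation is not stated here, so treat this as an approximation.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (multiset overlap)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("moved to Berlin", "Berlin"))  # 0.5
```

In the example, one of three predicted tokens matches the single reference token, giving precision 1/3, recall 1, and F1 of 0.5.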

Datasets

  • LoCoMo

Benchmarks

  • LoCoMo long-horizon QA