Add a semantic timeline and durative summaries so agents recall events at the right time

January 12, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.4

Citation Count

0

Authors

Miao Su, Yucan Guo, Zhongni Hou, Long Bai, Zixuan Li, Yufei Zhang, Guojun Yin, Wei Lin, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng

Links

Abstract / PDF

Why It Matters For Business

TSM lets assistants recall events according to when they actually occurred rather than when they were mentioned, which improves time-sensitive answers and multi-session personalization. This can reduce wrong or stale recommendations in customer support and personal assistants.

Summary TLDR

This paper introduces Temporal Semantic Memory (TSM), a memory system that records when events actually happen (semantic time) and consolidates related events into durative summaries. TSM builds a Temporal Knowledge Graph (TKG) for event timestamps, clusters events by time into monthly topics/personas, and reranks retrieval results to match a query's time intent. On LONGMEMEVAL and LOCOMO, TSM improves accuracy over strong memory baselines (e.g., +12.2 percentage points vs A‑MEM on LONGMEMEVAL_S) and helps multi-session and temporal reasoning.

Problem Statement

Existing agent memories use dialogue timestamps or isolated event entries. That causes two problems: events get stored under the wrong time, and continuous experiences get split into point records. Agents then fail to retrieve temporally coherent, duration-aware context for time-sensitive or multi-session queries.

Main Contribution

Temporal Semantic Memory (TSM): organizes memory by event time, not chat time, to ground retrieval in actual occurrence intervals.

Durative memory: clusters temporally contiguous and semantically related episodic facts into monthly topic and persona summaries.

Semantic-time retrieval: parses a query's intended time window, filters and reranks candidates using the Temporal Knowledge Graph (TKG) to enforce time validity.

Efficient maintenance: lightweight online updates to the TKG plus periodic ‘sleep-time’ consolidation for summaries.
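
One way to picture the difference between chat time and event time is a fact record that carries both clocks. The sketch below is an illustration under assumptions, not the paper's implementation: `TemporalFact`, its field names, and `facts_in_window` are hypothetical, and the sample facts are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TemporalFact:
    """Hypothetical record: one fact with both dialogue time and event time."""
    text: str
    said_on: date      # when the user mentioned it (dialogue time)
    valid_from: date   # when the event started (semantic time)
    valid_to: date     # when the event ended, inclusive


def facts_in_window(facts, start, end):
    """Keep facts whose event interval overlaps the closed window [start, end]."""
    return [f for f in facts if f.valid_from <= end and f.valid_to >= start]


facts = [
    # Mentioned in March 2025, but the trip itself took place in December 2024.
    TemporalFact("Trip to Paris", date(2025, 3, 10), date(2024, 12, 20), date(2024, 12, 27)),
    TemporalFact("Started a new job", date(2025, 3, 10), date(2025, 3, 3), date(2025, 3, 3)),
]

# Query: "What did I do in December 2024?"
december = facts_in_window(facts, date(2024, 12, 1), date(2024, 12, 31))
print([f.text for f in december])
```

Filtering on `said_on` instead would find nothing for December 2024 and would return both facts for March 2025, which is the misplacement the Problem Statement describes.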

Key Findings

TSM raises overall QA accuracy on LONGMEMEVAL_S to 74.80%

Numbers: TSM 74.80% vs A-MEM 62.60% (+12.20 pp)

Large gains on time-sensitive questions

Numbers: Temporal category +22.56 pp (reported on LONGMEMEVAL_S)

Durative summaries improve multi-session reasoning

Numbers: Multi-Session accuracy up to 69.17% (TSM) vs 48.87% (A-MEM) on GPT-4o-mini

Temporal modeling and summaries both matter (ablation)

Numbers: Removing temporal modeling: overall -2.0 pp, temporal -6.0 pp; removing summaries: overall -1.2 pp

Results

Accuracy

Value: 74.80%

Baseline: A-MEM 62.60%

Accuracy

Value: 76.69%

Baseline: Naive RAG 63.64%

Accuracy

Value: 69.92%

Baseline: A-MEM 47.36%
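
The gaps implied by these value/baseline pairs can be checked with a few lines of arithmetic. The snippet is only a sanity check on the numbers listed above; the tuple layout and variable names are illustrative. The third gap works out to 22.56 pp, which matches the temporal-category gain listed under Key Findings.

```python
# (system score, baseline name, baseline score), copied from the Results entries.
results = [
    (74.80, "A-MEM", 62.60),
    (76.69, "Naive RAG", 63.64),
    (69.92, "A-MEM", 47.36),
]

gaps = []
for value, baseline_name, baseline in results:
    gap_pp = round(value - baseline, 2)                      # absolute gap, percentage points
    rel_pct = round(100 * (value - baseline) / baseline, 1)  # gap relative to the baseline
    gaps.append(gap_pp)
    print(f"vs {baseline_name}: +{gap_pp:.2f} pp ({rel_pct}% relative)")
```

Percentage points measure the absolute difference between two accuracies; the relative figure divides that difference by the baseline, so it is always larger when the baseline is below 100%.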

Who Should Care

What To Try In 7 Days

Parse user queries for time expressions with spaCy and test retrieval filtered by that window.

Index events with a small Temporal Knowledge Graph (valid_time/invalid_time fields).

Cluster recent events monthly and generate short topic/persona summaries for retrieval trials on a small user cohort.
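
A first pass at the query-parsing and filtering step might look like the sketch below. It deliberately swaps spaCy for a standard-library regex so that it runs without a model download, and it only understands "Month YYYY" phrases. `parse_month_window` and `filter_by_window` are hypothetical helper names, and the events are made-up sample data.

```python
import calendar
import re
from datetime import date

# Map lowercase English month names to their numbers, e.g. "march" -> 3.
MONTHS = {name.lower(): i for i, name in enumerate(calendar.month_name) if name}
PATTERN = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{4})\b", re.IGNORECASE)


def parse_month_window(query):
    """Return (first_day, last_day) for the first 'Month YYYY' phrase, or None."""
    match = PATTERN.search(query)
    if match is None:
        return None
    month, year = MONTHS[match.group(1).lower()], int(match.group(2))
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def filter_by_window(events, window):
    """Keep (event_date, text) pairs inside the window; no window means no filter."""
    if window is None:
        return list(events)
    start, end = window
    return [e for e in events if start <= e[0] <= end]


events = [
    (date(2025, 2, 14), "Booked a dentist appointment"),
    (date(2025, 3, 2), "Signed up for a pottery class"),
    (date(2025, 3, 28), "Cancelled the gym membership"),
]

window = parse_month_window("What did I sign up for in March 2025?")
print(filter_by_window(events, window))
```

Returning every event when no time expression is found is a design choice in this sketch: a parsing miss then costs precision rather than hiding evidence entirely.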

Agent Features

Memory

  • Episodic memory (time-grounded facts)
  • Durative memory (consolidated, lasting summaries)
  • Hierarchical update: online graph + sleep-time consolidation
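
The split between cheap online writes and deferred consolidation can be sketched as follows. This is an illustration under assumptions: the class `HierarchicalMemory`, its methods, and the string-joining `summarize` stand-in (where a real system would presumably call an LLM) are all hypothetical, and grouping is by calendar month only, without the semantic clustering described in this summary.

```python
from collections import defaultdict
from datetime import date


def summarize(texts):
    """Stand-in for an LLM summarizer: joins the facts into one line."""
    return "; ".join(texts)


class HierarchicalMemory:
    def __init__(self):
        self.episodic = []    # (event_date, text) facts, written online
        self.durative = {}    # "YYYY-MM" -> summary, rebuilt offline
        self._dirty = set()   # months touched since the last consolidation

    def add_fact(self, event_date, text):
        """Online path: append the fact and mark its month as stale."""
        self.episodic.append((event_date, text))
        self._dirty.add(event_date.strftime("%Y-%m"))

    def consolidate(self):
        """Sleep-time path: rebuild summaries only for stale months."""
        by_month = defaultdict(list)
        for event_date, text in sorted(self.episodic):
            key = event_date.strftime("%Y-%m")
            if key in self._dirty:
                by_month[key].append(text)
        for key, texts in by_month.items():
            self.durative[key] = summarize(texts)
        rebuilt = sorted(self._dirty)
        self._dirty.clear()
        return rebuilt


memory = HierarchicalMemory()
memory.add_fact(date(2025, 1, 5), "Started marathon training")
memory.add_fact(date(2025, 1, 20), "Ran a 10k")
memory.add_fact(date(2025, 2, 2), "Bought new running shoes")
rebuilt_months = memory.consolidate()
print(rebuilt_months, memory.durative)
```

Tracking stale months keeps the expensive step proportional to what changed, so `add_fact` stays a constant-time append.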

Tool Use

  • spaCy for time parsing
  • embedding models for dense retrieval

Frameworks

  • TSM (Temporal Semantic Memory)

Is Agentic

true

Architectures

  • Temporal Knowledge Graph (TKG)
  • Durative memory summaries (monthly topics/personas)

Optimization Features

System Optimization

  • Separate lightweight online graph updates from expensive summary consolidation

Inference Optimization

  • Top-K retrieval (Top-K=25)
  • Sleep-time consolidation to reduce online cost
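
To make the Top-K step concrete, the sketch below reranks candidates by a similarity score plus a bonus for falling inside the query's time window, then keeps the K best. The additive scoring rule, the `time_bonus` weight, and the function name are assumptions made for illustration; this summary states only that Top-K=25 is used, not how scores are combined.

```python
import heapq
from datetime import date

TOP_K = 25  # value reported in this summary


def rerank_top_k(candidates, window, k=TOP_K, time_bonus=0.5):
    """Rerank (similarity, event_date, text) tuples with a time-window bonus.

    Adds time_bonus to candidates whose event_date lies in the closed window,
    then returns the k highest-scoring (score, text) pairs, best first.
    """
    start, end = window
    scored = [
        (sim + (time_bonus if start <= day <= end else 0.0), text)
        for sim, day, text in candidates
    ]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


candidates = [
    (0.80, date(2024, 11, 3), "Visited Rome"),
    (0.60, date(2024, 12, 22), "Visited Paris"),
    (0.40, date(2025, 1, 9), "Visited Oslo"),
]
window = (date(2024, 12, 1), date(2024, 12, 31))
top = rerank_top_k(candidates, window, k=2)
print([text for _, text in top])
```

With the window set to December 2024, the Paris entry overtakes the Rome entry despite its lower similarity, which is the intended effect of matching the query's time intent.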

Reproducibility

Data Urls

  • LONGMEMEVAL (public benchmark)
  • LOCOMO (public benchmark)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Uses fixed monthly granularity for durative summaries; may miss finer or coarser temporal patterns.
  • Focuses on personalization; not evaluated for procedural or shared multi-agent memory.
  • Evaluation uses single-run experiments and two benchmarks, which may limit generality.

When Not To Use

  • When the whole conversation fits in the model's context window (passing the full text performs better).
  • When compute or maintenance budget cannot support periodic consolidation or embedding storage.
  • When you need procedural or skill memory rather than user factual/persona memory.

Failure Modes

  • Incorrect time parsing leads to wrong temporal filters and missed evidence.
  • Over-consolidation can drop fine-grained facts needed for some queries.
  • Noisy clustering may mix unrelated events into durative summaries and mislead retrieval.

Core Entities

Models

  • GPT-4o-mini
  • Qwen3-30B-A3B-Instruct-2507

Metrics

  • Accuracy

Datasets

  • LONGMEMEVAL
  • LOCOMO

Benchmarks

  • LongMemEval
  • LOCOMO

Context Entities

Models

  • A-MEM
  • Mem0
  • Mem0g
  • Zep
  • MemoryOS
  • LangMem
  • Naive RAG