Query-aware indexing cuts memory search to ~11ms — 47× faster while keeping competitive accuracy

January 13, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

0

Authors

Anxin Tian, Yiming Li, Xing Li, Hui-Ling Zhen, Lei Chen, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan

Why It Matters For Business

Cut memory search latency into the low tens of milliseconds so memory-augmented agents respond in real time while lowering infrastructure and throughput costs.

Summary TLDR

SwiftMem replaces brute-force memory scans with a three-tier, query-aware index: a temporal index, a semantic DAG-Tag index, and an embedding index with co-consolidation. On LoCoMo and LongMemEval, SwiftMem reduces search latency to ~11 ms (about 47× faster than the 522 ms Zep baseline), keeps competitive semantic accuracy (overall LLM score 0.704 on LoCoMo), and raises lexical precision (BLEU-1 0.467). The system focuses each query on a small subset of memory using time and topic signals, which makes real-time memory-augmented agents practical.

Problem Statement

Existing agent memory systems scan the entire memory for every query (O(N_mem)). As conversation history grows, search latency balloons and real-time agent responses become impractical. SwiftMem targets this scalability and latency bottleneck by indexing memory by time and semantic tags so queries search only relevant subsets.
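To make the O(N_mem) cost concrete, here is a minimal sketch of the brute-force pattern described above. It is an illustration only, not code from the paper: the `cosine` and `brute_force_search` helpers, the toy three-dimensional vectors, and the episode texts are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def brute_force_search(query_vec, memory, k=2):
    """Score every stored episode, so work grows linearly with len(memory)."""
    scored = [(cosine(query_vec, vec), text) for text, vec in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Hypothetical memory: (episode text, toy embedding)
memory = [
    ("booked flight to Lisbon", [0.9, 0.1, 0.0]),
    ("asked about pasta recipe", [0.0, 0.2, 0.9]),
    ("changed hotel in Lisbon", [0.8, 0.3, 0.1]),
]
top = brute_force_search([1.0, 0.0, 0.0], memory)
```

Every query touches every episode, which is the cost that indexing by time and tags is meant to avoid.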

Main Contribution

Three-tier query-aware indexing: temporal index, semantic DAG-Tag index, and embedding index with co-consolidation.

Temporal index enabling binary-searchable user timelines for O(log N_mem) time-range queries.

Semantic DAG-Tag routing that maps queries to small tag sets and expands them hierarchically to avoid full scans.

Embedding-tag co-consolidation that reorganizes embeddings by semantic clusters to improve cache locality and speed.
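The temporal tier can be pictured as a sorted per-user timeline queried with binary search. The sketch below is one possible shape under stated assumptions: the `TemporalIndex` class, its method names, and the integer timestamps are hypothetical and not taken from the paper.

```python
import bisect

class TemporalIndex:
    """Per-user timeline kept sorted by timestamp (hypothetical structure)."""

    def __init__(self):
        self._timestamps = {}  # user_id -> sorted list of timestamps
        self._episodes = {}    # user_id -> episode ids, parallel to timestamps

    def add(self, user_id, timestamp, episode_id):
        ts = self._timestamps.setdefault(user_id, [])
        eps = self._episodes.setdefault(user_id, [])
        pos = bisect.bisect_right(ts, timestamp)  # keep the list sorted
        ts.insert(pos, timestamp)
        eps.insert(pos, episode_id)

    def range(self, user_id, start, end):
        """Episode ids with start <= timestamp <= end."""
        ts = self._timestamps.get(user_id, [])
        lo = bisect.bisect_left(ts, start)   # first timestamp >= start
        hi = bisect.bisect_right(ts, end)    # one past the last timestamp <= end
        return self._episodes.get(user_id, [])[lo:hi]

index = TemporalIndex()
for t, ep in [(30, "e3"), (10, "e1"), (20, "e2"), (40, "e4")]:
    index.add("alice", t, ep)
hits = index.range("alice", 15, 30)
```

Locating the two range bounds takes O(log N_mem) comparisons. Inserting into a Python list is still linear, so a production system would likely use an append-only log or a tree-based structure instead.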

Key Findings

Search latency reduced to ~11 ms per query on LoCoMo.

Numbers: Search = 11 ms (Table 2)

Search measured up to 47× faster than state-of-the-art memory frameworks.

Numbers: 47× speedup (11 ms vs. 522 ms baseline example)

Maintains competitive semantic accuracy while improving lexical precision.

Numbers: Overall LLM score 0.704; BLEU-1 = 0.467 (highest)

Scales stably: search latency stays below 15 ms as the dataset grows.

Numbers: Latencies 12.61 ms, 11.55 ms, 10.62 ms across scales

Temporal indexing reduces search latency by ~35% when temporal hints are present.

Numbers: Latency 11.1 ms → 7.2 ms (35% reduction)

Tag-embedding co-consolidation preserves recall while improving judge score and latency.

Numbers: LLM judge 64.3% → 78.6%; latency 10.2 ms → 7.4 ms; recall 90.5% stable
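The ratios quoted in these findings can be re-derived from the raw latencies. The short check below uses only the numbers reported above; the helper names `speedup` and `reduction_pct` are made up for illustration.

```python
def speedup(baseline_ms, new_ms):
    """How many times faster the new latency is than the baseline."""
    return baseline_ms / new_ms

def reduction_pct(before_ms, after_ms):
    """Relative latency reduction, in percent."""
    return 100.0 * (before_ms - after_ms) / before_ms

zep_speedup = speedup(522, 11)                 # about 47.5
temporal_cut = reduction_pct(11.1, 7.2)        # about 35.1
consolidation_cut = reduction_pct(10.2, 7.4)   # about 27.5
```

The first two values match the reported 47× and ~35% figures. The third shows that the co-consolidation latency change (10.2 ms to 7.4 ms) works out to roughly a 27% reduction.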

Results

Search latency (SwiftMem)

Value: 11 ms

Baseline: Nemori 835 ms; Zep 522 ms

Total end-to-end latency (SwiftMem)

Value: 1,289 ms

Baseline: FullContext 5,806 ms; RAG-4096 2,884 ms

Overall semantic quality (LLM Score)

Value: 0.704

Baseline: Nemori 0.792; FullContext 0.806

Lexical precision (BLEU-1)

Value: 0.467 (highest among compared systems)

Baseline: Nemori 0.445; FullContext 0.450

Temporal-index latency improvement

Value: 11.1 ms → 7.2 ms

Baseline: No temporal index

Co-consolidation effect (LLM judge & latency)

Value: LLM judge 64.3% → 78.6%; latency 10.2 ms → 7.4 ms

Baseline: Before consolidation

Who Should Care

What To Try In 7 Days

Add a lightweight temporal index for user timelines to speed time-based queries.

Tag episodes with a small set of topic tags and route queries to those tags first.

Batch a periodic embedding-tag consolidation to colocate semantically related vectors and measure latency change.
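Each of these experiments needs a before/after latency number. One way to get it is a small timing harness like the sketch below. The `measure_latency_ms` helper and the two stand-in search functions are hypothetical placeholders for your own search path, not part of SwiftMem.

```python
import statistics
import time

def measure_latency_ms(search_fn, queries, repeats=5):
    """Median wall-clock latency per query, in milliseconds."""
    samples = []
    for _ in range(repeats):
        for q in queries:
            start = time.perf_counter()
            search_fn(q)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Hypothetical stand-ins for a search path before and after narrowing
def full_scan(q, corpus=tuple(range(200_000))):
    return [x for x in corpus if x % 1000 == q]

def narrowed_scan(q, corpus=tuple(range(2_000))):
    return [x for x in corpus if x % 1000 == q]

before = measure_latency_ms(full_scan, queries=[1, 2, 3])
after = measure_latency_ms(narrowed_scan, queries=[1, 2, 3])
```

Reporting the median rather than the mean keeps a single slow outlier from dominating the comparison. Use a realistic query mix, since the temporal index only helps queries that carry time cues.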

Agent Features

Memory

  • episodic memory (timestamped episodes)
  • semantic DAG-Tag index (hierarchical tags)
  • embedding index with co-consolidation

Tool Use

  • LLM-based tag generation (for semantic tags)

Frameworks

  • LLM-as-Judge evaluation

Is Agentic

true

Architectures

  • three-tier indexing (temporal, DAG-tag, embedding)

Optimization Features

Infra Optimization

  • reduced search CPU/GPU workload from smaller candidate sets

System Optimization

  • binary-searchable user timelines for O(log N_mem) queries
  • hierarchical tag expansion to narrow search
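Hierarchical tag expansion can be sketched as a bounded breadth-first walk over a tag DAG, followed by a union of the episodes filed under the reached tags. Everything in the example is an assumption made for illustration: the tag names, the parent-to-child direction of expansion, the `max_depth` parameter, and the inverted index layout.

```python
from collections import deque

# Hypothetical tag DAG: parent tag -> child tags
TAG_CHILDREN = {
    "travel": ["flights", "hotels"],
    "flights": ["delays"],
    "food": ["recipes"],
}

# Hypothetical inverted index: tag -> episode ids
TAG_TO_EPISODES = {
    "travel": {"e1"},
    "flights": {"e2"},
    "delays": {"e3"},
    "hotels": {"e4"},
    "recipes": {"e5"},
}

def expand_tags(seed_tags, max_depth=1):
    """Breadth-first expansion from seed tags, at most max_depth levels down."""
    seen = set(seed_tags)
    queue = deque((tag, 0) for tag in seed_tags)
    while queue:
        tag, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand below the depth limit
        for child in TAG_CHILDREN.get(tag, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen

def candidates(seed_tags, max_depth=1):
    """Union of episode ids filed under the expanded tag set."""
    ids = set()
    for tag in expand_tags(seed_tags, max_depth):
        ids |= TAG_TO_EPISODES.get(tag, set())
    return ids

narrow = candidates(["travel"], max_depth=1)  # travel, flights, hotels
wide = candidates(["travel"], max_depth=2)    # also reaches delays
```

The depth limit trades recall against candidate-set size: a deeper walk admits more episodes but moves back toward a full scan.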

Inference Optimization

  • sub-linear retrieval via tag+temporal filtering
  • improved cache locality from co-consolidation
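The cache-locality idea behind co-consolidation is that embeddings sharing a tag end up stored next to each other, so a tag-routed query reads one contiguous slice. The sketch below shows only that layout step. It assumes one tag per episode and uses plain Python lists, and the `consolidate` function is hypothetical rather than the paper's procedure.

```python
def consolidate(episodes):
    """Reorder episodes so each tag occupies one contiguous slice.

    episodes: list of (episode_id, tag, vector) in arrival order.
    Returns (ordered, slices), where slices[tag] = (start, end).
    """
    ordered = sorted(episodes, key=lambda ep: ep[1])  # stable sort by tag
    slices = {}
    for pos, (_, tag, _) in enumerate(ordered):
        start, _ = slices.get(tag, (pos, pos))
        slices[tag] = (start, pos + 1)  # extend the slice to include pos
    return ordered, slices

# Hypothetical episodes arriving with interleaved topics
episodes = [
    ("e1", "travel", [0.9, 0.1]),
    ("e2", "food", [0.1, 0.9]),
    ("e3", "travel", [0.8, 0.2]),
    ("e4", "food", [0.2, 0.8]),
]
ordered, slices = consolidate(episodes)
start, end = slices["travel"]
travel_ids = [ep_id for ep_id, _, _ in ordered[start:end]]
```

In pure Python this mainly demonstrates the bookkeeping. The locality benefit would appear when the vectors live in a contiguous numeric array or on-disk segment, where reading a slice avoids scattered memory access.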

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on LLM-based tag generation; tag errors or inconsistent tags can harm routing.
  • Temporal index helps only when queries include or allow reliable time cues.
  • Evaluation uses LLM-as-judge which can introduce judge bias and may not replace human assessment.
  • Source code not yet released, limiting immediate reproducibility.

When Not To Use

  • When you must guarantee the highest semantic accuracy: FullContext and Nemori scored higher on the overall LLM score (0.806 and 0.792 vs. 0.704).
  • When conversational memory is tiny and exhaustive search cost is negligible.
  • If you cannot generate reliable semantic tags or timestamps for episodes.

Failure Modes

  • Poor tag mapping causes relevant memories to be excluded from the candidate set.
  • Over-aggressive temporal filtering could omit relevant episodes for queries with vague time cues.
  • Consolidation may create transient layout changes that affect hot-path performance during reorganization.

Core Entities

Models

  • GPT-4.1-mini (LLM-as-judge)
  • GPT-4o-mini (evaluation)

Metrics

  • LLM Score
  • F1
  • BLEU-1
  • Search latency (ms)
  • Total latency (ms)

Datasets

  • LoCoMo
  • LongMemEval

Benchmarks

  • LoCoMo
  • LongMemEval

Context Entities

Models

  • Nemori
  • LangMem
  • Mem0
  • Zep
  • RAG-4096
  • FullContext