Query-aware indexing cuts memory search to ~11ms — 47× faster while keeping competitive accuracy

January 13, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

SwiftMem demonstrates clear engineering gains in latency and practical mechanisms (time and tag indexing). Evidence comes from standard benchmarks and ablations, but full system code is not yet released and evaluation relies on LLM-as-judge metrics.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Anxin Tian, Yiming Li, Xing Li, Hui-Ling Zhen, Lei Chen, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan

Why It Matters For Business

Cut memory search latency into the low tens of milliseconds so memory-augmented agents respond in real time while lowering infrastructure and throughput costs.

Summary TLDR

SwiftMem replaces brute-force memory scans with a three-tier, query-aware index (temporal index, semantic DAG-Tag index, and embedding index with co-consolidation). On LoCoMo and LongMemEval, SwiftMem reduces search latency to ~11 ms (about 47× faster than the 522 ms Zep baseline), keeps competitive semantic accuracy (overall LLM score ~0.70 on LoCoMo), and raises lexical precision (BLEU-1 0.467). The system focuses each query on a small subset of memory using time and topic signals, making real-time memory-augmented agents practical.

Problem Statement

Existing agent memory systems scan the entire memory for every query (O(N_mem)). As conversation history grows, search latency balloons and real-time agent responses become impractical. SwiftMem targets this scalability and latency bottleneck by indexing memory by time and semantic tags so queries search only relevant subsets.

Main Contribution

Three-tier query-aware indexing: temporal index, semantic DAG-Tag index, and embedding index with co-consolidation.

Temporal index enabling binary-searchable user timelines for O(log N_mem) time-range queries.
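The binary-searchable timeline idea can be illustrated with a minimal sketch. The class name `TemporalIndex`, its methods, and the integer timestamps are assumptions made for illustration; the summary does not describe SwiftMem's actual data structures.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class TemporalIndex:
    """Per-user timeline kept sorted by timestamp (hypothetical structure)."""

    def __init__(self):
        # user_id -> parallel lists, both ordered by timestamp
        self._times = defaultdict(list)
        self._episodes = defaultdict(list)

    def add(self, user_id, timestamp, episode_id):
        # Insert at the sorted position; in-order arrivals land at the end.
        pos = bisect_right(self._times[user_id], timestamp)
        self._times[user_id].insert(pos, timestamp)
        self._episodes[user_id].insert(pos, episode_id)

    def range(self, user_id, start, end):
        # Two binary searches locate the window [start, end] in O(log N).
        times = self._times.get(user_id, [])
        lo = bisect_left(times, start)
        hi = bisect_right(times, end)
        return self._episodes.get(user_id, [])[lo:hi]


index = TemporalIndex()
index.add("u1", 100, "ep-a")
index.add("u1", 300, "ep-c")
index.add("u1", 200, "ep-b")
print(index.range("u1", 150, 300))  # ['ep-b', 'ep-c']
```

Only the lookup is logarithmic here; copying out the matching slice still costs time proportional to the number of episodes returned.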

Key Findings

Search latency reduced to ~11 ms per query on LoCoMo.

Numbers: Search = 11 ms (Table 2)

Practical Use: Expect sub-15 ms retrieval for real-time agents; drop end-to-end inference cost compared to full-history processing.

Evidence Ref: Table 2

Measured up to 47× faster search compared to state-of-the-art memory frameworks.

Numbers: 47× speedup (11 ms vs. 522 ms baseline example)

Practical Use: Large memory stores can be served interactively without huge CPU/GPU cost by adopting query-aware indexing.

Evidence Ref: Abstract, Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Search latency (SwiftMem) | 11 ms | Nemori 835 ms; Zep 522 ms | ≈47× faster vs. 522 ms example | LoCoMo | Table 2 reports 11 ms search latency for SwiftMem | Table 2 |
| Total end-to-end latency (SwiftMem) | 1,289 ms | FullContext 5,806 ms; RAG-4096 2,884 ms | ≈4.5× faster than FullContext | LoCoMo | Table 2 shows total latency 1,289 ms | Table 2 |
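The deltas in the table can be re-derived from the reported latencies. The short script below is only a check of that arithmetic; the variable names are illustrative and the numbers are copied from the rows above.

```python
# Latencies in milliseconds, copied from the Results table (LoCoMo, Table 2).
swiftmem_search_ms = 11
swiftmem_total_ms = 1289
search_baselines_ms = {"Nemori": 835, "Zep": 522}
total_baselines_ms = {"FullContext": 5806, "RAG-4096": 2884}


def speedup(baseline_ms, system_ms):
    """Return how many times faster the system is than the baseline."""
    return baseline_ms / system_ms


search_speedups = {
    name: round(speedup(ms, swiftmem_search_ms), 1)
    for name, ms in search_baselines_ms.items()
}
total_speedups = {
    name: round(speedup(ms, swiftmem_total_ms), 1)
    for name, ms in total_baselines_ms.items()
}
# Share of SwiftMem's end-to-end latency that is spent in search, in percent.
search_share_pct = round(100 * swiftmem_search_ms / swiftmem_total_ms, 2)

print(search_speedups)   # {'Nemori': 75.9, 'Zep': 47.5}
print(total_speedups)    # {'FullContext': 4.5, 'RAG-4096': 2.2}
print(search_share_pct)  # 0.85
```

The 47× headline figure corresponds to the Zep row. By these numbers, search accounts for under 1% of SwiftMem's total latency; the summary does not break down where the rest of the time goes.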

What To Try In 7 Days

Add a lightweight temporal index for user timelines to speed time-based queries.

Tag episodes with a small set of topic tags and route queries to those tags first.

Batch a periodic embedding-tag consolidation to colocate semantically related vectors and measure latency change.
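As a starting point for the tag-routing experiment, a toy tag index with hierarchical expansion might look like the sketch below. The `TagIndex` class, its method names, and the example tags are all hypothetical; in the paper's setup tags are generated by an LLM, whereas here they are passed in by hand.

```python
from collections import defaultdict, deque


class TagIndex:
    """Toy DAG-tag index: tags form a parent -> children graph (hypothetical)."""

    def __init__(self):
        self.children = defaultdict(set)  # tag -> more specific tags
        self.episodes = defaultdict(set)  # tag -> episode ids carrying that tag

    def add_edge(self, parent, child):
        self.children[parent].add(child)

    def tag_episode(self, episode_id, tags):
        for tag in tags:
            self.episodes[tag].add(episode_id)

    def expand(self, tag):
        # Breadth-first walk over descendants; `seen` avoids revisiting
        # tags that are reachable through more than one parent.
        seen, queue = {tag}, deque([tag])
        while queue:
            current = queue.popleft()
            for child in self.children.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen

    def candidates(self, query_tags):
        # Union of episodes under every query tag and its descendants.
        found = set()
        for tag in query_tags:
            for expanded in self.expand(tag):
                found |= self.episodes.get(expanded, set())
        return found


index = TagIndex()
index.add_edge("travel", "flights")
index.add_edge("travel", "hotels")
index.tag_episode("ep-1", ["flights"])
index.tag_episode("ep-2", ["hotels"])
index.tag_episode("ep-3", ["cooking"])
print(sorted(index.candidates(["travel"])))  # ['ep-1', 'ep-2']
```

Embedding similarity would then be computed only over the returned candidate set, which is where the latency saving is expected to come from.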

Agent Features

Memory
episodic memory (timestamped episodes); semantic DAG-Tag index (hierarchical tags); embedding index with co-consolidation
Tool Use
LLM-based tag generation (for semantic tags)
Frameworks
LLM-as-Judge evaluation
Is Agentic

Yes

Architectures
three-tier indexing (temporal, DAG-tag, embedding)

Optimization Features

Infra Optimization
reduced search CPU/GPU workload from smaller candidate sets
System Optimization
binary-searchable user timelines for O(log N_mem) queries; hierarchical tag expansion to narrow search
Inference Optimization
sub-linear retrieval via tag+temporal filtering; improved cache locality from co-consolidation
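One way to picture the co-consolidation idea is to reorder stored vectors so that all vectors sharing a tag sit next to each other, letting a tag-routed query scan one contiguous slice. The sketch below assumes one primary tag per episode and plain dot-product scoring; the function names and layout are illustrative and not taken from the paper.

```python
def consolidate(embeddings, tags):
    """Reorder vectors so each tag occupies one contiguous slice.

    embeddings: list of vectors (lists of floats), one per episode
    tags: primary tag for each episode, same length as embeddings
    Returns (vectors, order, slices): the reordered vectors, the original
    episode index at each new position, and tag -> (start, end) bounds.
    """
    order = sorted(range(len(tags)), key=lambda i: tags[i])
    vectors = [embeddings[i] for i in order]
    slices = {}
    for pos, i in enumerate(order):
        start, _ = slices.get(tags[i], (pos, pos))
        slices[tags[i]] = (start, pos + 1)
    return vectors, order, slices


def search(query, vectors, order, slices, tag):
    """Score only the slice for `tag`; return the best original episode index."""
    start, end = slices.get(tag, (0, 0))
    best, best_score = None, float("-inf")
    for pos in range(start, end):
        score = sum(q * v for q, v in zip(query, vectors[pos]))  # dot product
        if score > best_score:
            best, best_score = order[pos], score
    return best


embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.2, 0.8]]
tags = ["b", "a", "b", "a"]
vectors, order, slices = consolidate(embeddings, tags)
print(slices)                                                # {'a': (0, 2), 'b': (2, 4)}
print(search([0.0, 1.0], vectors, order, slices, tag="a"))   # 1
```

Python lists do not give real memory contiguity, so this only shows the layout; a production version would more likely keep the vectors in a single packed array so each tag's slice is physically adjacent.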

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies on LLM-based tag generation; tag errors or inconsistent tags can harm routing.

Temporal index helps only when queries include or allow reliable time cues.

When Not To Use

When you must guarantee the highest possible semantic accuracy, since FullContext scored higher on some LLM metrics.

When conversational memory is tiny and exhaustive search cost is negligible.

Failure Modes

Poor tag mapping causes relevant memories to be excluded from the candidate set.

Over-aggressive temporal filtering could omit relevant episodes for queries with vague time cues.

Core Entities

Models

GPT-4.1-mini (LLM-as-judge), GPT-4o-mini (evaluation)

Metrics

LLM Score, F1, BLEU-1, Search latency (ms), Total latency (ms)

Datasets

LoCoMo, LongMemEval

Benchmarks

LoCoMo, LongMemEval

Context Entities

Models

Nemori, LangMem, Mem0, Zep, RAG-4096, FullContext