Query-aware indexing cuts memory search to ~11ms — 47× faster while keeping competitive accuracy

Overview

Decision Snapshot: Ready For Pilot

SwiftMem demonstrates clear engineering gains in latency and practical mechanisms (time and tag indexing). Evidence comes from standard benchmarks and ablations, but full system code is not yet released and evaluation relies on LLM-as-judge metrics.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Anxin Tian, Yiming Li, Xing Li, Hui-Ling Zhen, Lei Chen, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan

Why It Matters For Business

Cut memory search latency into the low tens of milliseconds so memory-augmented agents respond in real time while lowering infrastructure and throughput costs.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

SwiftMem replaces brute-force memory scans with a three-tier, query-aware index (temporal index, semantic DAG-tag index, and embedding index with co-consolidation). On LoCoMo and LongMemEval, SwiftMem reduces search latency to ~11ms (47× faster vs. some baselines), keeps competitive semantic accuracy (overall LLM score ~0.70 on LoCoMo), and raises lexical precision (BLEU-1 0.467). The system focuses each query on a small subset of memory using time and topic signals, making real-time memory-augmented agents practical.

Problem Statement

Existing agent memory systems scan the entire memory for every query (O(N_mem)). As conversation history grows, search latency balloons and real-time agent responses become impractical. SwiftMem targets this scalability and latency bottleneck by indexing memory by time and semantic tags so queries search only relevant subsets.

Main Contribution

Three-tier query-aware indexing: temporal index, semantic DAG-Tag index, and embedding index with co-consolidation.

Temporal index enabling binary-searchable user timelines for O(log N_mem) time-range queries.
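As an illustration of the binary-searchable timeline idea, here is a minimal sketch using Python's standard `bisect` module. The class name `TemporalIndex`, its methods, and the use of plain numeric timestamps are assumptions made for this example; the SwiftMem code is not released, so this is not the authors' implementation.

```python
from bisect import bisect_left, bisect_right


class TemporalIndex:
    """Hypothetical per-user timeline kept sorted by timestamp."""

    def __init__(self):
        self._times = {}     # user_id -> sorted list of timestamps
        self._episodes = {}  # user_id -> episode texts in the same order

    def add(self, user_id, timestamp, text):
        times = self._times.setdefault(user_id, [])
        episodes = self._episodes.setdefault(user_id, [])
        # Find the insertion point by binary search so both lists stay sorted.
        pos = bisect_right(times, timestamp)
        times.insert(pos, timestamp)
        episodes.insert(pos, text)

    def range_query(self, user_id, start, end):
        """Return episodes with start <= timestamp <= end."""
        times = self._times.get(user_id, [])
        lo = bisect_left(times, start)   # first index with timestamp >= start
        hi = bisect_right(times, end)    # first index with timestamp > end
        return self._episodes.get(user_id, [])[lo:hi]


index = TemporalIndex()
index.add("u1", 30, "booked a hotel")
index.add("u1", 10, "asked about flights")
index.add("u1", 20, "picked a flight")
print(index.range_query("u1", 10, 20))
```

Locating the two boundaries costs O(log N) comparisons; copying out the matching slice is proportional to the number of episodes returned. Inserting into a Python list is linear in the worst case, so a production index would likely use a different container.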

Key Findings

Search latency reduced to ~11 ms per query on LoCoMo.

Numbers: Search = 11 ms (Table 2)

Practical Use: Expect sub-15 ms retrieval for real-time agents; drop end-to-end inference cost compared to full-history processing.

Evidence Ref: Table 2

Measured up to 47× faster search compared to state-of-the-art memory frameworks.

Numbers: 47× speedup (11 ms vs. 522 ms baseline example)

Practical Use: Large memory stores can be served interactively without huge CPU/GPU cost by adopting query-aware indexing.

Evidence Ref: Abstract, Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Search latency (SwiftMem)	11 ms	Nemori 835 ms; Zep 522 ms	≈47× faster vs. 522ms example	LoCoMo	Table 2 reports 11 ms search latency for SwiftMem	Table 2
Total end-to-end latency (SwiftMem)	1,289 ms	FullContext 5,806 ms; RAG-4096 2,884 ms	≈4.5× faster than FullContext	LoCoMo	Table 2 shows total latency 1,289 ms	Table 2
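
The deltas in the table can be checked with simple division. The short script below uses only the latency figures listed in the rows above; the dictionary and function names are chosen for this example.

```python
# Latencies in milliseconds, copied from the Results rows above.
search_ms = {"SwiftMem": 11, "Zep": 522, "Nemori": 835}
total_ms = {"SwiftMem": 1289, "FullContext": 5806, "RAG-4096": 2884}


def speedup(baseline_ms, system_ms):
    """How many times faster the system is than the baseline."""
    return baseline_ms / system_ms


print(round(speedup(search_ms["Zep"], search_ms["SwiftMem"]), 1))         # 47.5
print(round(speedup(search_ms["Nemori"], search_ms["SwiftMem"]), 1))      # 75.9
print(round(speedup(total_ms["FullContext"], total_ms["SwiftMem"]), 1))   # 4.5
print(round(speedup(total_ms["RAG-4096"], total_ms["SwiftMem"]), 1))      # 2.2
```

The 47× headline corresponds to the 522 ms Zep figure, and the ≈4.5× end-to-end figure corresponds to FullContext. The same arithmetic on the other baselines in each row gives different ratios.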

What To Try In 7 Days

Add a lightweight temporal index for user timelines to speed time-based queries.

Tag episodes with a small set of topic tags and route queries to those tags first.

Batch a periodic embedding-tag consolidation to colocate semantically related vectors and measure latency change.
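
The tag-routing step above can be prototyped in a few lines. The sketch below assumes a small hand-written tag hierarchy and an inverted index from tag to episode ids; the tag names, the data, and the choice to expand a query tag to its descendants are illustrative assumptions, not details taken from the paper.

```python
from collections import deque

# Hypothetical tag DAG: parent tag -> more specific child tags.
TAG_CHILDREN = {
    "travel": ["flights", "hotels"],
    "health": ["fitness"],
}

# Hypothetical inverted index: tag -> ids of episodes carrying that tag.
TAG_TO_EPISODES = {
    "travel": {1},
    "flights": {2, 3},
    "hotels": {4},
    "fitness": {5},
}


def expand_tags(query_tags, children=TAG_CHILDREN):
    """Return the query tags plus all of their descendants (breadth-first)."""
    seen = set(query_tags)
    queue = deque(query_tags)
    while queue:
        tag = queue.popleft()
        for child in children.get(tag, []):
            if child not in seen:  # a DAG node may be reachable via two parents
                seen.add(child)
                queue.append(child)
    return seen


def candidate_episodes(query_tags, index=TAG_TO_EPISODES):
    """Union of episode ids under the expanded tags; only these get scored."""
    candidates = set()
    for tag in expand_tags(query_tags):
        candidates |= index.get(tag, set())
    return candidates


print(sorted(candidate_episodes(["travel"])))  # [1, 2, 3, 4]
```

An unknown tag yields an empty candidate set, which is the situation described under Failure Modes: a pilot would want to log how often that happens before trusting the routing.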

Agent Features

Memory

episodic memory (timestamped episodes)

semantic DAG-Tag index (hierarchical tags)

embedding index with co-consolidation

Tool Use

LLM-based tag generation (for semantic tags)

Frameworks

LLM-as-Judge evaluation

Is Agentic

Yes

Architectures

three-tier indexing (temporal, DAG-tag, embedding)

Optimization Features

Infra Optimization

reduced search CPU/GPU workload from smaller candidate sets

System Optimization

binary-searchable user timelines for O(log N_mem) queries

hierarchical tag expansion to narrow search

Inference Optimization

sub-linear retrieval via tag+temporal filtering

improved cache locality from co-consolidation
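
One way to picture the colocation idea is to reorder stored vectors so that every tag occupies one contiguous range, then search only that range. The sketch below is a simplified stand-in under stated assumptions: each episode carries exactly one tag, vectors are plain Python lists, and similarity is a dot product. The function names are hypothetical, and the paper's actual co-consolidation procedure is not described in this summary.

```python
def consolidate(records):
    """Reorder (episode_id, tag, vector) records so each tag is contiguous.

    Returns the reordered list and a map tag -> (start, end) slice bounds.
    """
    ordered = sorted(records, key=lambda r: r[1])  # stable sort by tag
    slices = {}
    for pos, (_, tag, _) in enumerate(ordered):
        start, _ = slices.get(tag, (pos, pos))
        slices[tag] = (start, pos + 1)
    return ordered, slices


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_tag(ordered, slices, tag, query_vec, k=1):
    """Score only the contiguous block for `tag`; return the top-k episode ids."""
    start, end = slices.get(tag, (0, 0))
    block = ordered[start:end]
    ranked = sorted(block, key=lambda r: dot(r[2], query_vec), reverse=True)
    return [r[0] for r in ranked[:k]]


records = [
    (1, "travel", [1.0, 0.0]),
    (2, "health", [0.0, 1.0]),
    (3, "travel", [0.6, 0.8]),
    (4, "health", [0.9, 0.1]),
]
ordered, slices = consolidate(records)
print(slices)  # {'health': (0, 2), 'travel': (2, 4)}
print(search_tag(ordered, slices, "travel", [0.0, 1.0]))  # [3]
```

Python lists of lists do not give real cache locality; in practice the reordered vectors would live in one contiguous numeric array, where scanning a tag's slice touches adjacent memory.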

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Relies on LLM-based tag generation; tag errors or inconsistent tags can harm routing.

Temporal index helps only when queries include or allow reliable time cues.

When Not To Use

When you must guarantee the highest possible semantic accuracy: FullContext scored higher on some LLM metrics.

When conversational memory is tiny and exhaustive search cost is negligible.

Failure Modes

Poor tag mapping causes relevant memories to be excluded from the candidate set.

Over-aggressive temporal filtering could omit relevant episodes for queries with vague time cues.

Core Entities

Models

GPT-4.1-mini (LLM-as-judge)

GPT-4o-mini (evaluation)

Metrics

LLM Score, F1, BLEU-1, Search latency (ms), Total latency (ms)

Datasets

LoCoMo, LongMemEval

Benchmarks

LoCoMo, LongMemEval

Context Entities

Models

Nemori, LangMem, Mem0, Zep, RAG-4096, FullContext
