ShardMemo: budgeted, scope-correct sharded memory using masked MoE routing

Overview

Decision Snapshot: Needs Validation

The paper shows consistent empirical gains across three benchmarks and provides an end-to-end design, supervised router training, and ablations; experiments are on a single-server setup and use matched stacks for fair comparisons.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Yang Zhao, Chengxiao Dai, Yue Xiu, Mengying Kou, Yuliang Zheng, Dusit Niyato

Why It Matters For Business

ShardMemo reduces retrieval cost and tail latency while improving accuracy for agent workflows, making LLM-based agents faster and more reliable under budgeted memory access.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

ShardMemo is a tiered memory service for agentic LLMs that splits memory into per-agent working state (Tier A), sharded evidence with shard-local ANN indexes and masked Mixture-of-Experts routing (Tier B), and a versioned skill library (Tier C). It enforces eligibility masking before routing, trains a supervised router from evidence→shard labels, and uses cost-aware gating plus TopB/TopP probe caps. On LoCoMo, HotpotQA, and ToolBench, ShardMemo improves answer quality while reducing vectors scanned and tail latency under matched budgets.

Problem Statement

Agentic LLMs need scalable, budgeted memory access that respects scope/permissions and keeps latency bounded. Centralized indexes and heuristic shard routing either become bottlenecks under concurrent access or miss evidence when probing too few shards. The challenge is to select a small set of shards per request (under a probe budget) while keeping high evidence coverage and low cost.

Main Contribution

ShardMemo: a tiered memory service separating per-agent working state (Tier A), sharded evidence with shard-local ANN (Tier B), and a versioned skill library (Tier C).

Scope-before-routing masked MoE router that masks ineligible shards and probes up to a capped budget using TopB or adaptive TopP with optional cost bias.
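The scope-before-routing idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `masked_route`, the linear cost penalty `cost_weight * cost`, and the example scores are all assumptions made for the sketch. Ineligible shards receive a logit of negative infinity before the softmax, so they get zero probability and can never consume part of the probe budget.

```python
import math

def masked_route(scores, eligible, costs=None, cost_weight=0.0, b_probe=3):
    """Pick up to b_probe eligible shards (TopB) after masking.

    scores, costs: one float per shard (hypothetical router outputs and costs).
    eligible: one bool per shard; False means out of scope for this request.
    """
    costs = costs or [0.0] * len(scores)
    # Scope before routing: ineligible shards are masked to -inf.
    # The optional cost bias lowers the logit of expensive shards (assumed form).
    logits = [
        s - cost_weight * c if ok else -math.inf
        for s, c, ok in zip(scores, costs, eligible)
    ]
    finite = [x for x in logits if x != -math.inf]
    if not finite:
        return [], [0.0] * len(scores)
    m = max(finite)  # subtract the max for numerical stability
    exps = [math.exp(x - m) if x != -math.inf else 0.0 for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    ranked = sorted(
        (i for i, ok in enumerate(eligible) if ok), key=lambda i: -probs[i]
    )
    return ranked[:b_probe], probs

scores = [2.0, 1.5, 3.0, 0.5]
eligible = [True, True, False, True]  # shard 2 scores highest but is out of scope
chosen, probs = masked_route(scores, eligible, b_probe=2)
biased, _ = masked_route(
    scores, eligible, costs=[5.0, 0.0, 0.0, 0.0], cost_weight=1.0, b_probe=2
)
print(chosen, biased)
```

In this toy input the highest-scoring shard is masked out, so the budget goes to shards 0 and 1; with the cost penalty switched on, the expensive shard 0 drops out in favor of shard 3.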

Key Findings

ShardMemo improves LoCoMo QA F1 over the strongest baseline (GAM).

Numbers: Single-hop F1 64.08 vs 58.38 (+5.70) (Table 1)

Practical Use: Switching to masked MoE routing with eligibility masking can raise answer F1 by ~5–7 points on long-horizon conversational memory workloads.

Evidence Ref: Table 1

Under a fixed probe budget (B_probe=3), ShardMemo reduces retrieval work and tail latency while increasing accuracy.

Numbers: VecScan 414 vs 521 (-20.5%); p95 76 ms vs 95 ms; F1 +6.87 (Table 4)

Practical Use: You can lower vectors scanned and p95 latency while improving accuracy by enforcing eligibility masks and training a router instead of cosine-to-prototype routing.

Evidence Ref: Table 4
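The reported deltas can be re-derived from the raw figures quoted in these findings. The short check below uses only numbers stated on this page; the helper name `relative_change` is ours, and the relative p95 change is computed here rather than quoted from the paper.

```python
def relative_change(new, baseline):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0

vecscan_pct = relative_change(414, 521)  # vectors scanned, ShardMemo vs baseline
p95_pct = relative_change(76, 95)        # p95 latency in ms (derived, not quoted)
single_hop_gain = 64.08 - 58.38          # absolute F1 points on LoCoMo single-hop

print(f"VecScan {vecscan_pct:.1f}%, p95 {p95_pct:.1f}%, F1 +{single_hop_gain:.2f}")
```

Note that the F1 gains are absolute point differences, while the VecScan figure is a relative reduction, so the two should not be compared directly.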

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LoCoMo Single-hop F1 (GPT-OSS-120B)	64.08 (ShardMemo)	58.38 (GAM)	+5.70	LoCoMo Single Hop	Table 1 reports single-hop F1 for ShardMemo vs GAM	Table 1
LoCoMo Multi-hop F1 (GPT-OSS-120B)	46.28 (ShardMemo)	41.17 (GAM)	+5.11	LoCoMo Multi Hop	Table 1 reports multi-hop F1 improvements	Table 1

What To Try In 7 Days

Enforce scope-before-routing in your retrieval layer to avoid probing ineligible shards.

Measure VecScan and p95 and compare against a simple cosine-to-prototype router under a fixed probe cap.

If you have evidence→shard labels, train a small masked MoE router and evaluate hit-rate and cost.
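A small harness along the following lines may be enough for the measurements suggested above. It is a sketch under assumptions: the log record fields (`vecscan`, `latency_ms`, `probed`, `gold_shards`) are hypothetical, p95 uses the nearest-rank definition, and a request counts as a hit when at least one labeled shard is among the first `b_probe` probed shards, which may differ from the paper's exact ShardHit definition.

```python
def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]

def summarize(records, b_probe):
    """Aggregate per-request logs into the three headline numbers."""
    n = len(records)
    hits = sum(
        1 for r in records
        if set(r["probed"][:b_probe]) & set(r["gold_shards"])
    )
    return {
        "mean_vecscan": sum(r["vecscan"] for r in records) / n,
        "p95_latency_ms": p95([r["latency_ms"] for r in records]),
        "shard_hit_rate": hits / n,
    }

# Synthetic log records, for illustration only.
records = [
    {
        "vecscan": 400 + 10 * i,
        "latency_ms": 50 + i,
        "probed": [i % 4, (i + 1) % 4, (i + 2) % 4],
        "gold_shards": [3],
    }
    for i in range(20)
]
stats = summarize(records, b_probe=3)
print(stats)
```

Running the same harness over logs from the cosine-to-prototype router and the trained router, with the same probe cap, gives a like-for-like comparison.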

Agent Features

Memory

Short-term per-agent working memory (Tier A)

Sharded evidence memory with eligibility masks (Tier B)

Versioned procedural skills (Tier C)

Tool Use

Versioned skill library (Tier C) for reusable procedures

Frameworks

TopB probe, Adaptive TopP, Cost-aware gating

Is Agentic

Yes

Architectures

Tiered memory (A/B/C)

Masked MoE routing (shard-as-expert)

Shard-local ANN indexes

Optimization Features

Token Efficiency

Merges shard-local candidates to return TopK evidence; reduces vectors scanned under budget
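The merge step can be illustrated as follows. This is a hedged sketch: the result layout (a dict from shard id to `(score, doc_id)` pairs, higher score meaning more similar) and the name `merge_topk` are assumptions, and a real system would also need scores that are comparable across shards.

```python
import heapq

def merge_topk(shard_results, k):
    """Merge per-shard candidate lists into one global TopK.

    shard_results: dict mapping shard_id -> list of (score, doc_id).
    Returns (score, doc_id, shard_id) tuples, best first.
    """
    merged = (
        (score, doc_id, shard_id)
        for shard_id, hits in shard_results.items()
        for score, doc_id in hits
    )
    return heapq.nlargest(k, merged, key=lambda t: t[0])

results = {
    "s1": [(0.9, "a"), (0.4, "b")],
    "s2": [(0.8, "c"), (0.7, "d")],
}
top3 = merge_topk(results, k=3)
print(top3)
```

Because only the probed shards contribute candidates, the size of the merge input is bounded by the probe budget times the per-shard result count.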

Infra Optimization

Sharded storage and shard-local ANN indexes (parallel search, localized work)

System Optimization

Scope-before-routing eligibility masks reduce wasted retrieval work

Shard-local ANN reduces per-query index work and improves parallelism

Training Optimization

Supervised router training from evidence→shard labels (multi-positive set-likelihood objective)
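One plausible reading of a multi-positive set-likelihood objective is the negative log of the total probability the router assigns to the set of shards that hold the evidence, computed over eligible shards only. The sketch below implements that reading; it is an assumption about the form of the loss, not the paper's exact definition, and the function names are hypothetical.

```python
import math

def _logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def set_likelihood_loss(logits, eligible, positives):
    """-log( sum of masked-softmax probability over the positive shard set ).

    logits: router scores, one per shard.
    eligible: bools; ineligible shards are excluded from the softmax.
    positives: indices of shards labeled as containing the evidence.
    """
    elig_logits = [x for x, ok in zip(logits, eligible) if ok]
    pos_logits = [logits[i] for i in positives if eligible[i]]
    if not pos_logits:
        raise ValueError("no eligible positive shard for this example")
    # log Z over eligible shards minus log mass of the positive set
    return _logsumexp(elig_logits) - _logsumexp(pos_logits)

uniform = set_likelihood_loss([0.0, 0.0, 0.0, 0.0], [True] * 4, [0, 1])
masked = set_likelihood_loss([0.0, 0.0, 0.0, 0.0], [True, True, False, False], [0, 1])
print(round(uniform, 4), round(masked, 4))
```

With uniform logits over four eligible shards and two positives, the positive set holds half the mass, so the loss is log 2; once the two negatives are masked out, the positives hold all the mass and the loss is zero.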

Inference Optimization

Budgeted probe cap (B_probe) to limit shards probed

Adaptive TopP to trade probability mass vs probe count

Cost-aware gating to favor low-cost shard families
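Adaptive TopP under a probe cap can be sketched as below. The threshold name `p_target`, its default value, and the stopping rule (stop once cumulative router probability reaches the target or the cap is hit) are assumptions for illustration; the input is taken to be router probabilities in which ineligible shards already have probability zero.

```python
def adaptive_top_p(probs, p_target=0.9, b_probe=3):
    """Probe shards in probability order until p_target mass or b_probe shards."""
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    chosen, mass = [], 0.0
    for i in ranked:
        # Never probe zero-probability (masked) shards, never exceed the cap.
        if probs[i] <= 0.0 or len(chosen) >= b_probe:
            break
        chosen.append(i)
        mass += probs[i]
        if mass >= p_target:
            break
    return chosen, mass

peaked, _ = adaptive_top_p([0.95, 0.03, 0.02, 0.0])
spread, _ = adaptive_top_p([0.7, 0.25, 0.05, 0.0])
flat, flat_mass = adaptive_top_p([0.25, 0.25, 0.25, 0.25])
print(peaked, spread, flat, flat_mass)
```

A confident router probes a single shard, a less certain one probes two, and a flat distribution is cut off by the cap at three shards with only 0.75 of the mass covered.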

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Experiments run on a single controlled server; behavior at large multi-node scale is untested.

Router training needs evidence→shard labels; untrained masked MoE performs poorly.

When Not To Use

Very small corpora where sharding overhead outweighs benefits.

When you cannot obtain evidence→shard supervision and cannot tolerate untrained routers.

Failure Modes

Untrained masked MoE routing yields lower accuracy and higher cost (shown in ablation).

Filtering ineligible shards only after retrieval reduces ShardHit and increases cost.
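The second failure mode is easy to reproduce on toy data. In this illustrative sketch (scores, eligibility flags, and the helper `top_b` are all made up), selecting the top shards first and filtering afterwards spends part of the probe budget on shards whose results must be discarded, while masking first keeps every probe useful.

```python
def top_b(scores, b, allowed=None):
    """Indices of the b highest-scoring shards, optionally restricted to allowed."""
    ids = [i for i in range(len(scores)) if allowed is None or allowed[i]]
    return sorted(ids, key=lambda i: -scores[i])[:b]

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
eligible = [False, False, True, True, True]
b_probe = 3

# Mask first, then route: all three probes land on eligible shards.
pre_mask = top_b(scores, b_probe, eligible)

# Route first, filter later: two of the three probes are thrown away.
post_filter = [i for i in top_b(scores, b_probe) if eligible[i]]
wasted = b_probe - len(post_filter)

print(pre_mask, post_filter, wasted)
```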

Core Entities

Models

GPT-OSS-120B

Metrics

F1, BLEU-1, Precision@R, StepRed, VecScan, p95 latency, ShardHit@B_probe

Datasets

LoCoMo, HotpotQA, ToolBench

Benchmarks

LoCoMo, HotpotQA, ToolBench
