ShardMemo: budgeted, scope-correct sharded memory using masked MoE routing

January 29, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper reports consistent empirical gains on three benchmarks and provides an end-to-end design, supervised router training, and ablations. All experiments run on a single server and use matched stacks for fair comparison, so behavior at multi-node scale remains unvalidated.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Yang Zhao, Chengxiao Dai, Yue Xiu, Mengying Kou, Yuliang Zheng, Dusit Niyato

Why It Matters For Business

ShardMemo reduces retrieval cost and tail latency while improving accuracy for agent workflows, making LLM-based agents faster and more reliable under budgeted memory access.

Summary TLDR

ShardMemo is a tiered memory service for agentic LLMs that splits memory into per-agent working state (Tier A), sharded evidence with shard-local ANN indexes and masked Mixture-of-Experts routing (Tier B), and a versioned skill library (Tier C). It enforces eligibility masking before routing, trains a supervised router from evidence→shard labels, and uses cost-aware gating plus TopB/TopP probe caps. On LoCoMo, HotpotQA, and ToolBench, ShardMemo improves answer quality while reducing vectors scanned and tail latency under matched budgets.

Problem Statement

Agentic LLMs need scalable, budgeted memory access that respects scope/permissions and keeps latency bounded. Centralized indexes and heuristic shard routing either become bottlenecks under concurrent access or miss evidence when probing too few shards. The challenge is to select a small set of shards per request (under a probe budget) while keeping high evidence coverage and low cost.

Main Contribution

ShardMemo: a tiered memory service separating per-agent working state (Tier A), sharded evidence with shard-local ANN (Tier B), and a versioned skill library (Tier C).

Scope-before-routing masked MoE router that masks ineligible shards and probes up to a capped budget using TopB or adaptive TopP with optional cost bias.
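The scope-before-routing step can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `route`, the dictionary-based inputs, and the linear cost penalty `lam * cost` are assumptions made for illustration. It shows the order of operations described above: mask ineligible shards first, apply an optional cost bias, normalize over the eligible shards only, then select under the probe cap with either TopB or adaptive TopP.

```python
import math

def route(logits, eligible, b_probe, costs=None, lam=0.0, top_p=None):
    """Pick up to b_probe shard ids from masked, cost-biased router scores.

    logits   : dict shard_id -> raw router score (hypothetical router output)
    eligible : set of shard ids the request is allowed to read
    costs    : optional dict shard_id -> relative probe cost (assumed form)
    lam      : strength of the cost bias (0 disables it)
    top_p    : if set, stop once cumulative probability reaches top_p
    """
    # Scope-before-routing: ineligible shards never receive a score.
    scored = {}
    for s, z in logits.items():
        if s in eligible:
            scored[s] = z - lam * (costs or {}).get(s, 0.0)
    if not scored or b_probe <= 0:
        return []
    # Softmax over the eligible shards only (max-shifted for stability).
    m = max(scored.values())
    exp = {s: math.exp(z - m) for s, z in scored.items()}
    total = sum(exp.values())
    ranked = sorted(((e / total, s) for s, e in exp.items()), reverse=True)
    chosen, mass = [], 0.0
    for p, s in ranked:
        if len(chosen) >= b_probe:  # hard probe cap (TopB)
            break
        chosen.append(s)
        mass += p
        if top_p is not None and mass >= top_p:  # adaptive early stop (TopP)
            break
    return chosen
```

With `top_p=None` the function behaves as plain TopB. With `top_p` set, it may probe fewer than `b_probe` shards when the router is confident, but never more.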

Key Findings

ShardMemo improves LoCoMo QA F1 over the strongest baseline (GAM).

Numbers: Single-hop F1 64.08 vs 58.38 (+5.70) (Table 1)

Practical Use: Switching to masked MoE routing with eligibility masking can raise answer F1 by ~5–7 points on long-horizon conversational memory workloads.

Evidence Ref: Table 1

Under a fixed probe budget (B_probe=3), ShardMemo reduces retrieval work and tail latency while increasing accuracy.

Numbers: VecScan 414 vs 521 (-20.5%); p95 76 ms vs 95 ms; F1 +6.87 (Table 4)

Practical Use: You can lower vectors scanned and p95 latency while improving accuracy by enforcing eligibility masks and training a router instead of cosine-to-prototype routing.

Evidence Ref: Table 4
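As a quick check on the Table 4 figures, the relative changes can be recomputed from the raw values quoted above. The helper `rel_change` is a hypothetical name. The VecScan percentage matches the reported -20.5%, while the p95 percentage is derived here from the two latency values and is not quoted from the paper.

```python
def rel_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return 100.0 * (new - old) / old

# Values quoted from Table 4: ShardMemo vs the baseline at B_probe=3.
vecscan = rel_change(414, 521)
p95 = rel_change(76, 95)

print(f"VecScan: {vecscan:.1f}%")      # VecScan: -20.5%
print(f"p95 latency: {p95:.1f}%")      # p95 latency: -20.0%
```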

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
LoCoMo Single-hop F1 (GPT-OSS-120B) | 64.08 (ShardMemo) | 58.38 (GAM) | +5.70 | LoCoMo Single Hop | Table 1 reports single-hop F1 for ShardMemo vs GAM | Table 1
LoCoMo Multi-hop F1 (GPT-OSS-120B) | 46.28 (ShardMemo) | 41.17 (GAM) | +5.11 | LoCoMo Multi Hop | Table 1 reports multi-hop F1 improvements | Table 1

What To Try In 7 Days

Enforce scope-before-routing in your retrieval layer to avoid probing ineligible shards.

Measure VecScan and p95 and compare against a simple cosine-to-prototype router under a fixed probe cap.

If you have evidence→shard labels, train a small masked MoE router and evaluate hit-rate and cost.
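The comparison suggested in these steps can be set up with a small harness like the one below. It is a sketch under stated assumptions: the hit metric is taken to mean "at least one gold shard was probed" (the paper's exact ShardHit definition is not reproduced here), vectors scanned are approximated by the size of each probed shard, and the `router` callable, the query dictionaries, and `shard_sizes` are hypothetical interfaces.

```python
def evaluate(router, queries, shard_sizes, b_probe):
    """Return (shard_hit_rate, mean_vectors_scanned) for a router.

    router      : callable(query_text, b_probe) -> list of shard ids to probe
    queries     : list of dicts with "text" and "gold_shards" (a set of ids)
    shard_sizes : dict shard_id -> vectors scanned when that shard is probed
                  (a brute-force proxy; a real ANN index scans fewer)
    """
    if not queries:
        return 0.0, 0.0
    hits, scanned = 0, 0
    for q in queries:
        probed = router(q["text"], b_probe)[:b_probe]  # enforce the probe cap
        if q["gold_shards"] & set(probed):
            hits += 1
        scanned += sum(shard_sizes[s] for s in probed)
    n = len(queries)
    return hits / n, scanned / n
```

Running `evaluate` once with a cosine-to-prototype router and once with a trained router, at the same `b_probe`, gives a like-for-like view of hit rate and scan cost. Tail latency has to be measured separately against the real index.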

Agent Features

Memory: Short-term per-agent working memory (Tier A); sharded evidence memory with eligibility masks (Tier B); versioned procedural skills (Tier C)
Tool Use: Versioned skill library (Tier C) for reusable procedures
Frameworks: TopB probe; adaptive TopP; cost-aware gating
Is Agentic: Yes
Architectures: Tiered memory (A/B/C); masked MoE routing (shard-as-expert); shard-local ANN indexes

Optimization Features

Token Efficiency: Merges shard-local candidates to return TopK evidence and reduces vectors scanned under budget
Infra Optimization: Sharded storage and shard-local ANN indexes (parallel search, localized work)
System Optimization: Scope-before-routing eligibility masks reduce wasted retrieval work; shard-local ANN reduces per-query index work and improves parallelism
Training Optimization: Supervised router training from evidence→shard labels (multi-positive set-likelihood objective)
Inference Optimization: Budgeted probe cap (B_probe) to limit shards probed; adaptive TopP to trade probability mass vs probe count; cost-aware gating to favor low-cost shard families
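The multi-positive set-likelihood objective is named in the summary but its formula is not given here. One common reading, assumed in the sketch below, is the negative log of the total probability mass the router assigns to the set of positive shards, with the softmax taken over eligible shards only. The function name and dictionary inputs are hypothetical.

```python
import math

def set_likelihood_loss(logits, eligible, positives):
    """-log of the probability mass placed on the positive shards.

    logits    : dict shard_id -> router score
    eligible  : set of shard ids in scope for this request
    positives : set of shard ids that hold the labeled evidence
    """
    elig = [s for s in logits if s in eligible]
    pos = [s for s in elig if s in positives]
    if not pos:
        raise ValueError("no eligible positive shard for this example")
    # Softmax restricted to eligible shards, max-shifted for stability.
    m = max(logits[s] for s in elig)
    denom = sum(math.exp(logits[s] - m) for s in elig)
    numer = sum(math.exp(logits[s] - m) for s in pos)
    return -math.log(numer / denom)
```

Under this reading the loss is zero when all eligible mass sits on positive shards, and the router is not penalized for how it splits mass among several correct shards.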

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments run on a single controlled server; behavior at large multi-node scale is untested.

Router training needs evidence→shard labels; untrained masked MoE performs poorly.

When Not To Use

Very small corpora where sharding overhead outweighs benefits.

When you cannot obtain evidence→shard supervision and cannot tolerate untrained routers.

Failure Modes

Untrained masked MoE routing yields lower accuracy and higher cost (shown in ablation).

Filtering ineligible shards only after retrieval reduces ShardHit and increases cost.
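The second failure mode can be reproduced on a toy example. The two functions below are illustrative stand-ins, not the paper's code: one applies the eligibility mask before ranking, the other ranks all shards and filters afterwards. Under the same probe budget, the filter-after variant spends probes on shards whose results are then discarded.

```python
def mask_then_route(scores, eligible, b):
    """Rank only eligible shards, then take the top b."""
    ranked = sorted((s for s in scores if s in eligible),
                    key=scores.get, reverse=True)
    return ranked[:b]

def route_then_filter(scores, eligible, b):
    """Take the top b over all shards, then drop ineligible ones."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:b]
    return [s for s in ranked if s in eligible]

# Toy scores: the two highest-scoring shards are out of scope.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
eligible = {"c", "d"}

print(mask_then_route(scores, eligible, 2))    # ['c', 'd']
print(route_then_filter(scores, eligible, 2))  # []
```

In the filter-after case both probes went to out-of-scope shards, so the request pays the retrieval cost and returns no usable evidence.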

Core Entities

Models

GPT-OSS-120B

Metrics

F1, BLEU-1, Precision@R, StepRed, VecScan, p95 latency, ShardHit@B_probe

Datasets

LoCoMo, HotpotQA, ToolBench

Benchmarks

LoCoMo, HotpotQA, ToolBench