ShardMemo: budgeted, scope-correct sharded memory using masked MoE routing

January 29, 2026 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Yang Zhao, Chengxiao Dai, Yue Xiu, Mengying Kou, Yuliang Zheng, Dusit Niyato

Why It Matters For Business

ShardMemo reduces retrieval cost and tail latency while improving accuracy for agent workflows, making LLM-based agents faster and more reliable under budgeted memory access.

Summary TLDR

ShardMemo is a tiered memory service for agentic LLMs that splits memory into per-agent working state (Tier A), sharded evidence with shard-local ANN indexes and masked Mixture-of-Experts routing (Tier B), and a versioned skill library (Tier C). It enforces eligibility masking before routing, trains a supervised router from evidence→shard labels, and uses cost-aware gating plus TopB/TopP probe caps. On LoCoMo, HotpotQA, and ToolBench, ShardMemo improves answer quality while reducing vectors scanned and tail latency under matched budgets.

Problem Statement

Agentic LLMs need scalable, budgeted memory access that respects scope/permissions and keeps latency bounded. Centralized indexes and heuristic shard routing either become bottlenecks under concurrent access or miss evidence when probing too few shards. The challenge is to select a small set of shards per request (under a probe budget) while keeping high evidence coverage and low cost.

Main Contribution

ShardMemo: a tiered memory service separating per-agent working state (Tier A), sharded evidence with shard-local ANN (Tier B), and a versioned skill library (Tier C).

Scope-before-routing masked MoE router that masks ineligible shards and probes up to a capped budget using TopB or adaptive TopP with optional cost bias.
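
The routing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `masked_route`, the additive `cost_weight * cost` penalty, and the rule of stopping TopP once cumulative probability reaches `top_p` are all hypothetical choices made for the example.

```python
import math

def masked_route(scores, eligible, costs=None, cost_weight=0.0,
                 b_probe=3, top_p=None):
    """Pick shards to probe: mask first, then softmax, then TopB or capped TopP.

    All names and the exact cost-bias form are illustrative assumptions.
    """
    costs = costs or [0.0] * len(scores)
    # Scope-before-routing: ineligible shards never receive a logit,
    # so they cannot take probability mass or a probe slot.
    logits = {i: s - cost_weight * costs[i]
              for i, s in enumerate(scores) if eligible[i]}
    if not logits:
        return [], {}
    # Numerically stable softmax over eligible shards only.
    m = max(logits.values())
    exp = {i: math.exp(v - m) for i, v in logits.items()}
    z = sum(exp.values())
    probs = {i: e / z for i, e in exp.items()}
    ranked = sorted(probs, key=probs.get, reverse=True)
    if top_p is None:
        # TopB: take the b_probe most probable eligible shards.
        return ranked[:b_probe], probs
    # Adaptive TopP: stop early once enough probability mass is covered,
    # but never exceed the probe cap.
    chosen, mass = [], 0.0
    for i in ranked[:b_probe]:
        chosen.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return chosen, probs

scores = [2.0, 1.5, 3.0, 0.1, 1.9]             # hypothetical router scores
eligible = [True, True, False, True, True]     # shard 2 is out of scope
print(masked_route(scores, eligible, b_probe=2)[0])              # TopB
print(masked_route(scores, eligible, b_probe=3, top_p=0.3)[0])   # TopP
```

In this toy input, shard 2 has the highest raw score but is never probed because the mask is applied before the softmax. TopP probes fewer shards than the cap whenever the router is confident.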

A supervised router-training protocol (evidence→shard labels) and evaluation showing gains in task F1 and reduced retrieval work and tail latency on LoCoMo, HotpotQA, and ToolBench.

Key Findings

ShardMemo improves LoCoMo QA F1 over the strongest baseline (GAM).

Numbers: Single-hop F1 64.08 vs 58.38 (+5.70) (Table 1)

Under a fixed probe budget (B_probe=3), ShardMemo reduces retrieval work and tail latency while increasing accuracy.

Numbers: VecScan 414 vs 521 (-20.5%); p95 76 ms vs 95 ms; F1 +6.87 (Table 4)
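
The relative reductions behind these Table 4 figures follow from simple arithmetic. The short check below uses only the values quoted here. The helper name `relative_change` is hypothetical, and the p95 percentage is derived from the quoted latencies rather than reported in the source.

```python
def relative_change(new, old):
    """Signed percentage change from old to new."""
    return 100.0 * (new - old) / old

# ShardMemo vs the baseline router at B_probe=3 (values as quoted from Table 4).
vecscan = relative_change(414, 521)
p95 = relative_change(76, 95)
print(f"VecScan: {vecscan:.1f}%  p95: {p95:.1f}%")
# prints "VecScan: -20.5%  p95: -20.0%"
```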

Tier C skill retrieval yields higher usable-skill precision and step reduction on ToolBench.

Numbers: Precision@3 0.97 vs 0.88 (+10.2%); StepRed 1.94 vs 1.81 (+7.2%) (Table 3)

The masked MoE router needs supervision: without training, masked routing yields lower F1 and scans more vectors.

Numbers: Untrained MoE F1 39.82 and VecScan 627 vs trained F1 54.21 and VecScan 414 (Table 4)

Results

LoCoMo Single-hop F1 (GPT-OSS-120B)

Value: 64.08 (ShardMemo)

Baseline: 58.38 (GAM)

LoCoMo Multi-hop F1 (GPT-OSS-120B)

Value: 46.28 (ShardMemo)

Baseline: 41.17 (GAM)

HotpotQA F1 (56K / 224K / 448K tokens)

Value: 63.41 / 61.88 / 57.95 (ShardMemo)

Baseline: 62.10 / 60.92 / 57.40 (GAM)

ToolBench Precision@3

Value: 0.97 (Tier C)

Baseline: 0.88 (Embedding similarity)

VecScan and p95 at B_probe=3

Value: VecScan 414; p95 76 ms (ShardMemo)

Baseline: VecScan 521; p95 95 ms (cosine-to-prototype routing)

Who Should Care

What To Try In 7 Days

Enforce scope-before-routing in your retrieval layer to avoid probing ineligible shards.

Measure VecScan and p95 and compare against a simple cosine-to-prototype router under a fixed probe cap.

If you have evidence→shard labels, train a small masked MoE router and evaluate hit-rate and cost.
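
For the measurement step, a small harness along these lines may be enough to get started. It assumes you already log, per query, the number of vectors scanned and the end-to-end retrieval latency. The record layout and the function names `p95` and `summarize` are hypothetical, and the nearest-rank percentile is one of several reasonable definitions.

```python
def p95(values):
    """Nearest-rank 95th percentile (one common definition among several)."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n avoids floating-point edge cases.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def summarize(records):
    """records: list of (vectors_scanned, latency_ms) tuples, one per query."""
    scans = [r[0] for r in records]
    latencies = [r[1] for r in records]
    return {
        "mean_vecscan": sum(scans) / len(scans),
        "p95_ms": p95(latencies),
    }

# Hypothetical logs for two routers run under the same probe cap.
baseline = [(500 + i, 60 + 2 * i) for i in range(20)]
candidate = [(400 + i, 50 + i) for i in range(20)]
for name, logs in (("baseline", baseline), ("candidate", candidate)):
    print(name, summarize(logs))
```

Comparing both routers on the same query set and the same probe cap keeps the comparison matched, which is how the paper's Table 4 figures are framed.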

Agent Features

Memory

  • Short-term per-agent working memory (Tier A)
  • Sharded evidence memory with eligibility masks (Tier B)
  • Versioned procedural skills (Tier C)

Tool Use

  • Versioned skill library (Tier C) for reusable procedures

Frameworks

  • TopB probe
  • Adaptive TopP
  • Cost-aware gating

Is Agentic

true

Architectures

  • Tiered memory (A/B/C)
  • Masked MoE routing (shard-as-expert)
  • Shard-local ANN indexes

Optimization Features

Token Efficiency

  • Merges shard-local candidates to return TopK evidence; reduces vectors scanned under budget
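
A minimal sketch of the merge step, assuming each probed shard returns `(score, doc_id)` pairs where a higher score means a better match. The function name `merge_topk` and the data layout are hypothetical. The source does not specify how scores from different shards are made comparable, so this example simply assumes they are.

```python
import heapq

def merge_topk(shard_results, k):
    """Merge per-shard candidate lists into a global TopK.

    shard_results: {shard_id: [(score, doc_id), ...]}, higher score = better.
    Returns a list of (score, doc_id, shard_id), best first.
    """
    candidates = ((score, doc_id, shard_id)
                  for shard_id, hits in shard_results.items()
                  for score, doc_id in hits)
    # nlargest keeps only k items in memory while scanning all candidates.
    return heapq.nlargest(k, candidates, key=lambda c: c[0])

results = {
    "s1": [(0.9, "a"), (0.4, "b")],   # hypothetical shard-local ANN hits
    "s2": [(0.8, "c"), (0.7, "d")],
}
print(merge_topk(results, k=3))
```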

Infra Optimization

  • Sharded storage and shard-local ANN indexes (parallel search, localized work)

System Optimization

  • Scope-before-routing eligibility masks reduce wasted retrieval work
  • Shard-local ANN reduces per-query index work and improves parallelism

Training Optimization

  • Supervised router training from evidence→shard labels (multi-positive set-likelihood objective)
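
One plausible reading of a multi-positive set-likelihood objective is the negative log of the total router probability assigned to the set of shards that hold gold evidence, computed over eligible shards only. The sketch below implements that reading. It is an assumption for illustration, and the paper's exact objective may differ. The function name `set_likelihood_loss` is hypothetical.

```python
import math

def set_likelihood_loss(logits, eligible, positives):
    """-log( sum of masked-softmax probability over positive shards ).

    logits:    router scores, one per shard
    eligible:  booleans, the scope mask applied before the softmax
    positives: set of shard indices that contain gold evidence
    """
    idx = [i for i, ok in enumerate(eligible) if ok]
    pos = [i for i in idx if i in positives]
    if not pos:
        raise ValueError("no eligible positive shard for this query")
    # log-sum-exp with a shared max for numerical stability.
    m = max(logits[i] for i in idx)
    log_z = m + math.log(sum(math.exp(logits[i] - m) for i in idx))
    log_pos = m + math.log(sum(math.exp(logits[i] - m) for i in pos))
    return log_z - log_pos

# Uniform logits over four eligible shards, two of them positive:
# half of the probability mass is on the positive set.
print(set_likelihood_loss([0.0, 0.0, 0.0, 0.0], [True] * 4, {0, 1}))
```

Under this form the loss is zero when all probability mass lies on positive shards, and the router is not penalized for how it splits mass among several positives.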

Inference Optimization

  • Budgeted probe cap (B_probe) to limit shards probed
  • Adaptive TopP to trade probability mass vs probe count
  • Cost-aware gating to favor low-cost shard families

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments run on a single controlled server; behavior at large multi-node scale is untested.
  • Router training needs evidence→shard labels; untrained masked MoE performs poorly.
  • Shard-map updates and dynamic rebalancing are discussed but experiments use a fixed shard map.
  • The paper focuses on retrieval and does not address the end-to-end cost of large-scale deployment (network, multi-tenant contention).

When Not To Use

  • Very small corpora where sharding overhead outweighs benefits.
  • When you cannot obtain evidence→shard supervision and cannot tolerate untrained routers.
  • Ultra-low-latency appliances where any additional routing step is unacceptable.

Failure Modes

  • Untrained masked MoE routing yields lower accuracy and higher cost (shown in ablation).
  • Filtering ineligible shards only after retrieval reduces ShardHit and increases cost.
  • Misestimated cost bias could over- or under-prioritize shards, hurting hit-rate or cost.
  • Workload drift can force conservative probing or reduce coverage if shard summaries are stale.
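
The second failure mode can be illustrated with a toy comparison. The scores, eligibility flags, and function names below are hypothetical. The point is only that when filtering happens after the probe set is chosen, ineligible shards consume probe slots that eligible shards could have used.

```python
def probe_mask_first(scores, eligible, b_probe):
    """Scope-before-routing: rank only eligible shards, then apply the cap."""
    ranked = sorted((i for i in range(len(scores)) if eligible[i]),
                    key=lambda i: scores[i], reverse=True)
    return ranked[:b_probe]

def probe_filter_after(scores, eligible, b_probe):
    """Rank all shards, apply the cap, and only then drop ineligible ones."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:b_probe] if eligible[i]]

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
eligible = [False, False, True, True, True]   # shards 0 and 1 are out of scope
gold_shard = 3                                # where the evidence actually lives

print(probe_mask_first(scores, eligible, b_probe=3))    # [2, 3, 4]
print(probe_filter_after(scores, eligible, b_probe=3))  # [2]
```

With the mask applied first, the gold shard is probed. With post-filtering, two of the three probes are spent on shards whose results must be discarded, and the gold shard is missed.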

Core Entities

Models

  • GPT-OSS-120B

Metrics

  • F1
  • BLEU-1
  • Precision@R
  • StepRed
  • VecScan
  • p95 latency
  • ShardHit@B_probe
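
The source does not define ShardHit@B_probe. A natural reading, assumed here, is the fraction of queries for which at least one evidence-bearing shard is among the shards probed under the cap. The function name `shard_hit_rate` and the input layout are hypothetical.

```python
def shard_hit_rate(probed, gold):
    """Fraction of queries whose probed shards include at least one gold shard.

    probed: list of shard-id lists, one per query (each already capped at B_probe)
    gold:   list of sets of shard ids that hold evidence, one per query
    """
    if len(probed) != len(gold):
        raise ValueError("probed and gold must have one entry per query")
    if not probed:
        raise ValueError("need at least one query")
    hits = sum(1 for p, g in zip(probed, gold) if set(p) & set(g))
    return hits / len(probed)

probed = [[0, 1, 2], [3, 4, 5], [1, 2, 3]]   # hypothetical probe sets, B_probe=3
gold = [{2}, {0}, {3, 9}]
print(shard_hit_rate(probed, gold))
```

A stricter variant would require all gold shards to be probed, which may matter for multi-hop questions whose evidence spans several shards.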

Datasets

  • LoCoMo
  • HotpotQA
  • ToolBench

Benchmarks

  • LoCoMo
  • HotpotQA
  • ToolBench