Overview
The paper reports consistent empirical gains across three benchmarks and provides an end-to-end design, supervised router training, and ablations. Experiments run on a single-server setup and use matched stacks for fair comparison.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.78
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
ShardMemo reduces retrieval cost and tail latency while improving accuracy for agent workflows, making LLM-based agents faster and more reliable under budgeted memory access.
Who Should Care
Summary TLDR
ShardMemo is a tiered memory service for agentic LLMs that splits memory into per-agent working state (Tier A), sharded evidence with shard-local ANN indexes and masked Mixture-of-Experts routing (Tier B), and a versioned skill library (Tier C). It enforces eligibility masking before routing, trains a supervised router from evidence→shard labels, and uses cost-aware gating plus TopB/TopP probe caps. On LoCoMo, HotpotQA, and ToolBench, ShardMemo improves answer quality while reducing vectors scanned and tail latency under matched budgets.
Problem Statement
Agentic LLMs need scalable, budgeted memory access that respects scope/permissions and keeps latency bounded. Centralized indexes and heuristic shard routing either become bottlenecks under concurrent access or miss evidence when probing too few shards. The challenge is to select a small set of shards per request (under a probe budget) while keeping high evidence coverage and low cost.
Main Contribution
ShardMemo: a tiered memory service separating per-agent working state (Tier A), sharded evidence with shard-local ANN (Tier B), and a versioned skill library (Tier C).
A scope-before-routing masked MoE router that masks ineligible shards first, then probes at most a capped number of shards using TopB or adaptive TopP selection, with an optional cost bias.
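The routing step can be illustrated with a minimal sketch. This summary does not give the paper's exact gating formula, so the code below is an assumption-laden illustration: the function name `masked_route`, the parameter names, and the choice to apply cost bias by subtracting `cost_weight * cost` from each logit are all hypothetical. It shows the ordering that matters: eligibility masking happens before the softmax, and both TopB and TopP respect the probe cap.

```python
import math

def masked_route(scores, eligible, b_probe, top_p=None, costs=None, cost_weight=0.0):
    """Pick shards to probe: mask first, then TopB or capped TopP.

    scores: dict shard_id -> router score (hypothetical input format)
    eligible: set of shard_ids the request is allowed to read
    b_probe: hard cap on the number of shards probed
    top_p: if set, stop early once cumulative probability reaches top_p
    costs, cost_weight: assumed linear cost bias, not the paper's formula
    """
    # Ineligible shards never enter the softmax.
    logits = {}
    for shard, score in scores.items():
        if shard in eligible:
            cost = costs.get(shard, 0.0) if costs else 0.0
            logits[shard] = score - cost_weight * cost
    if not logits:
        return []

    # Numerically stable softmax over the eligible set only.
    peak = max(logits.values())
    exps = {s: math.exp(v - peak) for s, v in logits.items()}
    total = sum(exps.values())
    probs = {s: e / total for s, e in exps.items()}

    # Rank by probability, breaking ties by shard id for determinism.
    ranked = sorted(probs, key=lambda s: (-probs[s], s))
    if top_p is None:
        return ranked[:b_probe]  # TopB: fixed number of probes

    chosen, mass = [], 0.0
    for shard in ranked[:b_probe]:  # TopP never exceeds the cap
        chosen.append(shard)
        mass += probs[shard]
        if mass >= top_p:
            break
    return chosen
```

With TopP, an easy query whose top shard already holds most of the probability mass probes fewer shards than the cap, which is one way a router could lower cost without a fixed budget per request.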
Key Findings
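ShardMemo improves LoCoMo QA F1 over the strongest baseline (GAM) with GPT-OSS-120B: +5.70 single-hop (64.08 vs 58.38) and +5.11 multi-hop (46.28 vs 41.17), per Table 1.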
Under a fixed probe budget (B_probe=3), ShardMemo reduces retrieval work and tail latency while increasing accuracy.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| LoCoMo Single-hop F1 (GPT-OSS-120B) | 64.08 (ShardMemo) | 58.38 (GAM) | +5.70 | LoCoMo Single Hop | Table 1 reports single-hop F1 for ShardMemo vs GAM | Table 1 |
| LoCoMo Multi-hop F1 (GPT-OSS-120B) | 46.28 (ShardMemo) | 41.17 (GAM) | +5.11 | LoCoMo Multi Hop | Table 1 reports multi-hop F1 improvements | Table 1 |
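The Delta column can be rechecked with a few lines of arithmetic. The sketch below recomputes the absolute deltas from the two rows above and also derives relative improvements; the relative figures are computed here from the table values and are not reported in the source.

```python
# (metric, ShardMemo value, GAM baseline) copied from the table above.
rows = [
    ("LoCoMo Single-hop F1", 64.08, 58.38),
    ("LoCoMo Multi-hop F1", 46.28, 41.17),
]

def deltas(value, baseline):
    """Return (absolute delta, relative improvement in percent), rounded to 2 places."""
    absolute = round(value - baseline, 2)
    relative_pct = round(100.0 * (value - baseline) / baseline, 2)
    return absolute, relative_pct

for name, value, baseline in rows:
    absolute, relative_pct = deltas(value, baseline)
    print(f"{name}: {absolute:+.2f} F1 ({relative_pct:+.2f}% relative)")
```

The absolute deltas match the table (+5.70 and +5.11); the derived relative gains come out at roughly 9.8% and 12.4% over GAM.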
What To Try In 7 Days
Enforce scope-before-routing in your retrieval layer to avoid probing ineligible shards.
Measure VecScan and p95 and compare against a simple cosine-to-prototype router under a fixed probe cap.
If you have evidence→shard labels, train a small masked MoE router and evaluate hit-rate and cost.
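A baseline for the comparison above could look like the following sketch. It assumes each shard is summarized by a single prototype vector (for example, a centroid of its embeddings) and that each query has one gold shard label; the function names and the nearest-rank p95 definition are illustrative choices, not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def prototype_route(query, prototypes, eligible, b_probe):
    """Rank eligible shards by cosine similarity to their prototype; keep at most b_probe."""
    scored = [(cosine(query, vec), shard)
              for shard, vec in prototypes.items() if shard in eligible]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [shard for _, shard in scored[:b_probe]]

def shard_hit_rate(routes, gold_shards):
    """Fraction of queries whose gold shard is among the probed shards."""
    if not gold_shards:
        return 0.0
    hits = sum(1 for probed, gold in zip(routes, gold_shards) if gold in probed)
    return hits / len(gold_shards)

def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]
```

Running the same queries through this baseline and through a trained router, with an identical `b_probe`, gives a like-for-like view of hit rate and tail latency.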
Agent Features
Memory
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Experiments run on a single controlled server; behavior at large multi-node scale is untested.
Router training needs evidence→shard labels; untrained masked MoE performs poorly.
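To make the supervision requirement concrete, here is a hedged sketch of one plausible training signal: a softmax cross-entropy over eligible shards only, with the labeled evidence shard as the target. The summary does not state the paper's actual objective, so this loss and the function name `masked_cross_entropy` are assumptions for illustration.

```python
import math

def masked_cross_entropy(logits, eligible, gold):
    """Loss and gradient for one query under an eligibility mask.

    logits: list of router scores, one per shard index
    eligible: set of shard indices the request may read
    gold: index of the shard that holds the labeled evidence
    Returns (loss, grad) where grad[i] = d loss / d logits[i].
    """
    if gold not in eligible:
        raise ValueError("gold shard must be eligible for this request")

    # Softmax restricted to eligible shards; masked shards get zero probability.
    peak = max(logits[i] for i in eligible)
    exps = {i: math.exp(logits[i] - peak) for i in eligible}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    loss = -math.log(probs[gold])

    # Gradient is (prob - one_hot) on eligible shards and zero on masked ones.
    grad = [0.0] * len(logits)
    for i, p in probs.items():
        grad[i] = p - (1.0 if i == gold else 0.0)
    return loss, grad
```

Because masked shards receive zero gradient, a high score on an ineligible shard neither helps nor hurts training, which keeps the training signal consistent with scope-before-routing at inference time.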
When Not To Use
Very small corpora where sharding overhead outweighs benefits.
When you cannot obtain evidence→shard supervision and cannot tolerate untrained routers.
Failure Modes
Untrained masked MoE routing yields lower accuracy and higher cost (shown in ablation).
Filtering ineligible shards only after retrieval reduces ShardHit and increases cost.
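The second failure mode can be reproduced with a toy example. The sketch below uses made-up scores and hypothetical function names; it only illustrates why spending the probe budget before applying the eligibility filter wastes probes, and it does not reproduce the paper's ablation numbers.

```python
def probe_mask_first(scores, eligible, b_probe):
    """Scope-before-routing: rank only eligible shards, then spend the budget."""
    ranked = sorted((s for s in scores if s in eligible),
                    key=lambda s: (-scores[s], s))
    return ranked[:b_probe]

def probe_filter_after(scores, eligible, b_probe):
    """Post-filtering: spend the budget on all shards, then drop ineligible ones."""
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    probed = ranked[:b_probe]                        # cost is paid for every probe
    useful = [s for s in probed if s in eligible]    # only these results survive
    return probed, useful

# Made-up router scores: the two top-scoring shards are out of scope.
scores = {"s0": 0.9, "s1": 0.8, "s2": 0.7, "s3": 0.6, "s4": 0.5}
eligible = {"s2", "s3", "s4"}

masked = probe_mask_first(scores, eligible, b_probe=3)
probed, useful = probe_filter_after(scores, eligible, b_probe=3)
print("mask first:  ", masked)
print("filter after:", probed, "->", useful)
```

In this toy case, masking first probes three eligible shards, while filtering afterwards pays for three probes and keeps results from only one, so evidence in `s3` or `s4` would be missed at the same cost.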

