Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
ShardMemo reduces retrieval cost and tail latency while improving accuracy for agent workflows, making LLM-based agents faster and more reliable under budgeted memory access.
Summary TLDR
ShardMemo is a tiered memory service for agentic LLMs that splits memory into per-agent working state (Tier A), sharded evidence with shard-local ANN indexes and masked Mixture-of-Experts routing (Tier B), and a versioned skill library (Tier C). It enforces eligibility masking before routing, trains a supervised router from evidence→shard labels, and uses cost-aware gating plus TopB/TopP probe caps. On LoCoMo, HotpotQA, and ToolBench, ShardMemo improves answer quality while reducing vectors scanned and tail latency under matched budgets.
Problem Statement
Agentic LLMs need scalable, budgeted memory access that respects scope/permissions and keeps latency bounded. Centralized indexes and heuristic shard routing either become bottlenecks under concurrent access or miss evidence when probing too few shards. The challenge is to select a small set of shards per request (under a probe budget) while keeping high evidence coverage and low cost.
Main Contribution
ShardMemo: a tiered memory service separating per-agent working state (Tier A), sharded evidence with shard-local ANN (Tier B), and a versioned skill library (Tier C).
A scope-before-routing masked MoE router that excludes ineligible shards before routing, then probes at most a capped number of shards (B_probe) using TopB or adaptive TopP selection, with an optional cost bias.
A supervised router-training protocol (evidence→shard labels), plus an evaluation on LoCoMo, HotpotQA, and ToolBench showing higher task F1 with less retrieval work and lower tail latency.
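The scope-before-routing idea with a TopB probe cap can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `masked_softmax` and `top_b` and the example logits are hypothetical, and in practice the logits would come from the trained router.

```python
import math

def masked_softmax(logits, eligible):
    """Softmax over eligible shards only; ineligible shards get probability 0."""
    live = [l for l, ok in zip(logits, eligible) if ok]
    if not live:
        return [0.0] * len(logits)
    m = max(live)  # subtract the max for numerical stability
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, eligible)]
    z = sum(exps)
    return [e / z for e in exps]

def top_b(probs, b_probe):
    """Indices of up to b_probe highest-probability shards (masked shards never appear)."""
    ranked = sorted((i for i, p in enumerate(probs) if p > 0.0),
                    key=lambda i: -probs[i])
    return ranked[:b_probe]

# Hypothetical example: shard 1 has the highest logit but is out of scope.
logits = [2.0, 5.0, 1.0, 3.0]
eligible = [True, False, True, True]
probs = masked_softmax(logits, eligible)
probe_set = top_b(probs, b_probe=2)  # shard 1 is never probed
```

Because the mask is applied before normalization, an ineligible shard cannot consume part of the probe budget, however high its raw score.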
Key Findings
ShardMemo improves LoCoMo QA F1 over the strongest baseline (GAM).
Under a fixed probe budget (B_probe=3), ShardMemo reduces retrieval work and tail latency while increasing accuracy.
Tier C skill retrieval yields higher usable-skill precision and step reduction on ToolBench.
The masked MoE router depends on supervision: in the ablation, untrained masked routing is both less accurate and more costly.
Results
LoCoMo Single-hop F1 (GPT-OSS-120B)
LoCoMo Multi-hop F1 (GPT-OSS-120B)
HotpotQA F1 (56K / 224K / 448K tokens)
ToolBench Precision@3
VecScan and p95 at B_probe=3
Who Should Care
What To Try In 7 Days
Enforce scope-before-routing in your retrieval layer to avoid probing ineligible shards.
Measure VecScan and p95 and compare against a simple cosine-to-prototype router under a fixed probe cap.
If you have evidence→shard labels, train a small masked MoE router and evaluate hit-rate and cost.
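For the comparison suggested above, a baseline router and a tail-latency measurement might look like the following sketch. All names here (`cosine`, `prototype_route`, `p95`) and the shard identifiers are hypothetical, the prototype vectors are assumed to be per-shard summaries you compute yourself (for example, centroids), and the percentile uses the nearest-rank method, which is one of several common conventions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def prototype_route(query, prototypes, eligible, b_probe):
    """Baseline router: rank eligible shards by cosine to their prototype, keep b_probe."""
    scored = [(cosine(query, vec), sid)
              for sid, vec in prototypes.items() if sid in eligible]
    scored.sort(key=lambda t: -t[0])
    return [sid for _, sid in scored[:b_probe]]

def p95(latencies_ms):
    """95th percentile by the nearest-rank method, using integer arithmetic."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) without floats
    return ordered[rank - 1]

# Hypothetical prototypes for three shards.
prototypes = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
route_all = prototype_route([1.0, 0.1], prototypes, {"a", "b", "c"}, b_probe=2)
route_scoped = prototype_route([1.0, 0.1], prototypes, {"b", "c"}, b_probe=2)
```

Running both this baseline and a trained router under the same `b_probe` keeps the comparison budget-matched.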
Agent Features
Memory
- Short-term per-agent working memory (Tier A)
- Sharded evidence memory with eligibility masks (Tier B)
- Versioned procedural skills (Tier C)
Tool Use
- Versioned skill library (Tier C) for reusable procedures
Frameworks
- TopB probe
- Adaptive TopP
- Cost-aware gating
Is Agentic
true
Architectures
- Tiered memory (A/B/C)
- Masked MoE routing (shard-as-expert)
- Shard-local ANN indexes
Optimization Features
Token Efficiency
- Merges shard-local candidates to return TopK evidence; reduces vectors scanned under budget
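The merge step can be illustrated with a small sketch. It assumes each probed shard returns a list of `(score, doc_id)` pairs where a higher score is better; the function name `merge_top_k` and the data layout are hypothetical, not taken from the paper.

```python
import heapq

def merge_top_k(shard_results, k):
    """Merge per-shard candidate lists and keep the k best by score.

    shard_results maps shard id -> list of (score, doc_id) pairs.
    """
    all_candidates = (c for hits in shard_results.values() for c in hits)
    return heapq.nlargest(k, all_candidates, key=lambda c: c[0])

# Hypothetical results from two probed shards.
shard_results = {
    "s1": [(0.9, "a"), (0.4, "b")],
    "s2": [(0.8, "c"), (0.7, "d")],
}
top3 = merge_top_k(shard_results, k=3)
```

Only shards in the probe set contribute candidates, so the merge cost grows with the probe budget rather than with the total number of shards.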
Infra Optimization
- Sharded storage and shard-local ANN indexes (parallel search, localized work)
System Optimization
- Scope-before-routing eligibility masks reduce wasted retrieval work
- Shard-local ANN reduces per-query index work and improves parallelism
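One way to build an eligibility mask before routing is sketched below. The `Shard` fields (`tenant`, `scopes`) and the request parameters are assumptions made for illustration; the source does not specify what metadata eligibility is computed from.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Shard:
    shard_id: str
    tenant: str          # hypothetical ownership field
    scopes: frozenset    # hypothetical set of scopes allowed to read this shard

def eligibility_mask(shards, tenant, scope):
    """One boolean per shard: True only if the request may read that shard."""
    return [s.tenant == tenant and scope in s.scopes for s in shards]

shards = [
    Shard("s0", "acme", frozenset({"support", "billing"})),
    Shard("s1", "acme", frozenset({"billing"})),
    Shard("s2", "other", frozenset({"support"})),
]
mask = eligibility_mask(shards, tenant="acme", scope="support")
```

The mask is computed once per request and passed to the router, so ineligible shards are dropped before any index work is scheduled.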
Training Optimization
- Supervised router training from evidence→shard labels (multi-positive set-likelihood objective)
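A multi-positive set-likelihood loss can be read as the negative log of the total routing probability assigned to the set of shards that hold gold evidence. The sketch below implements that reading over eligible shards only; the exact objective used in the paper may differ, and the function names are hypothetical.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def set_likelihood_loss(logits, eligible, positives):
    """-log of the masked-softmax mass on the positive (evidence-bearing) shards."""
    live = [i for i, ok in enumerate(eligible) if ok]
    pos = [i for i in live if i in positives]
    if not pos:
        raise ValueError("no eligible positive shard for this example")
    return logsumexp([logits[i] for i in live]) - logsumexp([logits[i] for i in pos])

# Uniform logits over four eligible shards, two of which hold evidence:
# half of the probability mass is on the positive set.
loss_uniform = set_likelihood_loss([0.0, 0.0, 0.0, 0.0], [True] * 4, {0, 1})
# If only the two positive shards are eligible, all mass is on the positive set.
loss_masked = set_likelihood_loss([0.0, 0.0, 0.0, 0.0], [True, True, False, False], {0, 1})
```

Summing probability over the positive set means the router is not penalized for splitting mass among several correct shards.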
Inference Optimization
- Budgeted probe cap (B_probe) to limit shards probed
- Adaptive TopP to trade off covered routing probability mass against probe count
- Cost-aware gating to favor low-cost shard families
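The three controls above can be combined in a short sketch. Two assumptions are made here and are not confirmed by the source: the cost bias is modeled as subtracting `lam * cost` from each shard's logit, and adaptive TopP stops once the selected shards cover `p_mass` of the routing probability or the cap `b_probe` is reached. All function and parameter names are hypothetical.

```python
def cost_biased_logits(logits, costs, lam):
    """Assumed form of cost-aware gating: penalize each logit by lam * cost."""
    return [l - lam * c for l, c in zip(logits, costs)]

def adaptive_top_p(probs, p_mass, b_probe):
    """Pick shards in descending probability until p_mass is covered or b_probe is hit."""
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    chosen, mass = [], 0.0
    for i in ranked:
        if len(chosen) >= b_probe or mass >= p_mass or probs[i] <= 0.0:
            break
        chosen.append(i)
        mass += probs[i]
    return chosen

# Hypothetical routing distribution over four shards.
probs = [0.6, 0.25, 0.1, 0.05]
confident = adaptive_top_p(probs, p_mass=0.5, b_probe=3)   # one probe suffices
typical = adaptive_top_p(probs, p_mass=0.8, b_probe=3)     # two probes cover 0.85
capped = adaptive_top_p(probs, p_mass=0.99, b_probe=3)     # the cap stops at three
biased = cost_biased_logits([1.0, 1.0], [0.0, 2.0], lam=0.5)
```

With a peaked distribution TopP probes fewer shards than a fixed TopB would, while the cap still bounds worst-case work.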
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments run on a single controlled server; behavior at large multi-node scale is untested.
- Router training needs evidence→shard labels; untrained masked MoE performs poorly.
- Shard-map updates and dynamic rebalancing are discussed but experiments use a fixed shard map.
- The paper focuses on retrieval and does not address the end-to-end cost of large-scale deployment (network overhead, multi-tenant contention).
When Not To Use
- Very small corpora where sharding overhead outweighs benefits.
- When you cannot obtain evidence→shard supervision and cannot tolerate untrained routers.
- Ultra-low-latency appliances where any additional routing step is unacceptable.
Failure Modes
- Untrained masked MoE routing yields lower accuracy and higher cost (shown in ablation).
- Filtering ineligible shards only after retrieval reduces ShardHit and increases cost.
- Misestimated cost bias could over- or under-prioritize shards, hurting hit-rate or cost.
- Workload drift can force conservative probing or reduce coverage if shard summaries are stale.
Core Entities
Models
- GPT-OSS-120B
Metrics
- F1
- BLEU-1
- Precision@R
- StepRed
- VecScan
- p95 latency
- ShardHit@B_probe
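As a rough illustration of how a shard hit-rate under a probe cap could be computed, the sketch below counts a query as a hit when at least one shard holding gold evidence appears in its probe set. This definition is an assumption; the paper's exact definition of ShardHit@B_probe may differ (for example, it might require all evidence shards to be covered), and the function name is hypothetical.

```python
def shard_hit_rate(probed_per_query, gold_per_query):
    """Fraction of queries whose probe set contains at least one gold-evidence shard."""
    if len(probed_per_query) != len(gold_per_query):
        raise ValueError("need one probe set and one gold set per query")
    if not probed_per_query:
        return 0.0
    hits = sum(1 for probed, gold in zip(probed_per_query, gold_per_query)
               if set(probed) & set(gold))
    return hits / len(probed_per_query)

# Hypothetical probe sets (B_probe = 3) and gold shard sets for three queries.
probed = [[0, 1, 2], [3, 4, 5], [1, 2, 3]]
gold = [{2}, {0}, {3, 9}]
rate = shard_hit_rate(probed, gold)  # queries 1 and 3 hit, query 2 misses
```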
Datasets
- LoCoMo
- HotpotQA
- ToolBench
Benchmarks
- LoCoMo
- HotpotQA
- ToolBench

