Reduce KV memory and bandwidth for MoE inference by sharding, routing, compression, and adaptive scheduling

August 2, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

The prototype is implemented and tested on up to 16 GPUs. The paper provides analytic formulas and a system design but lacks large-scale production experiments and end-to-end numerical benchmarks in the text.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.72

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/0

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu, Xuhong Wang

Why It Matters For Business

If you serve MoE models or long-context LLMs, KV cache memory and network traffic are major cost drivers. PiKV helps cut per-GPU memory and inter-GPU bandwidth through sharding, selective access, and compression, which reduces infrastructure needs or enables longer contexts on the same cluster.

Summary TLDR

PiKV is an open-source system that treats the KV cache for Mixture-of-Experts (MoE) models as a distributed, query-driven service. It shards KV storage by expert across GPUs, routes tokens to a small set of experts per query, applies modular compression, and uses activity-aware scheduling to keep only useful entries in fast memory. The prototype runs on up to 16 GPUs and integrates LoRA/PyramidKV-style compression to cut memory and communication costs.

Problem Statement

MoE models keep dense, globally shared KV caches, which inflate memory use and cross-GPU communication as context length and expert count grow. The result is high latency, and long-context MoE inference becomes costly or infeasible on typical GPU clusters.

Main Contribution

Expert-sharded KV layout: assign KV shards to GPUs so each device stores only a fraction of the global cache.

Sparse, cache-aware routing: route each query to a small top-k subset of experts to avoid touching the whole cache.
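
The two contributions above can be sketched together. This is a minimal illustration, not PiKV's code: the round-robin placement rule, the function names, and the router scores are all hypothetical and chosen only to show how top-k routing limits the number of shards a query touches.

```python
import heapq


def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts, best first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda e: scores[e])


def expert_to_gpu(expert_id, num_gpus):
    """Hypothetical placement rule: spread experts round-robin over GPUs."""
    return expert_id % num_gpus


def shards_touched(scores, k, num_gpus):
    """GPUs whose KV shards a single query must read under top-k routing."""
    return sorted({expert_to_gpu(e, num_gpus) for e in top_k_experts(scores, k)})


# Made-up router scores for E = 8 experts.
scores = [0.1, 0.7, 0.05, 0.9, 0.2, 0.3, 0.0, 0.6]
print(top_k_experts(scores, k=2))               # experts selected for this query
print(shards_touched(scores, k=2, num_gpus=4))  # GPUs that hold their shards
```

With k = 2 and 4 GPUs, the query reads from at most two devices instead of all four, which is the effect the sparse routing contribution aims for.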

Key Findings

Expert-sharded storage changes per-device memory scaling from proportional to E·L to proportional to L/G + L/E (analytic).

Numbers: per-device memory O(E·L) → O(L/G + L/E) (Section 3.1, analytic)

Practical Use: Shard KV by expert across GPUs to cut per‑GPU KV footprint. Use the paper's closed-form to pick shard capacity S for your token budget and compression ratio.

Evidence Ref: Section 3.1
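
To get a feel for the two growth terms, the sketch below evaluates them for one configuration. It assumes E is the number of experts, L the context length in tokens, and G the number of GPUs; this summary does not define the symbols, so that reading is an assumption. Big-O notation hides constant factors, so the ratio compares growth terms only and is not a predicted memory saving.

```python
def dense_units(E, L):
    """Growth term for the dense, globally shared layout: E * L."""
    return E * L


def sharded_units(E, L, G):
    """Growth term for the expert-sharded layout: L/G + L/E."""
    return L / G + L / E


# Illustrative values: 16 experts, 128K-token context, 16 GPUs.
E, L, G = 16, 128 * 1024, 16
ratio = dense_units(E, L) / sharded_units(E, L, G)
print(dense_units(E, L), sharded_units(E, L, G), ratio)
```

Under these values the sharded term is 128 times smaller than the dense term. The real per-device footprint also depends on constants such as layer count, head dimension, and compression ratio.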

A concrete example: a 7B MoE with 128K context and 16 experts yields a full KV cache >24 GB.

Numbers: 7B model, 128K context, 16 experts → full KV >24 GB (Intro)

Practical Use: Long-context MoE models can exceed single‑GPU memory. Apply sharding + compression to avoid large cross‑GPU transfers or OOMs.

Evidence Ref: Introduction
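
A back-of-the-envelope calculation shows why a cache of this size appears at long context. The hyperparameters below (32 layers, 32 KV heads, head dimension 128, fp16) are assumptions typical of a dense 7B transformer. They are not the configuration behind the paper's ">24 GB" figure, which this summary does not state, so the result differs from that number.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes for one sequence's KV cache; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem


# Assumed 7B-style configuration at a 128K-token context, fp16 storage.
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      context_len=128 * 1024)
print(full / GIB)        # 64.0 GiB under these assumed hyperparameters
print(full / GIB / 16)   # 4.0 GiB per GPU if split evenly over 16 GPUs
```

Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache, so the single-GPU limit is reached at different context lengths depending on the architecture.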

What To Try In 7 Days

Clone the PiKV repo and run the prototype on a small multi-GPU node (link in code_urls).

Measure baseline KV memory and cross‑GPU traffic for your MoE model, then enable only expert sharding and compare.

Enable one compression module (LoRA or PyramidKV) and check memory vs decoding latency trade-offs.
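
Before running the compression experiment, it can help to estimate the expected size reduction. The sketch below assumes a generic low-rank factorization in which an n × d key or value matrix is stored as n × r codes plus one shared r × d basis. That is an illustrative stand-in, not a description of PiKV's LoRA or PyramidKV modules, and the function name is hypothetical.

```python
def low_rank_ratio(n_tokens, d, r):
    """Compressed size / original size for an n x d matrix kept as
    (n x r) per-token codes plus a single shared (r x d) basis."""
    return (n_tokens * r + r * d) / (n_tokens * d)


# Sweep a few ranks for 4096 cached tokens with head dimension 128.
for rank in (8, 16, 32, 64):
    print(rank, low_rank_ratio(4096, 128, rank))
```

Lower ranks shrink the cache further but add reconstruction work and approximation error at decode time, which is the memory versus latency trade-off to measure.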

Agent Features

Memory
expert-sharded KV cache; circular buffer per shard
Planning
query-aware stream scheduling
Tool Use
LoRA; PyramidKV; Duo; NCCL/TCP for communication
Frameworks
PiKV
Architectures
MoE

Optimization Features

Token Efficiency
reduces token-to-KV access by routing to k ≪ E experts
Infra Optimization
reduces cross-GPU communication and memory pressure; design rules for shard capacity S to minimize per-GPU memory
System Optimization
shard placement and circular buffers for O(1) insertions; asynchronous pipeline to overlap routing/compression/IO
Inference Optimization
expert-sharded KV storage to reduce per-GPU memory; sparse expert routing to avoid large KV reads; modular compression to reduce KV bytes; query-aware scheduling to retain high-utility pages
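
The circular buffer with O(1) insertions mentioned under System Optimization can be sketched as follows. The class name, fields, and the choice to overwrite the oldest entry when full are assumptions for illustration; the summary does not describe PiKV's actual buffer layout.

```python
class ShardRingBuffer:
    """Fixed-capacity ring buffer for one KV shard (hypothetical sketch)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0   # index of the next slot to write
        self.size = 0   # number of valid entries

    def insert(self, token_id, kv):
        """Insert in O(1); return the overwritten entry once the buffer is full."""
        evicted = self.slots[self.head] if self.size == self.capacity else None
        self.slots[self.head] = (token_id, kv)
        self.head = (self.head + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)
        return evicted

    def entries(self):
        """Return valid entries ordered from oldest to newest."""
        start = (self.head - self.size) % self.capacity
        return [self.slots[(start + i) % self.capacity] for i in range(self.size)]


buf = ShardRingBuffer(capacity=3)
for t in range(5):
    buf.insert(t, f"kv{t}")
print([token for token, _ in buf.entries()])  # the three most recent tokens
```

Each insert touches one slot and two counters regardless of buffer size. A plain ring buffer evicts strictly by age, so a utility-aware scheduler would need to sit on top of it to keep older but still useful entries.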

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Prototype evaluated up to 16 GPUs; behavior on hundreds of GPUs is untested.

The paper gives analytic formulas and design rules but few detailed end-to-end quantitative benchmarks in the text.

When Not To Use

You are not using an MoE model (no expert sparsity gains).

Cluster has abundant GPU memory and bandwidth for full dense KV caches (simple replication is cheaper).

Failure Modes

Evicting or compressing high-utility KV pages can degrade generation quality if the scheduler or compressor misestimates utility.

Routing mistakes (wrong top-k experts) cause missed KV entries and worse attention context.
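
The eviction failure mode can be made concrete with a toy utility-scored cache. The scoring rule here (exponentially decayed hit counts, evict the minimum) is a hypothetical stand-in for whatever estimator a scheduler uses. It is not PiKV's scheduling policy.

```python
class UtilityCache:
    """Toy page cache that evicts the page with the lowest decayed hit score."""

    def __init__(self, capacity, decay=0.5):
        self.capacity = capacity
        self.decay = decay
        self.utility = {}  # page_id -> estimated utility

    def access(self, page_id):
        """Record an access; return the evicted page id, or None."""
        for p in self.utility:
            self.utility[p] *= self.decay       # older hits count for less
        self.utility[page_id] = self.utility.get(page_id, 0.0) + 1.0
        if len(self.utility) > self.capacity:
            victim = min(self.utility, key=self.utility.get)
            del self.utility[victim]
            return victim
        return None


cache = UtilityCache(capacity=2)
cache.access("a")
cache.access("b")
cache.access("b")
evicted = cache.access("c")      # "a" looks cold and is dropped
print(evicted, "a" in cache.utility)
```

If a later query attends strongly to page "a", the entry is already gone. Recent-activity estimates can therefore misjudge pages that are rarely read but important when they are, which is why eviction quality should be validated against generation quality and not only against hit rate.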

Core Entities

Models

MoE

Metrics

per-GPU memory (analytic); KV read/decode time vs compression ratio; reuse-distance

Context Entities

Models

7B-scale MoE (example)