Reduce KV memory and bandwidth for MoE inference by sharding, routing, compression, and adaptive scheduling

Overview

Decision Snapshot: Needs Validation

The prototype is implemented and tested on up to 16 GPUs. The paper provides analytic formulas and a system design but lacks large-scale production experiments and end-to-end numerical benchmarks in the text.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.72

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/0

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu, Xuhong Wang

Links

Abstract / PDF / Code

Why It Matters For Business

If you serve MoE models or long-context LLMs, KV cache memory and network traffic are major cost drivers. PiKV helps cut per-GPU memory and inter-GPU bandwidth by sharding, selective access, and compression—reducing infrastructure needs or enabling longer contexts on the same cluster.

Who Should Care

ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

PiKV is an open-source system that treats the KV cache for Mixture-of-Experts (MoE) models as a distributed, query-driven service. It shards KV storage by expert across GPUs, routes tokens to a small set of experts per query, applies modular compression, and uses activity-aware scheduling to keep only useful entries in fast memory. The prototype runs on up to 16 GPUs and integrates LoRA/PyramidKV-style compression to cut memory and communication costs.

Problem Statement

MoE models keep dense, globally shared KV caches, which inflate memory use and cross-GPU communication as context length and expert count grow. The result is high latency, and long-context MoE inference becomes costly or infeasible on typical GPU clusters.

Main Contribution

Expert-sharded KV layout: assign KV shards to GPUs so each device stores only a fraction of the global cache.

Sparse, cache-aware routing: route each query to a small top-k subset of experts to avoid touching the whole cache.
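The two contributions above can be sketched together in a few lines. This is a minimal illustration, not PiKV's implementation: the round-robin placement rule, the function names, and the router scores are all assumptions made for the example.

```python
import heapq

def expert_to_gpu(expert_id: int, num_gpus: int) -> int:
    """Map an expert's KV shard to a GPU.

    Round-robin placement is an assumption for illustration; the paper
    may place shards differently.
    """
    return expert_id % num_gpus

def route_top_k(scores: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring experts, best first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

# Hypothetical router scores for E = 8 experts.
scores = [0.1, 0.7, 0.05, 0.9, 0.3, 0.2, 0.6, 0.0]
experts = route_top_k(scores, k=2)

# Only the GPUs holding the selected experts' shards need to be contacted.
gpus = sorted({expert_to_gpu(e, num_gpus=4) for e in experts})
print(experts, gpus)
```

With k much smaller than E, a query touches only the shards of the selected experts, so most GPUs are never contacted for that token.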

Key Findings

Expert-sharded storage changes per-device memory scaling from proportional to E·L to proportional to L/G + L/E (analytic).

Numbers: per-device memory O(E·L) -> O(L/G + L/E) (Section 3.1, analytic)

Practical Use: Shard KV by expert across GPUs to cut per-GPU KV footprint. Use the paper's closed-form expression to pick shard capacity S for your token budget and compression ratio.

Evidence Ref: Section 3.1
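To get a feel for the scaling change, the two asymptotic terms can be compared directly. The sketch below drops all constant factors (bytes per entry, layers, heads), so the outputs are abstract "units", not bytes. The choice of E = 16 and L = 128K borrows from the paper's introductory example, and G = 16 matches the largest prototype size; combining them here is only for illustration.

```python
def dense_units(num_experts: int, context_len: int) -> float:
    """Per-device scaling term for a dense, globally shared cache: E * L."""
    return num_experts * context_len

def sharded_units(num_experts: int, context_len: int, num_gpus: int) -> float:
    """Per-device scaling term with expert sharding: L/G + L/E."""
    return context_len / num_gpus + context_len / num_experts

E, L, G = 16, 128 * 1024, 16  # illustrative values, constants omitted

dense = dense_units(E, L)
sharded = sharded_units(E, L, G)
print(f"dense:   {dense:,.0f} units")
print(f"sharded: {sharded:,.0f} units")
print(f"ratio:   {dense / sharded:.0f}x")
```

Because constants are omitted, the ratio indicates only how the two terms grow relative to each other, not a memory saving you should expect to measure.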

A concrete example: a 7B MoE with 128K context and 16 experts yields a full KV cache >24 GB.

Numbers: 7B model, 128K context, 16 experts → full KV >24 GB (Introduction)

Practical Use: Long-context MoE models can exceed single-GPU memory. Apply sharding + compression to avoid large cross-GPU transfers or OOMs.

Evidence Ref: Introduction
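A back-of-the-envelope calculation shows why a cache of this size is plausible. The sketch uses the standard dense-attention KV formula; the layer count, head count, head dimension, and fp16 precision are assumed values typical of a 7B transformer, not figures from the paper, so it does not reproduce the paper's exact accounting.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers the separate K and V tensors.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 7B-class dimensions (hypothetical): 32 layers, 32 KV heads,
# head_dim 128, fp16 storage, 128K-token context.
total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=128 * 1024)
print(f"{total / 2**30:.1f} GiB")  # 64.0 GiB under these assumptions
```

Under these assumptions a single 128K-token sequence already needs far more than 24 GB, which is consistent with the claim that such caches exceed single-GPU memory.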

What To Try In 7 Days

Clone the PiKV repo (see Code URLs) and run the prototype on a small multi-GPU node.

Measure baseline KV memory and cross-GPU traffic for your MoE model, then enable only expert sharding and compare.

Enable one compression module (LoRA or PyramidKV) and check memory vs decoding latency trade-offs.
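Before running the compression experiment, it can help to estimate what a low-rank scheme could save on paper. The sketch below assumes the compressor stores a rank-r factorization of an n × d KV block as two factors of shapes n × r and r × d. That is an assumption about how a LoRA-style compressor works, not a description of PiKV's modules; check the repo for the actual behavior.

```python
def low_rank_ratio(n_tokens: int, dim: int, rank: int) -> float:
    """Compressed size as a fraction of the full n_tokens x dim block.

    Assumes storage of two factors: (n_tokens x rank) and (rank x dim).
    """
    full = n_tokens * dim
    factored = rank * (n_tokens + dim)
    return factored / full

# Hypothetical block: 4096 tokens, head_dim 128, rank 16.
ratio = low_rank_ratio(n_tokens=4096, dim=128, rank=16)
print(f"compressed size: {ratio:.1%} of original")
```

The estimate covers memory only; reconstruction cost at decode time is what the latency measurement in the last step is meant to capture.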

Agent Features

Memory

expert-sharded KV cache

circular buffer per shard
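A per-shard circular buffer can be sketched as a fixed-size list with a moving write position, which gives constant-time insertion and implicit eviction of the oldest entry. The class and method names below are hypothetical, and the payload is an opaque Python object rather than real K/V tensors.

```python
class ShardRingBuffer:
    """Fixed-capacity ring buffer holding (token_id, kv) entries for one shard."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0   # next write position
        self.size = 0   # number of occupied slots

    def insert(self, token_id: int, kv) -> object:
        """Write in O(1); return the overwritten entry, or None if the slot was empty."""
        evicted = self.slots[self.head]
        self.slots[self.head] = (token_id, kv)
        self.head = (self.head + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)
        return evicted

    def entries(self) -> list:
        """Return stored entries ordered from oldest to newest."""
        start = (self.head - self.size) % self.capacity
        return [self.slots[(start + i) % self.capacity] for i in range(self.size)]

buf = ShardRingBuffer(capacity=3)
evictions = [buf.insert(t, f"kv{t}") for t in range(5)]
print([token for token, _ in buf.entries()])
```

Once the buffer is full, each insert overwrites the oldest entry, so the shard's memory footprint stays bounded by its capacity.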

Planning

query-aware stream scheduling

Tool Use

LoRA, PyramidKV, Duo, NCCL/TCP for communication

Frameworks

PiKV

Architectures

MoE

Optimization Features

Token Efficiency

reduces token-to-KV access by routing to k ≪ E experts

Infra Optimization

reduces cross-GPU communication and memory pressure

design rules for shard capacity S to minimize per-GPU memory

System Optimization

shard placement and circular buffers for O(1) insertions

asynchronous pipeline to overlap routing/compression/IO

Inference Optimization

expert-sharded KV storage to reduce per-GPU memory

sparse expert routing to avoid large KV reads

modular compression to reduce KV bytes

query-aware scheduling to retain high-utility pages
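Retaining high-utility pages can be illustrated with a simple score-and-select loop. The utility definition here, an exponential moving average of the attention mass each page receives, is an assumption chosen for the example; the summary does not specify PiKV's actual scoring, and all names are hypothetical.

```python
import heapq

def update_utility(old: float, attention_mass: float, decay: float = 0.9) -> float:
    """Exponential moving average of a page's attention mass (assumed utility signal)."""
    return decay * old + (1 - decay) * attention_mass

def select_pages_to_keep(utilities: dict[int, float], budget: int) -> set[int]:
    """Keep the `budget` pages with the highest utility in fast memory."""
    return set(heapq.nlargest(budget, utilities, key=utilities.get))

# Hypothetical per-page utilities after a few decoding steps.
utilities = {0: 0.10, 1: 0.80, 2: 0.50, 3: 0.05}
keep = select_pages_to_keep(utilities, budget=2)
evict = set(utilities) - keep
print(sorted(keep), sorted(evict))
```

If the utility estimate is poor, the pages in `evict` may include entries the model still needs, which is the quality risk listed under Failure Modes.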

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/NoakLiu/PiKV

Risks & Boundaries

Limitations

Prototype evaluated up to 16 GPUs; behavior on hundreds of GPUs is untested.

Paper gives analytic and design rules but provides few detailed end-to-end quantitative benchmarks in text.

When Not To Use

You are not using an MoE model (no expert sparsity gains).

Cluster has abundant GPU memory and bandwidth for full dense KV caches (simple replication is cheaper).

Failure Modes

Evicting or compressing high-utility KV pages can degrade generation quality if the scheduler or compressor misestimates utility.

Routing mistakes (wrong top-k experts) cause missed KV entries and worse attention context.
