Reduce KV memory and bandwidth for MoE inference by sharding, routing, compression, and adaptive scheduling

August 2, 2025 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu, Xuhong Wang

Why It Matters For Business

If you serve MoE models or long-context LLMs, KV cache memory and network traffic are major cost drivers. PiKV cuts per-GPU memory and inter-GPU bandwidth through sharding, selective access, and compression, which can reduce infrastructure needs or allow longer contexts on the same cluster.

Summary TLDR

PiKV is an open-source system that treats the KV cache for Mixture-of-Experts (MoE) models as a distributed, query-driven service. It shards KV storage by expert across GPUs, routes tokens to a small set of experts per query, applies modular compression, and uses activity-aware scheduling to keep only useful entries in fast memory. The prototype runs on up to 16 GPUs and integrates LoRA/PyramidKV-style compression to cut memory and communication costs.

Problem Statement

MoE models keep dense, globally shared KV caches, so memory use and cross-GPU communication grow quickly as context length and expert count increase. The result is high latency, and long-context MoE inference becomes costly or infeasible on typical GPU clusters.

Main Contribution

Expert-sharded KV layout: assign KV shards to GPUs so each device stores only a fraction of the global cache.

Sparse, cache-aware routing: route each query to a small top-k subset of experts to avoid touching the whole cache.

Modular compression pipeline: plug in multiple KV compressors (LoRA, PyramidKV, Duo) and track reconstruction error.

Query-aware stream scheduler: score KV pages by attention/use and evict low-utility pages under memory budgets.

Open-source prototype with NVIDIA kvpress integration and practical design rules for shard sizing.
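
To make the expert-sharded layout concrete, here is a minimal sketch in plain Python. It assumes a round-robin expert-to-GPU placement, which is an illustrative policy and not necessarily the one PiKV uses; the names `place_experts` and `ShardedKV` are hypothetical and do not come from the PiKV codebase.

```python
from collections import defaultdict


def place_experts(num_experts, num_gpus):
    """Map each expert to a GPU round-robin (assumed policy, for illustration)."""
    return {e: e % num_gpus for e in range(num_experts)}


class ShardedKV:
    """Toy expert-sharded KV store: a GPU holds only its own experts' entries."""

    def __init__(self, num_experts, num_gpus):
        self.placement = place_experts(num_experts, num_gpus)
        # gpu_id -> expert_id -> list of (key, value) pairs
        self.shards = defaultdict(lambda: defaultdict(list))

    def put(self, expert_id, key, value):
        """Store one KV pair on the GPU that owns the expert; return that GPU id."""
        gpu = self.placement[expert_id]
        self.shards[gpu][expert_id].append((key, value))
        return gpu

    def entries_on_gpu(self, gpu):
        """Count the KV pairs resident on one GPU."""
        return sum(len(pairs) for pairs in self.shards[gpu].values())


store = ShardedKV(num_experts=16, num_gpus=4)
for t in range(32):
    store.put(t % 16, f"k{t}", f"v{t}")
# Each of the 4 GPUs ends up holding a quarter of the 32 entries.
print([store.entries_on_gpu(g) for g in range(4)])
```

With a balanced token-to-expert assignment, each device stores roughly 1/G of the entries; skewed routing would unbalance the shards, which this sketch does not address.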

Key Findings

Expert-sharded storage changes per-device memory scaling from proportional to E·L to proportional to L/G + L/E (analytic).

Numbers: per-device memory O(E·L) → O(L/G + L/E) (Section 3.1, analytic)
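
A quick way to get a feel for the two scaling expressions is to evaluate them side by side. The sketch below assumes E is the number of experts, L the context length in tokens, and G the number of GPUs; the summary does not define these symbols, so treat that reading as an assumption. Big-O expressions drop constant factors, so the outputs are unitless proxies, not byte counts.

```python
def dense_per_device(E, L):
    """Proxy for the dense, globally shared layout: grows as E * L."""
    return E * L


def sharded_per_device(E, L, G):
    """Proxy for the expert-sharded layout: grows as L/G + L/E."""
    return L / G + L / E


E, L, G = 16, 128 * 1024, 16  # illustrative values only
dense = dense_per_device(E, L)
sharded = sharded_per_device(E, L, G)
print(dense, sharded, dense / sharded)  # 2097152 16384.0 128.0
```

The ratio indicates how the two expressions diverge as E and G grow; actual savings depend on the constants that the asymptotic form hides.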

A concrete example: a 7B MoE with 128K context and 16 experts yields a full KV cache >24 GB.

Numbers: 7B model, 128K context, 16 experts → full KV >24 GB (Intro)
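
For estimating your own model's KV footprint, the standard dense-attention formula is a reasonable starting point. The configuration below (32 layers, 32 KV heads, head dimension 128, fp16) is a hypothetical example, not the configuration behind the paper's figure, and the formula ignores any expert-specific replication, so it will not reproduce the 24 GB number exactly.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Size of a dense KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch)


# Hypothetical config: 32 layers, 32 KV heads, head_dim 128, fp16, 128K tokens.
size = kv_cache_bytes(32, 32, 128, 128 * 1024)
print(size / 2**30)  # 64.0 (GiB)
```

Grouped-query attention lowers `num_kv_heads` and shrinks the result proportionally, which is one reason published figures for models of similar size can differ.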

Prototype and feature support: PiKV was implemented and tested up to 16 GPUs and supports LoRA, PyramidKV, Duo compression.

Numbers: prototype evaluated on up to 16 GPUs; supports LoRA/PyramidKV/Duo (Abstract, Conclusion)

Who Should Care

What To Try In 7 Days

Clone the PiKV repo and run the prototype on a small multi-GPU node (link in code_urls).

Measure baseline KV memory and cross-GPU traffic for your MoE model, then enable only expert sharding and compare.

Enable one compression module (LoRA or PyramidKV) and check the trade-off between memory and decoding latency.

Agent Features

Memory

  • expert-sharded KV cache
  • circular buffer per shard

Planning

  • query-aware stream scheduling

Tool Use

  • LoRA
  • PyramidKV
  • Duo
  • NCCL/TCP for communication

Frameworks

  • PiKV

Architectures

  • MoE

Optimization Features

Token Efficiency

  • reduces token-to-KV access by routing to k ≪ E experts
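
Routing to k ≪ E experts amounts to a top-k selection over per-expert scores. The sketch below uses only the standard library; `route_top_k` is a hypothetical helper, and the scores stand in for whatever gating output a real router produces.

```python
import heapq


def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts, best first."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of experts")
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


gate_scores = [0.1, 0.7, 0.05, 0.9, 0.3]  # one score per expert (made up)
print(route_top_k(gate_scores, k=2))  # [3, 1]
```

Only the shards owned by the selected experts need to be read, so the fraction of the cache touched per query is roughly k/E.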

Infra Optimization

  • reduces cross-GPU communication and memory pressure
  • design rules for shard capacity S to minimize per-GPU memory

System Optimization

  • shard placement and circular buffers for O(1) insertions
  • asynchronous pipeline to overlap routing/compression/IO
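
The O(1) insertion property of a circular buffer can be illustrated with a fixed-size list and a moving write index. `RingShard` is a hypothetical name; this is a sketch of the data structure, not PiKV's implementation, and it evicts strictly by age.

```python
class RingShard:
    """Fixed-capacity circular buffer: O(1) insert, oldest entry overwritten."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0  # index of the next write
        self.size = 0

    def insert(self, entry):
        """Write one entry; return the entry it displaced, or None."""
        evicted = self.slots[self.head] if self.size == self.capacity else None
        self.slots[self.head] = entry
        self.head = (self.head + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)
        return evicted

    def items(self):
        """Return live entries ordered from oldest to newest."""
        start = (self.head - self.size) % self.capacity
        return [self.slots[(start + i) % self.capacity]
                for i in range(self.size)]


shard = RingShard(capacity=3)
print([shard.insert(i) for i in range(5)])  # [None, None, None, 0, 1]
print(shard.items())                        # [2, 3, 4]
```

No allocation or shifting happens on insert, which is what keeps the cost constant regardless of shard size.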

Inference Optimization

  • expert-sharded KV storage to reduce per-GPU memory
  • sparse expert routing to avoid large KV reads
  • modular compression to reduce KV bytes
  • query-aware scheduling to retain high-utility pages
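
The scheduling idea of retaining high-utility pages under a memory budget can be sketched as a greedy eviction loop. The utility values here are placeholders for whatever attention or usage score the scheduler computes; `evict_to_budget` is a hypothetical helper and the greedy policy is an assumption, not the paper's exact algorithm.

```python
def evict_to_budget(pages, budget_bytes):
    """Evict lowest-utility pages until the total size fits the budget.

    pages maps page_id -> (utility, size_bytes).
    Returns (sorted kept ids, evicted ids in eviction order).
    """
    total = sum(size for _, size in pages.values())
    kept = dict(pages)
    evicted = []
    # Visit pages from least to most useful.
    for pid in sorted(pages, key=lambda p: pages[p][0]):
        if total <= budget_bytes:
            break
        total -= pages[pid][1]
        del kept[pid]
        evicted.append(pid)
    return sorted(kept), evicted


pages = {"a": (0.9, 4), "b": (0.1, 4), "c": (0.5, 4), "d": (0.2, 4)}
print(evict_to_budget(pages, budget_bytes=8))  # (['a', 'c'], ['b', 'd'])
```

If the utility estimate is wrong, this loop discards pages that later queries need, which is the failure mode that a real scheduler has to guard against.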

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Prototype evaluated up to 16 GPUs; behavior on hundreds of GPUs is untested.
  • The paper gives analytic results and design rules but few detailed end-to-end quantitative benchmarks in the text.
  • Compression introduces reconstruction error; trade-offs are described analytically but not comprehensively validated.
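
To validate the compression trade-off on your own data, you can measure reconstruction error directly. The sketch below uses a toy uniform quantizer as a stand-in compressor; it is not LoRA, PyramidKV, or Duo, and all function names are hypothetical. The same `relative_error` check would apply to any compressor that exposes a compress and restore pair.

```python
import math


def quantize(vec, levels=256):
    """Toy uniform quantizer (stand-in compressor, for illustration only)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (levels - 1) or 1.0  # avoid zero scale for flat input
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]


def relative_error(original, restored):
    """L2 norm of the difference divided by the L2 norm of the original."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, restored)))
    den = math.sqrt(sum(a * a for a in original)) or 1.0
    return num / den


vec = [0.0, 0.25, 0.5, 1.0]
for levels in (4, 256):
    restored = dequantize(*quantize(vec, levels))
    print(levels, relative_error(vec, restored))
```

Fewer quantization levels mean fewer bytes per entry but a larger error, so logging this metric per layer or per shard gives a concrete handle on how aggressive a compression setting can be.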

When Not To Use

  • You are not using an MoE model (no expert sparsity gains).
  • Cluster has abundant GPU memory and bandwidth for full dense KV caches (simple replication is cheaper).
  • Application cannot tolerate potential fidelity loss from aggressive compression or eviction.

Failure Modes

  • Evicting or compressing high-utility KV pages can degrade generation quality if scheduler or compressor mis-estimates utility.
  • Routing mistakes (wrong top-k experts) cause missed KV entries and worse attention context.
  • Prototype performance may not scale linearly to very large clusters due to untested network or coordination cost.

Core Entities

Models

  • MoE

Metrics

  • per-GPU memory (analytic)
  • KV read/decode time vs compression ratio
  • reuse-distance

Context Entities

Models

  • 7B-scale MoE (example)