Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
If you serve MoE models or long-context LLMs, KV cache memory and network traffic are major cost drivers. PiKV helps cut per-GPU memory and inter-GPU bandwidth through sharding, selective access, and compression, which reduces infrastructure needs or enables longer contexts on the same cluster.
Summary TLDR
PiKV is an open-source system that treats the KV cache for Mixture-of-Experts (MoE) models as a distributed, query-driven service. It shards KV storage by expert across GPUs, routes tokens to a small set of experts per query, applies modular compression, and uses activity-aware scheduling to keep only useful entries in fast memory. The prototype runs on up to 16 GPUs and integrates LoRA/PyramidKV-style compression to cut memory and communication costs.
Problem Statement
MoE models keep dense, globally shared KV caches that inflate memory use and cross-GPU communication as context length and expert count grow. The result is high latency, and long-context MoE inference becomes costly or infeasible on typical GPU clusters.
Main Contribution
Expert-sharded KV layout: assign KV shards to GPUs so each device stores only a fraction of the global cache.
Sparse, cache-aware routing: route each query to a small top-k subset of experts to avoid touching the whole cache.
Modular compression pipeline: plug in multiple KV compressors (LoRA, PyramidKV, Duo) and track reconstruction error.
Query-aware stream scheduler: score KV pages by attention/use and evict low-utility pages under memory budgets.
Open-source prototype with NVIDIA kvpress integration and practical design rules for shard sizing.
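The first two contributions, expert-sharded placement and sparse top-k routing, can be illustrated with a minimal Python sketch. It is not PiKV's code: the round-robin placement rule and the function names `place_experts`, `route_top_k`, and `gpus_touched` are hypothetical, chosen only to show why routing to k experts limits how many shards a query touches.

```python
import heapq


def place_experts(num_experts: int, num_gpus: int) -> dict[int, int]:
    """Map each expert to a GPU. Round-robin is an assumed placement rule."""
    return {e: e % num_gpus for e in range(num_experts)}


def route_top_k(scores: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring experts, best first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda e: scores[e])


def gpus_touched(scores: list[float], k: int, placement: dict[int, int]) -> list[int]:
    """GPUs whose KV shards a query must read after top-k routing."""
    return sorted({placement[e] for e in route_top_k(scores, k)})


if __name__ == "__main__":
    placement = place_experts(num_experts=8, num_gpus=4)
    router_scores = [0.1, 0.7, 0.05, 0.9, 0.2, 0.3, 0.0, 0.6]  # made-up values
    print(route_top_k(router_scores, k=2))
    print(gpus_touched(router_scores, k=2, placement=placement))
```

With k = 2 the query reads shards on at most two GPUs, regardless of how many experts or GPUs the cluster holds.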
Key Findings
Expert-sharded storage changes per-device memory scaling from proportional to E·L to proportional to L/G + L/E (analytic).
A concrete example: a 7B MoE with 128K context and 16 experts yields a full KV cache >24 GB.
Prototype and feature support: PiKV was implemented and tested on up to 16 GPUs and supports LoRA, PyramidKV, and Duo compression.
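To get a feel for why long contexts produce caches in the tens of gigabytes, the standard dense KV cache size can be computed directly. The sketch below uses the usual formula (keys plus values, per layer, per KV head); the model dimensions are hypothetical and are not taken from the paper, so the result is not meant to reproduce the 24 GB figure above. The even split across GPUs is an idealized assumption, not PiKV's analytic bound.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Dense KV cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem


def per_gpu_bytes_even_split(total_bytes: int, num_gpus: int) -> float:
    """Idealized per-GPU share if the cache were split evenly (an assumption)."""
    return total_bytes / num_gpus


if __name__ == "__main__":
    # Hypothetical dimensions for illustration only.
    total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                           context_len=128 * 1024, bytes_per_elem=2)
    gib = 2 ** 30
    print(f"full cache: {total / gib:.1f} GiB")
    print(f"per GPU over 16 GPUs: {per_gpu_bytes_even_split(total, 16) / gib:.1f} GiB")
```

Swapping in your own model's layer count, KV head count, and head dimension gives the baseline to compare against after sharding or compression is enabled.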
Who Should Care
What To Try In 7 Days
Clone the PiKV repo and run the prototype on a small multi-GPU node (link in code_urls).
Measure baseline KV memory and cross-GPU traffic for your MoE model, then enable only expert sharding and compare.
Enable one compression module (LoRA or PyramidKV) and measure the trade-off between memory savings and decoding latency.
Agent Features
Memory
- expert-sharded KV cache
- circular buffer per shard
Planning
- query-aware stream scheduling
Tool Use
- LoRA
- PyramidKV
- Duo
- NCCL/TCP for communication
Frameworks
- PiKV
Architectures
- MoE
Optimization Features
Token Efficiency
- reduces token-to-KV access by routing to k ≪ E experts
Infra Optimization
- reduces cross-GPU communication and memory pressure
- design rules for shard capacity S to minimize per-GPU memory
System Optimization
- shard placement and circular buffers for O(1) insertions
- asynchronous pipeline to overlap routing/compression/IO
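A circular buffer gives O(1) insertion because a new entry simply overwrites the oldest slot once the shard is full. The following is a minimal sketch of that data structure, assuming a fixed slot count per shard; the class name `ShardRingBuffer` and its methods are hypothetical and do not reflect PiKV's actual interface.

```python
class ShardRingBuffer:
    """Fixed-capacity ring buffer for one KV shard (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._next = 0   # index of the slot the next insert will write
        self._size = 0   # number of occupied slots

    def insert(self, entry):
        """Write entry in O(1); return the overwritten entry, or None."""
        evicted = self._slots[self._next]
        self._slots[self._next] = entry
        self._next = (self._next + 1) % len(self._slots)
        self._size = min(self._size + 1, len(self._slots))
        return evicted

    def entries(self) -> list:
        """Return stored entries ordered from oldest to newest."""
        cap = len(self._slots)
        start = (self._next - self._size) % cap
        return [self._slots[(start + i) % cap] for i in range(self._size)]


if __name__ == "__main__":
    buf = ShardRingBuffer(capacity=3)
    for token_kv in ["kv0", "kv1", "kv2", "kv3"]:
        buf.insert(token_kv)
    print(buf.entries())
```

Pure overwrite-oldest behavior ignores how useful an entry is, which is why the summary pairs the buffer with a utility-aware scheduler.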
Inference Optimization
- expert-sharded KV storage to reduce per-GPU memory
- sparse expert routing to avoid large KV reads
- modular compression to reduce KV bytes
- query-aware scheduling to retain high-utility pages
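Query-aware scheduling can be pictured as scoring each KV page and evicting the lowest-scoring pages until the shard fits its memory budget. The sketch below assumes a precomputed utility score per page; how PiKV derives that score from attention or usage is not specified here, and the function name `evict_low_utility` is hypothetical.

```python
def evict_low_utility(pages: dict[str, tuple[float, int]], budget_bytes: int) -> list[str]:
    """Evict lowest-utility pages until total size fits the budget.

    pages maps page id -> (utility score, size in bytes) and is modified
    in place. Returns the ids of evicted pages in eviction order.
    """
    used = sum(size for _, size in pages.values())
    evicted = []
    # Visit pages from least to most useful.
    for page_id in sorted(pages, key=lambda p: pages[p][0]):
        if used <= budget_bytes:
            break
        used -= pages[page_id][1]
        evicted.append(page_id)
    for page_id in evicted:
        del pages[page_id]
    return evicted


if __name__ == "__main__":
    # Made-up utility scores and page sizes.
    pages = {"p0": (0.9, 4), "p1": (0.1, 4), "p2": (0.5, 4), "p3": (0.3, 4)}
    print(evict_low_utility(pages, budget_bytes=8))
    print(sorted(pages))
```

If the utility scores are wrong, this loop will discard pages that later queries need, which is the failure mode listed under Risks & Boundaries.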
Reproducibility
Code Urls
Code Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Prototype evaluated up to 16 GPUs; behavior on hundreds of GPUs is untested.
- The paper gives analytic results and design rules but few detailed end-to-end quantitative benchmarks in the text.
- Compression introduces reconstruction error; trade-offs are described analytically but not comprehensively validated.
When Not To Use
- You are not using an MoE model (no expert sparsity gains).
- Cluster has abundant GPU memory and bandwidth for full dense KV caches (simple replication is cheaper).
- Application cannot tolerate potential fidelity loss from aggressive compression or eviction.
Failure Modes
- Evicting or compressing high-utility KV pages can degrade generation quality if the scheduler or compressor misestimates utility.
- Routing mistakes (wrong top-k experts) cause missed KV entries and a poorer attention context.
- Prototype performance may not scale linearly to very large clusters because network and coordination costs at that scale are untested.
Core Entities
Models
- MoE
Metrics
- per-GPU memory (analytic)
- KV read/decode time vs compression ratio
- reuse-distance
Context Entities
Models
- 7B-scale MoE (example)

