Run full LLM inference on one wafer-scale chip, with speedups of roughly 10× to 600× over GPUs

February 6, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.8

Citation Count

0

Authors

Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, Luo Mai

Links

Abstract / PDF

Why It Matters For Business

Wafer-scale chips can cut per-request latency and cost per token for long outputs and high-throughput serving, making them worth testing for production LLM serving and for cost-sensitive, long-context workloads.

Summary TLDR

LLM inference is bandwidth-bound, and GPU-focused systems miss key trade-offs on wafer-scale chips. WaferLLM introduces PLMR, a device model (Parallelism, Latency, Memory, Routing), and builds on it with a wafer-scale parallelism policy, MeshGEMM, MeshGEMV, and a shift-based KV cache. Implemented on a Cerebras WSE-2, WaferLLM runs full LLaMA-family models on-chip. It reports large speedups (GEMV up to 606× vs a single A100 GPU; end-to-end 10–20× vs optimized A100 clusters) and about 2.5× better energy efficiency for long-output workloads. Code is open-sourced.

Problem Statement

Modern LLM runtimes are tuned for shared-memory GPUs. Wafer-scale accelerators use a massive mesh of small cores with local memory and limited routing. This creates huge non-uniform memory latencies and tight per-core memory and routing limits, so GPU-optimized designs waste bandwidth and underutilize wafer chips.

Main Contribution

PLMR device model that summarizes wafer-scale constraints: Parallelism, Latency, Memory, Routing.

Wafer-scale LLM parallelism policy for prefill/decode and a shift-based KV cache to balance cores.

MeshGEMM and MeshGEMV: new GEMM/GEMV algorithms designed to bound communication hops and routing paths.

An implementation on Cerebras WSE-2 with end-to-end evaluation and open-source code.

Key Findings

WaferLLM achieves far higher accelerator utilization than prior methods.

Numbers: up to 200× accelerator utilization vs SOTA methods

MeshGEMV vastly accelerates the memory-bound decode step.

Numbers: GEMV 606× faster and 16× more energy-efficient vs single A100 (measured)

End-to-end LLM inference is much faster on wafer-scale hardware.

Numbers: 10–20× faster than optimized SGLang on A100 GPU clusters

Prefill (GEMM) benefits from MeshGEMM.

Numbers: MeshGEMM 2–3× faster than SUMMA and Cannon in microbenchmarks

Shift-based KV cache increases token capacity dramatically.

Numbers: supports 360–385× more decode tokens than concat-based PagedAttention

Energy efficiency gains hold end-to-end for long outputs.

Numbers: ~2.5× energy efficiency vs optimal SGLang multi-GPU result

Results

GEMV latency (example)

Value: A100 single GPU 0.336 ms vs MeshGEMV 0.0012 ms

Baseline: A100 single GPU

End-to-end LLM throughput (TPR)

Value: WaferLLM TPR example 604.4 vs SGLang 31.1 (LLaMA3-8B, 4096/128)

Baseline: SGLang on A100 cluster

Prefill throughput (GEMM)

Value: WaferLLM example 27,686 vs SGLang 13,988 (LLaMA3-8B, 720×720 cores vs 1 A100)

Baseline: SGLang on A100

KV cache capacity (max decode length)

Value: Concat 382 vs Shift 137,548 tokens (LLaMA3-8B)

Baseline: concat-based PagedAttention

End-to-end energy efficiency

Value: WaferLLM ≈2–2.5× better energy efficiency vs SGLang optimal multi-GPU

Baseline: SGLang optimal multi-GPU
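
As a quick sanity check, the ratios implied by the example values above can be recomputed directly. The sketch below only divides the figures listed in this Results section; the helper name `speedup` is introduced here for illustration and does not come from the paper.

```python
def speedup(baseline, wafer, lower_is_better=False):
    """Ratio by which the wafer-scale figure beats the baseline figure."""
    return baseline / wafer if lower_is_better else wafer / baseline

# Values copied from the Results entries above.
gemv = speedup(0.336, 0.0012, lower_is_better=True)  # latency in ms
tpr = speedup(31.1, 604.4)                           # end-to-end TPR
prefill = speedup(13_988, 27_686)                    # prefill throughput
kv = speedup(382, 137_548)                           # max decode tokens

for name, ratio in [("GEMV", gemv), ("TPR", tpr), ("Prefill", prefill), ("KV", kv)]:
    print(f"{name}: {ratio:.2f}x")
```

The TPR and KV-cache ratios land inside the 10–20× and 360–385× ranges quoted under Key Findings. The GEMV example works out to about 280×, so the 606× headline presumably refers to a different configuration than this example row.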

Who Should Care

What To Try In 7 Days

Profile your workload to confirm decode (GEMV) is bandwidth-bound and benefits from on-chip memory.

Simulate shift-based KV layout on current infra to see memory balance gains before hardware access.

Run WaferLLM code or microbenchmarks on a rented WSE instance (or partner) to compare TPR and energy.
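
For the profiling step, a back-of-the-envelope roofline check is often enough to see whether decode is bandwidth-bound before measuring anything. The sketch below is a minimal estimate, not the paper's methodology. The peak compute and memory bandwidth figures are assumed nominal A100 datasheet values and the 4096×4096 layer shape is illustrative; substitute your own hardware numbers and model dimensions.

```python
def gemv_intensity(rows, cols, bytes_per_elem=2):
    """FLOPs per byte moved for y = W @ x with a rows x cols weight matrix."""
    flops = 2 * rows * cols  # one multiply and one add per weight
    # Weights are read once, plus the input and output vectors.
    bytes_moved = bytes_per_elem * (rows * cols + cols + rows)
    return flops / bytes_moved

def ridge_point(peak_flops_per_s, mem_bytes_per_s):
    """Arithmetic intensity above which a kernel becomes compute-bound."""
    return peak_flops_per_s / mem_bytes_per_s

# Assumptions: nominal FP16 tensor-core peak and HBM bandwidth for an A100.
PEAK_FLOPS = 312e12
MEM_BW = 2.0e12

intensity = gemv_intensity(4096, 4096)  # illustrative layer shape, FP16 weights
ridge = ridge_point(PEAK_FLOPS, MEM_BW)
bandwidth_bound = intensity < ridge

print(f"GEMV intensity: {intensity:.2f} FLOP/byte, ridge point: {ridge:.0f} FLOP/byte")
print("bandwidth-bound" if bandwidth_bound else "compute-bound")
```

A GEMV performs about one FLOP per byte of weights read, far below the ridge point under these assumptions, which is why decode tends to be limited by memory bandwidth rather than compute.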

Agent Features

Memory

  • local per-core SRAM constrained (48KB/core)

Tool Use

  • CSL on WSE-2
  • autotuning for core counts

Frameworks

  • SGLang
  • T10
  • Ladder

Architectures

  • mesh NoC
  • wafer-scale many-core

Optimization Features

Token Efficiency

  • KV cache shift increases decode capacity

Infra Optimization

  • two-hop interleave communication
  • K-tree allreduce

System Optimization

  • PLMR-guided partitioning
  • fine-grained replication for decode

Inference Optimization

  • MeshGEMM
  • MeshGEMV
  • shift-based KV cache
  • transpose-free placement
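
To build intuition for why a shift-based KV cache raises decode capacity, the toy simulation below compares two placement policies on a single column of cores. It is a simplified model under stated assumptions, not WaferLLM's implementation: each core holds a fixed number of KV entries, prefill entries are spread evenly, "concat" always appends new entries to the last core, and "shift" passes the oldest entry of a fuller core to its neighbour. The function name and parameters are hypothetical.

```python
from collections import deque

def max_decode_tokens(num_cores, capacity, prefill_per_core, policy):
    """Return (decode tokens that fit, final per-core layout) for a toy column of cores."""
    if policy not in ("concat", "shift"):
        raise ValueError(f"unknown policy: {policy}")
    # Prefill KV entries are spread evenly, oldest tokens on core 0.
    cores = [deque(range(c * prefill_per_core, (c + 1) * prefill_per_core))
             for c in range(num_cores)]
    next_token = num_cores * prefill_per_core
    decoded = 0
    while len(cores[-1]) < capacity:
        cores[-1].append(next_token)  # new KV entry arrives at the last core
        next_token += 1
        decoded += 1
        if policy == "shift":
            # Cascade toward core 0: a fuller core hands its oldest entry
            # to the neighbour before it, which keeps token order intact.
            for i in range(num_cores - 1, 0, -1):
                if len(cores[i]) > len(cores[i - 1]):
                    cores[i - 1].append(cores[i].popleft())
    return decoded, [list(c) for c in cores]

concat_tokens, _ = max_decode_tokens(4, 8, 6, "concat")
shift_tokens, layout = max_decode_tokens(4, 8, 6, "shift")
print(concat_tokens, shift_tokens)
```

In this toy model, concat stops as soon as the last core fills, while shift keeps going until every core is full, so the gain equals the number of cores in the column. Real gains depend on the actual layout and per-core memory budget.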

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluated only on Cerebras WSE-2; cross-vendor results may vary.
  • Per-core SRAM (48KB) forces pipeline parallelism and some underutilization.
  • Some speedups depend on on-chip NoC properties not present in all accelerators.
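
The per-core SRAM limit is easier to appreciate with rough arithmetic. The sketch below estimates aggregate SRAM for the 720×720 core grid mentioned in the Results and compares it with FP16 weight sizes. It assumes 48 KB means 48 KiB, takes nominal parameter counts from the model names, and ignores KV cache and activations, so treat the output as an order-of-magnitude guide only.

```python
KIB = 1024
GIB = 1024 ** 3

def aggregate_sram_gib(grid_rows, grid_cols, sram_kib_per_core=48):
    """Total on-chip SRAM across a core grid, in GiB."""
    return grid_rows * grid_cols * sram_kib_per_core * KIB / GIB

def weights_gib(params_billion, bytes_per_param=2):
    """Weight footprint in GiB, assuming FP16 (2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / GIB

sram = aggregate_sram_gib(720, 720)

# Assumption: nominal parameter counts read off the model names.
nominal_params_billion = {
    "LLaMA3-8B": 8, "LLaMA2-13B": 13, "CodeLLaMA-34B": 34, "QWen2-72B": 72,
}
fits = {name: weights_gib(p) <= sram for name, p in nominal_params_billion.items()}

print(f"aggregate SRAM: {sram:.2f} GiB")
for name, p in nominal_params_billion.items():
    print(f"{name}: {weights_gib(p):.1f} GiB weights, fits on grid: {fits[name]}")
```

Under these assumptions only the 8B model's weights fit within the grid's SRAM with room to spare, which is consistent with the limitation above that tight per-core memory shapes how larger models must be partitioned.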

When Not To Use

  • If you lack access to a wafer-scale NoC device.
  • For very small models or single-GPU workloads where GPUs are cheaper and easier.
  • When latency for short, single-token responses matters more than long-output throughput.

Failure Modes

  • Routing resource exhaustion if K is chosen too large for hardware limits.
  • Edge-core underutilization causing pipeline bubbles and lower throughput.
  • Autotuning mismatch for variable-length workloads leading to suboptimal core choice.

Core Entities

Models

  • LLaMA3-8B
  • LLaMA2-13B
  • CodeLLaMA-34B
  • QWen2-72B

Metrics

  • TPR (Throughput per Request)
  • TPOT (Time per Output Token)
  • GEMM/GEMV latency (ms)
  • Energy ratio (A100/WSE-2)
  • accelerator utilization

Benchmarks

  • prefill (GEMM) throughput
  • decode (GEMV) throughput
  • end-to-end Throughput per Request (TPR)
  • KV cache max decode length

Context Entities

Models

  • A100 GPU
  • NVLink/InfiniBand multi-GPU clusters
  • Cerebras WSE-3 (future mention)
  • Tesla Dojo (future comparison)

Metrics

  • per-hop latency (α)
  • per-routing latency (β)
  • routing paths per core