Faster CPU inference: SlimAttention, INT8 KV cache, and oneCCL-based distributed serving

July 10, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

0

Authors

Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

Links

Abstract / PDF

Why It Matters For Business

This paper shows practical, deployable techniques for running large LLMs on commodity x86 servers. They reduce reliance on GPUs, lower the memory barrier for long contexts and large batches, and cut latency by scaling across multiple sockets.

Summary TLDR

Practical engineering recipe to speed up LLM inference on x86 CPUs. Introduces SlimAttention (1-D attention split) for lower attention-layer latency, an INT8 KV-cache format with per-token-per-head scaling plus a hybrid INT8->FP32 kernel to cut KV memory, and a oneCCL-based distributed scheme (broadcast token IDs, reduce top-k, zero-copy) that gives multi-socket latency gains.

Problem Statement

GPUs are not always available or cost-effective. Running large LLMs on CPUs needs engineering changes to cut memory use and improve latency/throughput while preserving output quality.

Main Contribution

SlimAttention: a one-dimensional decomposition of attention that reduces per-layer latency on CPUs.

INT8 KV-cache with a unique scale per token and head plus a custom hybrid INT8->FP32 MatMul kernel using AVX512.

Distributed CPU inference design based on oneCCL: broadcast token IDs, reduce top-k, and a zero-copy communication path.

Open-source implementation (repository referenced) and tuned optimizations for common LLMs (Qwen, Llama, ChatGLM, Baichuan, OPT).

Key Findings

SlimAttention cuts per-attention-layer time by roughly 3.8× versus FlashAttention on CPU at input length 1024.

Numbers: input length 1024, FlashAttention 61.57 ms vs SlimAttention 16.02 ms (per layer, first token)
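
The idea of a one-dimensional split can be illustrated with a minimal sketch. This is not the authors' kernel: it assumes the split runs along the query rows (the summary does not say which axis), omits causal masking and multi-head handling, and uses plain Python lists rather than AVX512 code. The names attention_1d_split and block_rows are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one complete score row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_1d_split(q, k, v, block_rows=2):
    """Attention computed one block of query rows at a time.

    q: [n_q][d], k: [n_k][d], v: [n_k][d_v], all nested lists of floats.
    Assumption: the split is only along the query dimension, so each block
    holds a full-width score buffer (block_rows x n_k) and every softmax
    sees its complete row without any running rescale.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    d_v = len(v[0])
    out = []
    for start in range(0, len(q), block_rows):
        block = q[start:start + block_rows]
        # Full-width score buffer for this block of query rows.
        scores = [
            [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
            for q_row in block
        ]
        probs = [softmax(row) for row in scores]
        # Weighted sum of the value rows for each query row in the block.
        for p_row in probs:
            out.append(
                [sum(p * v_row[j] for p, v_row in zip(p_row, v)) for j in range(d_v)]
            )
    return out
```

The result does not depend on block_rows; the parameter only controls how many query rows are processed per step. The score buffer in this sketch holds block_rows × n_k entries, so it grows with context length, which is consistent with the memory-pressure risk listed under Failure Modes.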

Distributed CPU setup (oneCCL) reduced next-token generation latency for Llama2-70B.

Numbers: Llama2-70B next-token latency: 2 sockets 249.7 ms → 8 sockets 87.7 ms (2.85× speedup)

KV cache can dominate memory for large batches and long contexts; authors implement INT8 KV storage with per-token-per-head scaling and a hybrid compute kernel.

Numbers: example: Llama2-7B KV cache ≈128 GB (FP16/BF16) vs weights ≈14 GB at batch=256, seq_in=1024, seq_out=1024
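
A minimal sketch of per-token-per-head INT8 storage with on-the-fly dequantization follows. It assumes symmetric max-abs scaling, which the summary does not confirm, and it uses plain Python in place of the AVX512 kernel. The names quantize_vector and hybrid_dot are hypothetical, chosen for illustration only.

```python
def quantize_vector(vec):
    """Quantize one head's K or V vector for one token to INT8 values.

    Returns (int8_values, scale). Assumption: symmetric max-abs scaling,
    so each (token, head) pair carries exactly one float scale.
    """
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    values = [max(-127, min(127, round(x / scale))) for x in vec]
    return values, scale


def hybrid_dot(query_fp32, key_int8, scale):
    """Dot product of a float query with an INT8 key vector.

    The INT8 values are promoted to float inside the loop, and the
    per-token-per-head scale is applied once at the end.
    """
    acc = 0.0
    for q_val, k_val in zip(query_fp32, key_int8):
        acc += q_val * float(k_val)
    return acc * scale


# One cached key vector for a single (token, head) pair.
key = [0.5, -1.0, 0.25, 2.0]
query = [1.0, 2.0, -1.0, 0.5]
key_q, key_scale = quantize_vector(key)
approx = hybrid_dot(query, key_q, key_scale)
exact = sum(a * b for a, b in zip(query, key))
```

Comparing approx with exact on real activations is one way to obtain the accuracy evidence that the Limitations section notes is missing from the paper.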

Results

next-token latency (Llama2-70B)

Value: 2 sockets: 249.7 ms; 8 sockets: 87.7 ms

Baseline: 2 sockets: 249.7 ms

attention layer time (FlashAttention vs SlimAttention)

Value: input length 1024: FlashAttention 61.57 ms → SlimAttention 16.02 ms

Baseline: FlashAttention 61.57 ms

throughput (Llama2-7B, without first token)

Value: batch=256: 796.9 tokens/s; batch=512: 853.6 tokens/s

KV cache size (example)

Value: KV cache ≈128 GB (FP16/BF16) for batch=256, seq_in=1024, seq_out=1024

Baseline: model weights ≈14 GB

Who Should Care

What To Try In 7 Days

Integrate SlimAttention for CPU-backed attention layers and measure per-attention latency on your CPU fleet.

Prototype INT8 KV cache with per-token-per-head scales and the hybrid kernel to check memory savings and output quality.

Adopt the oneCCL pattern: broadcast token IDs, do per-worker top-k, then reduce; implement zero-copy writes for your comm path.
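
The per-worker top-k followed by a reduce can be simulated in a single process, as in the sketch below. It makes no oneCCL calls and assumes the vocabulary logits are sharded across workers, which the summary does not state. The names local_top_k, reduce_top_k, and simulate_step are hypothetical.

```python
import heapq


def local_top_k(shard_logits, vocab_offset, k):
    """Top-k (logit, global_token_id) pairs for one worker's vocabulary shard."""
    pairs = ((logit, vocab_offset + i) for i, logit in enumerate(shard_logits))
    return heapq.nlargest(k, pairs)


def reduce_top_k(per_worker_candidates, k):
    """Merge the workers' candidate lists into the global top-k."""
    merged = [pair for cands in per_worker_candidates for pair in cands]
    return heapq.nlargest(k, merged)


def simulate_step(shards, k):
    """Simulate one decoding step: each 'worker' owns one shard of the logits.

    Only k candidates per worker reach the reduce step, instead of the
    full logit vector.
    """
    candidates = []
    offset = 0
    for shard in shards:
        candidates.append(local_top_k(shard, offset, k))
        offset += len(shard)
    return reduce_top_k(candidates, k)


# Three simulated workers, each holding three logits of a 9-token vocabulary.
shards = [[0.1, 2.0, -1.0], [3.5, 0.0, 1.9], [0.7, 2.5, -0.3]]
top2 = simulate_step(shards, k=2)
```

In a real deployment the chosen token ID would then be broadcast to all workers for the next step, so only small integers and k candidates per worker cross the interconnect.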

Optimization Features

Token Efficiency

  • reduce KV cache memory to enable longer contexts or larger batches

Infra Optimization

  • multi-socket scaling across Xeon machines
  • custom kernels tuned for x86 AVX512

Model Optimization

  • model-specific layer optimizations for Qwen, Llama, ChatGLM, Baichuan, OPT

System Optimization

  • oneCCL-based distributed inference
  • AVX512 FMA usage
  • zero-copy integration between compute and communication

Inference Optimization

  • SlimAttention (1-D attention decomposition)
  • FlashAttention comparison
  • INT8 KV-cache with per-token-per-head scaling
  • Hybrid INT8->FP32 MatMul kernel (AVX512 intrinsics)
  • zero-copy communication buffer writes
  • broadcast token IDs + reduce top-k pattern

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Experiments were run on one CPU family (Intel Xeon 8563C); results may differ on other CPUs, especially those without AVX512.
  • INT8 KV-cache precision claims are not supported by explicit accuracy/quality numbers in the paper.
  • Benchmarks focus on Llama2 variants; generality to other model families and MoE models is untested.

When Not To Use

  • If you have abundant GPU resources that already meet latency/throughput targets.
  • On CPU hardware without AVX512 or similar vector instructions required by the custom kernels.
  • When you require formal guarantees on numeric parity; INT8 KV caching may change numerical outputs.

Failure Modes

  • INT8 KV storage can reduce numeric fidelity if per-token/head scaling is miscomputed or underflow/overflow occurs.
  • SlimAttention requires larger intermediate buffers, which may cause memory pressure at very large context sizes.
  • Distributed gains may be limited by network latency or non-optimized communication stacks despite oneCCL patterns.

Core Entities

Models

  • Llama2-7B
  • Llama2-70B
  • Qwen
  • ChatGLM
  • Baichuan
  • OPT

Metrics

  • next-token latency (ms)
  • attention layer time (ms)
  • throughput (tokens/s)
  • KV cache size (GB)

Context Entities

Models

  • Transformer decoder-only models