Faster CPU inference: SlimAttention, INT8 KV cache, and oneCCL-based distributed serving

July 10, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Paper presents concrete engineering methods with CPU benchmarks on Intel Xeon 8563C. Results show clear latency and throughput wins on tested models. Evidence is limited to one CPU family, several Llama variants, and lacks quantified accuracy/precision numbers for INT8 KV cache.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

Links

Abstract / PDF / Code

Why It Matters For Business

This paper shows practical, deployable techniques for running large LLMs on commodity x86 servers. That reduces reliance on GPUs, lowers the memory barrier to long contexts and large batches, and can cut latency by scaling across multiple sockets.

Summary TLDR

Practical engineering recipe to speed up LLM inference on x86 CPUs. Introduces SlimAttention (1-D attention split) for lower attention-layer latency, an INT8 KV-cache format with per-token-per-head scaling plus a hybrid INT8->FP32 kernel to cut KV memory, and a oneCCL-based distributed scheme (broadcast token IDs, reduce top-k, zero-copy) that gives multi-socket latency gains.

Problem Statement

GPUs are not always available or cost-effective. Running large LLMs on CPUs needs engineering changes to cut memory use and improve latency/throughput while preserving output quality.

Main Contribution

SlimAttention: a one-dimensional decomposition of attention that reduces per-layer latency on CPUs.

INT8 KV-cache with a unique scale per token and head plus a custom hybrid INT8->FP32 MatMul kernel using AVX512.
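The INT8 KV-cache idea can be sketched in a few lines of plain Python. This is an illustration only, not the authors' kernel: it assumes symmetric quantization with scale = max|x| / 127 (the summary does not state the exact scheme), and the names `quantize_vector`, `quantize_kv`, and `hybrid_dot` are hypothetical. The real kernel reportedly uses AVX512 intrinsics; the loop below only shows the arithmetic, where INT8 values are widened to floating point inside the accumulation and the scale is applied once per (token, head) vector.

```python
from typing import List, Tuple

QuantVec = Tuple[List[int], float]  # (INT8 values, FP32 scale)


def quantize_vector(values: List[float]) -> QuantVec:
    """Quantize one (token, head) vector with its own scale.

    Assumption: symmetric quantization, scale = max|x| / 127.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0] * len(values), 1.0  # all-zero vector: any scale works
    scale = max_abs / 127.0
    # Clamp to guard against rounding past the INT8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def quantize_kv(cache: List[List[List[float]]]) -> List[List[QuantVec]]:
    """cache[token][head] is a float vector; each gets a separate scale."""
    return [[quantize_vector(vec) for vec in heads] for heads in cache]


def hybrid_dot(query: List[float], key_int8: List[int], scale: float) -> float:
    """FP32 query times INT8 key, without storing a dequantized copy."""
    acc = 0.0
    for q, k in zip(query, key_int8):
        acc += q * float(k)  # widen INT8 to float inside the loop
    return acc * scale       # apply the per-token-per-head scale once
```

Comparing `hybrid_dot` against the same dot product on the original float vectors is a quick way to check how much error the quantization introduces on your own data.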

Key Findings

SlimAttention cuts per-attention-layer time by roughly 3.8× versus FlashAttention on CPU at 1024 input tokens.

Numbers: Input=1024: Flash 61.57 ms vs Slim 16.02 ms (per layer, first token)

Practical Use: Use SlimAttention for CPU inference to cut attention-layer latency (≈3.8× at 1024 tokens); trade-off: needs a larger intermediate buffer but avoids redundant work.

Evidence Ref: Table 3
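One possible reading of the "1-D attention split" is sketched below. It assumes the split runs along the query rows only, so each block holds full-width score rows and softmax runs in a single pass over each row. That would match the stated trade-off: a larger intermediate buffer (`block_rows × seq_len` scores) but no rescaling work across key tiles. The function name `attention_1d_split` and the `block_rows` parameter are hypothetical, and causal masking is omitted for brevity.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the row max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_1d_split(q: Matrix, k: Matrix, v: Matrix, block_rows: int = 2) -> Matrix:
    """Single-head attention, split along query rows only (assumed scheme)."""
    inv_sqrt_d = 1.0 / math.sqrt(len(q[0]))
    out: Matrix = []
    for start in range(0, len(q), block_rows):
        q_block = q[start:start + block_rows]
        # Intermediate buffer: len(q_block) x len(k) scores, full row width.
        scores = [
            [sum(a * b for a, b in zip(q_row, k_row)) * inv_sqrt_d for k_row in k]
            for q_row in q_block
        ]
        for p_row in (softmax(r) for r in scores):
            # Weighted sum of value rows for this query row.
            out.append([
                sum(p * v_row[j] for p, v_row in zip(p_row, v))
                for j in range(len(v[0]))
            ])
    return out
```

Because each query block is independent, the block size changes only the buffer footprint, not the result; in a threaded CPU implementation the blocks would be natural units of parallel work.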

Distributed CPU setup (oneCCL) reduced next-token generation latency for Llama2-70B.

Numbers: Llama2-70B next-token latency: 2 sockets 249.7 ms → 8 sockets 87.7 ms (2.85× speedup)

Practical Use: Scale across machines/sockets using the authors' oneCCL pattern (broadcast IDs, reduce top-k) to cut end-to-end latency for very large models.

Evidence Ref: Table 2
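The "broadcast IDs, reduce top-k" pattern can be simulated without any communication library. The sketch below makes no oneCCL calls; plain Python lists stand in for the collectives. It assumes each worker holds a contiguous shard of the vocabulary logits, which the summary does not spell out, and all function names are hypothetical. The point it illustrates is that each worker ships only `k` candidates per step rather than its full logit shard, and that the next step starts from a single token ID rather than a larger tensor.

```python
import heapq
from typing import List, Tuple

Candidate = Tuple[float, int]  # (logit, global token id)


def local_top_k(shard_logits: List[float], shard_offset: int, k: int) -> List[Candidate]:
    """Each worker picks its k best candidates from its own vocab shard."""
    return heapq.nlargest(
        k, ((logit, shard_offset + i) for i, logit in enumerate(shard_logits))
    )


def reduce_top_k(per_worker: List[List[Candidate]], k: int) -> List[Candidate]:
    """Stand-in for the reduce step: merge k candidates per worker."""
    merged = [cand for cands in per_worker for cand in cands]
    return heapq.nlargest(k, merged)


def broadcast_token_id(token_id: int, num_workers: int) -> List[int]:
    """Stand-in for the broadcast step: every worker receives one integer."""
    return [token_id] * num_workers
```

With three workers and shards of three logits each, `reduce_top_k([local_top_k(s, i * 3, 2) for i, s in enumerate(shards)], 2)` returns the same two winners as a top-2 over the concatenated logits, because a global top-k entry is always in its own shard's top-k.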

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| next-token latency (Llama2-70B) | 2 sockets: 249.7 ms; 8 sockets: 87.7 ms | 2 sockets: 249.7 ms | 2.85× faster (8 sockets vs 2 sockets) | Llama2-70B, input=1024, output=128, batch=1 | Table 2 in paper | Table 2 |
| attention layer time (FlashAttention vs SlimAttention) | Input=1024: Flash 61.57 ms → Slim 16.02 ms | FlashAttention 61.57 ms | ≈3.84× faster (per attention layer, first token) | Llama2-7B, batch=1, measuring first-token attention time | Table 3 in paper | Table 3 |

What To Try In 7 Days

Integrate SlimAttention for CPU-backed attention layers and measure per-attention latency on your CPU fleet.

Prototype INT8 KV cache with per-token-per-head scales and the hybrid kernel to check memory savings and output quality.

Adopt the oneCCL pattern: broadcast token IDs, do per-worker top-k, then reduce; implement zero-copy writes for your comm path.
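For the measurement side of these experiments, a small timing harness is enough to get started. The sketch below is generic and not tied to any of the paper's code; `measure_latency_ms` and `speedup` are hypothetical helper names, and the callable you pass in would wrap whatever attention layer or generation step you are testing.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(fn: Callable[[], object], warmup: int = 3, repeats: int = 10) -> Dict[str, float]:
    """Time a zero-argument callable and report latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches and any lazy initialization before timing
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio above 1.0 means the candidate is faster than the baseline."""
    return baseline_ms / candidate_ms
```

Plugging in the reported figures, `speedup(61.57, 16.02)` is about 3.84 and `speedup(249.7, 87.7)` is about 2.85, which matches the deltas quoted in the Results section. Reporting the median alongside min and max helps expose noisy runs on shared machines.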

Optimization Features

Token Efficiency
reduce KV cache memory to enable longer contexts or larger batches
Infra Optimization
multi-socket scaling across Xeon machines; custom kernels tuned for x86 AVX512
Model Optimization
model-specific layer optimizations for Qwen, Llama, ChatGLM, Baichuan, OPT
System Optimization
oneCCL-based distributed inference; AVX512 FMA usage; zero-copy integration between compute and communication
Inference Optimization
SlimAttention (1-D attention decomposition); FlashAttention comparison; INT8 KV-cache with per-token-per-head scaling; hybrid INT8->FP32 MatMul kernel (AVX512 intrinsics); zero-copy communication buffer writes; broadcast token IDs + reduce top-k pattern
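To get a feel for how much KV memory an INT8 cache could save, a back-of-the-envelope estimate helps. The sketch below is an assumption-laden illustration, not a figure from the paper: it assumes an FP16 baseline cache, one FP32 scale (4 bytes) per token per head, and the public Llama2-7B shape (32 layers, 32 heads, head dimension 128), none of which comes from this summary. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch: int,
    bytes_per_value: int,
    scale_bytes: int = 0,
) -> int:
    """Estimate KV-cache size; scale_bytes is the per-(token, head) overhead."""
    # Factor 2 covers keys and values; one vector per layer, head, token, sequence.
    num_vectors = 2 * num_layers * num_kv_heads * seq_len * batch
    return num_vectors * (head_dim * bytes_per_value + scale_bytes)


# Assumed Llama2-7B shape: 32 layers, 32 heads, head_dim 128; 4096-token context.
fp16_bytes = kv_cache_bytes(32, 32, 128, 4096, 1, bytes_per_value=2)
int8_bytes = kv_cache_bytes(32, 32, 128, 4096, 1, bytes_per_value=1, scale_bytes=4)
ratio = int8_bytes / fp16_bytes  # fraction of the FP16 footprint that remains
```

Under these assumptions the FP16 cache is 2 GiB and the INT8 cache is about 1.03 GiB, so the per-vector scales cost only around 3% on top of the ideal halving. Actual savings depend on the baseline precision your deployment uses.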

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Experiments were run on a single CPU family (Intel Xeon 8563C); results may differ on other CPUs, especially those without AVX512.

INT8 KV-cache precision claims are not supported by explicit accuracy/quality numbers in the paper.

When Not To Use

If you have abundant GPU resources that already meet latency/throughput targets.

On CPU hardware without AVX512 or similar vector instructions required by the custom kernels.

Failure Modes

INT8 KV storage can reduce numeric fidelity if per-token/head scaling is miscomputed or underflow/overflow occurs.

SlimAttention requires larger intermediate buffers, which may cause memory pressure at very large context sizes.

Core Entities

Models

Llama2-7B, Llama2-70B, Qwen, ChatGLM, Baichuan, OPT

Metrics

next-token latency (ms), attention layer time (ms), throughput (tokens/s), KV cache size (GB)

Context Entities

Models

Transformer decoder-only models