Faster CPU inference: SlimAttention, INT8 KV cache, and oneCCL-based distributed serving

Overview

Decision Snapshot: Needs Validation

The paper presents concrete engineering methods with CPU benchmarks on an Intel Xeon 8563C. Results show clear latency and throughput wins on the tested models. Evidence is limited to one CPU family and several Llama variants, and the paper gives no quantified accuracy or precision numbers for the INT8 KV cache.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

Links

Abstract / PDF / Code

Why It Matters For Business

This paper shows practical, deployable techniques for running large LLMs on commodity x86 servers. That reduces reliance on GPUs, lowers memory barriers for long contexts and large batches, and can cut latency through multi-socket scaling.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager, Founder

Summary TLDR

Practical engineering recipe to speed up LLM inference on x86 CPUs. Introduces SlimAttention (1-D attention split) for lower attention-layer latency, an INT8 KV-cache format with per-token-per-head scaling plus a hybrid INT8->FP32 kernel to cut KV memory, and a oneCCL-based distributed scheme (broadcast token IDs, reduce top-k, zero-copy) that gives multi-socket latency gains.

Problem Statement

GPUs are not always available or cost-effective. Running large LLMs on CPUs needs engineering changes to cut memory use and improve latency/throughput while preserving output quality.

Main Contribution

SlimAttention: a one-dimensional decomposition of attention that reduces per-layer latency on CPUs.

INT8 KV-cache with a unique scale per token and head plus a custom hybrid INT8->FP32 MatMul kernel using AVX512.
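The per-token-per-head scaling idea can be sketched in plain Python. This is an illustration only, not the authors' AVX512 kernel: the symmetric [-127, 127] range, the float scale, the fallback scale of 1.0 for an all-zero vector, and the names quantize_head, dequantize_head, and hybrid_dot are all assumptions made for this example.

```python
def quantize_head(vec):
    """Quantize one token's vector for one head to INT8 with its own scale."""
    max_abs = max(abs(x) for x in vec)
    # Assumption: symmetric quantization; an all-zero vector gets scale 1.0.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale


def dequantize_head(q, scale):
    """Recover approximate float values from INT8 values and their scale."""
    return [v * scale for v in q]


def hybrid_dot(q_int8, scale, query_fp32):
    """Dot product of an INT8 key with a float query.

    The INT8 values are widened to float inside the loop and the scale is
    applied once at the end, mirroring the INT8->FP32 idea in scalar form.
    """
    acc = 0.0
    for k, x in zip(q_int8, query_fp32):
        acc += k * x
    return acc * scale


# kv[token][head] is one float vector; the two tokens have very different ranges.
kv = [
    [[0.5, -1.0, 0.25, 2.0], [0.01, 0.02, -0.03, 0.0]],
    [[100.0, -50.0, 25.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
]
cache = [[quantize_head(head) for head in token] for token in kv]

query = [1.0, 2.0, 3.0, 4.0]
q0, s0 = cache[0][0]
score = hybrid_dot(q0, s0, query)
```

Because every (token, head) pair carries its own scale, the large values in the second token do not flatten the small values in the first. With one shared scale, the vector [0.01, 0.02, -0.03, 0.0] would round to all zeros.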

Key Findings

SlimAttention cuts per-attention-layer time by about 3.8× versus FlashAttention on CPU at 1024 input tokens.

Numbers: Input=1024: Flash 61.57 ms vs Slim 16.02 ms (per layer, first token)

Practical Use: Use SlimAttention for CPU inference to cut attention-layer latency (≈3.8× at 1024 tokens); trade-off: needs larger intermediate buffer but avoids redundant work.

Evidence Ref: Table 3
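To make the "1-D split" concrete, here is a minimal pure-Python sketch. It assumes the split runs along the query (row) dimension: each block of query rows is scored against all keys, so the softmax is computed once per row with no rescaling across tiles. The function names, the block_rows parameter, and the omission of a causal mask are simplifications for this example, not details taken from the paper or from xFasterTransformer.

```python
import math


def softmax(row):
    """Numerically stable softmax over one full score row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_block(q_rows, K, V, scale):
    """Attention for one block of query rows against ALL keys."""
    out = []
    for q in q_rows:
        # The whole score row is materialized here; this is the larger
        # intermediate buffer that the 1-D split trades for simplicity.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        probs = softmax(scores)
        out.append([sum(p * v[j] for p, v in zip(probs, V))
                    for j in range(len(V[0]))])
    return out


def slim_attention(Q, K, V, block_rows=2):
    """Split the work along the query dimension only (hypothetical sketch)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for start in range(0, len(Q), block_rows):
        # Each block is independent, so blocks could go to different threads.
        out.extend(attention_block(Q[start:start + block_rows], K, V, scale))
    return out


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = slim_attention(Q, K, V, block_rows=2)
```

The block size changes only how rows are grouped, not the result: any block_rows value yields the same output. That is the property that lets the row blocks be distributed across cores without a merge step.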

Distributed CPU setup (oneCCL) reduced next-token generation latency for Llama2-70B.

Numbers: Llama2-70B next-token latency: 2 sockets 249.7 ms → 8 sockets 87.7 ms (2.85× speedup)

Practical Use: Scale across machines/sockets using the authors' oneCCL pattern (broadcast IDs, reduce top-k) to cut end-to-end latency for very large models.

Evidence Ref: Table 2
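The reported speedups can be checked with a few lines of arithmetic. The scaling-efficiency figure below is not stated in the paper; it is derived here under the assumption that ideal scaling is linear in socket count, and the helper names are hypothetical.

```python
def speedup(baseline_ms, new_ms):
    """How many times faster the new latency is than the baseline."""
    return baseline_ms / new_ms


def scaling_efficiency(baseline_ms, new_ms, baseline_workers, new_workers):
    """Achieved speedup divided by the ideal (linear) speedup."""
    ideal = new_workers / baseline_workers
    return speedup(baseline_ms, new_ms) / ideal


# Llama2-70B next-token latency, 2 sockets vs 8 sockets (Table 2).
dist_speedup = speedup(249.7, 87.7)                       # about 2.85
dist_efficiency = scaling_efficiency(249.7, 87.7, 2, 8)   # about 0.71

# Per-layer attention time at input=1024, FlashAttention vs SlimAttention (Table 3).
attn_speedup = speedup(61.57, 16.02)                      # about 3.84
```

Going from 2 to 8 sockets is 4× the hardware for a 2.85× latency gain, or roughly 71% of linear scaling. That is worth weighing against cost when sizing a deployment.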

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
next-token latency (Llama2-70B)	2 sockets: 249.7 ms; 8 sockets: 87.7 ms	2 sockets: 249.7 ms	2.85× faster (8 sockets vs 2 sockets)	Llama2-70B, input=1024, output=128, batch=1	Table 2 in paper	Table 2
attention layer time (FlashAttention vs SlimAttention)	Input=1024: Flash 61.57 ms → Slim 16.02 ms	FlashAttention 61.57 ms	≈3.84× faster (per attention layer, first token)	Llama2-7B, batch=1, measuring first-token attention time	Table 3 in paper	Table 3

What To Try In 7 Days

Integrate SlimAttention for CPU-backed attention layers and measure per-attention latency on your CPU fleet.

Prototype INT8 KV cache with per-token-per-head scales and the hybrid kernel to check memory savings and output quality.

Adopt the oneCCL pattern: broadcast token IDs, do per-worker top-k, then reduce; implement zero-copy writes for your comm path.
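The "per-worker top-k, then reduce" step above can be simulated without any communication library. The sketch assumes each worker holds a contiguous shard of the vocabulary logits; the shard layout, the function names, and the greedy pick at the end are illustrative assumptions, and no real oneCCL call is shown.

```python
import heapq


def local_topk(logits_shard, vocab_offset, k):
    """Top-k (logit, global_token_id) pairs from one worker's vocabulary shard."""
    candidates = ((logit, vocab_offset + i)
                  for i, logit in enumerate(logits_shard))
    return heapq.nlargest(k, candidates)


def reduce_topk(per_worker_candidates, k):
    """Merge every worker's k candidates and keep the global top-k."""
    merged = [c for cands in per_worker_candidates for c in cands]
    return heapq.nlargest(k, merged)


# Three simulated workers, each holding 4 of the 12 vocabulary logits.
shards = [
    [0.1, 2.5, -1.0, 0.7],
    [3.2, 0.0, 1.9, -0.4],
    [0.3, 2.9, 0.2, 1.1],
]
k = 3
shard_size = len(shards[0])

per_worker = [local_topk(shard, w * shard_size, k)
              for w, shard in enumerate(shards)]
global_topk = reduce_topk(per_worker, k)

# Greedy choice for illustration; only this integer ID would need to be
# broadcast to the workers for the next step, not an embedding vector.
next_token_id = global_topk[0][1]
```

Each worker sends k candidates instead of its whole logit shard, so the data exchanged grows with k times the number of workers rather than with vocabulary size.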

Optimization Features

Token Efficiency

reduce KV cache memory to enable longer contexts or larger batches
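A rough size estimate shows why the INT8 format matters. The sketch uses the public Llama2-7B shape (32 layers, 32 KV heads, head dimension 128). The FP16 baseline and the 4-byte scale stored per token per head are assumptions for this calculation, not figures from the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, batch,
                   bytes_per_elem, scale_bytes=0):
    """Bytes needed for keys plus values across all layers.

    scale_bytes is the per-token-per-head overhead for a quantization scale.
    """
    vectors = 2 * layers * kv_heads * tokens * batch  # 2 = keys and values
    return vectors * (head_dim * bytes_per_elem + scale_bytes)


MIB = 2 ** 20
shape = dict(layers=32, kv_heads=32, head_dim=128, tokens=1024, batch=1)

fp16_bytes = kv_cache_bytes(**shape, bytes_per_elem=2)
int8_bytes = kv_cache_bytes(**shape, bytes_per_elem=1, scale_bytes=4)

print(f"FP16: {fp16_bytes / MIB:.0f} MiB")
print(f"INT8 + scales: {int8_bytes / MIB:.0f} MiB")
print(f"ratio: {int8_bytes / fp16_bytes:.3f}")
```

Under these assumptions the scales add about 3% on top of the INT8 payload, so the cache comes out at a little over half the FP16 size for the same context length and batch.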

Infra Optimization

multi-socket scaling across Xeon machines

custom kernels tuned for x86 AVX512

Model Optimization

model-specific layer optimizations for Qwen, Llama, ChatGLM, Baichuan, OPT

System Optimization

oneCCL-based distributed inference

AVX512 FMA usage

zero-copy integration between compute and communication

Inference Optimization

SlimAttention (1-D attention decomposition)

FlashAttention comparison

INT8 KV-cache with per-token-per-head scaling

Hybrid INT8->FP32 MatMul kernel (AVX512 intrinsics)

zero-copy communication buffer writes

broadcast token IDs + reduce top-k pattern

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/intel/xFasterTransformer

Risks & Boundaries

Limitations

Experiments were run on one CPU family (Intel Xeon 8563C); results may differ on other CPUs, especially those without AVX512.

INT8 KV-cache precision claims are not supported by explicit accuracy/quality numbers in the paper.

When Not To Use

If you have abundant GPU resources that already meet latency/throughput targets.

On CPU hardware without AVX512 or similar vector instructions required by the custom kernels.
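A quick pre-flight check for AVX512 support can be sketched as follows. It assumes a Linux x86 host, where /proc/cpuinfo lists CPU features on a line beginning with "flags" and AVX-512 Foundation appears as avx512f. The function names and the sample strings are illustrative.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def has_avx512f(cpuinfo_text):
    """True if the AVX-512 Foundation flag is present."""
    return "avx512f" in cpu_flags(cpuinfo_text)


# Illustrative samples; on a real Linux host, read the text with
# open("/proc/cpuinfo").read() instead.
sample_with = "processor\t: 0\nflags\t\t: fpu sse2 avx2 avx512f avx512bw\n"
sample_without = "processor\t: 0\nflags\t\t: fpu sse2 avx2\n"

supported = has_avx512f(sample_with)
unsupported = has_avx512f(sample_without)
```

This only confirms that the instruction set is advertised. It says nothing about how well the custom kernels perform on a given CPU generation, which still needs benchmarking.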

Failure Modes

INT8 KV storage can reduce numeric fidelity if per-token/head scaling is miscomputed or underflow/overflow occurs.

SlimAttention requires larger intermediate buffers, which may cause memory pressure at very large context sizes.

Core Entities

Models

Llama2-7B, Llama2-70B, Qwen, ChatGLM, Baichuan, OPT

Metrics

next-token latency (ms), attention layer time (ms), throughput (tokens/s), KV cache size (GB)

Context Entities

Models

Transformer decoder-only models
