Overview
The paper presents concrete engineering methods with CPU benchmarks on the Intel Xeon 8563C. Results show clear latency and throughput wins on the tested models. Evidence is limited to one CPU family and several Llama variants, and the paper gives no quantified accuracy or precision numbers for the INT8 KV cache.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.70
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/4
Reproducibility
Status: Partial assets available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
This paper shows practical, deployable techniques for running large LLMs on commodity x86 servers. That reduces reliance on GPUs, lowers memory barriers for long contexts and large batches, and can cut latency by scaling across multiple sockets.
Summary TLDR
Practical engineering recipe to speed up LLM inference on x86 CPUs. Introduces SlimAttention (1-D attention split) for lower attention-layer latency, an INT8 KV-cache format with per-token-per-head scaling plus a hybrid INT8->FP32 kernel to cut KV memory, and a oneCCL-based distributed scheme (broadcast token IDs, reduce top-k, zero-copy) that gives multi-socket latency gains.
Problem Statement
GPUs are not always available or cost-effective. Running large LLMs on CPUs needs engineering changes to cut memory use and improve latency/throughput while preserving output quality.
Main Contribution
SlimAttention: a one-dimensional decomposition of attention that reduces per-layer latency on CPUs.
INT8 KV-cache with a unique scale per token and head plus a custom hybrid INT8->FP32 MatMul kernel using AVX512.
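The per-token, per-head scaling idea can be sketched in NumPy as follows. This is a minimal illustration, not the paper's kernel: the function names, the symmetric max-abs scaling rule, and the `[tokens, heads, head_dim]` layout are assumptions, and the AVX512 implementation is replaced by plain array operations.

```python
import numpy as np

def quantize_kv(kv: np.ndarray):
    """Quantize a [tokens, heads, head_dim] FP32 tensor to INT8.

    Assumption: symmetric max-abs scaling, one scale per (token, head) pair.
    """
    max_abs = np.abs(kv).max(axis=-1, keepdims=True)          # [tokens, heads, 1]
    # Avoid division by zero for all-zero vectors.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale

def scores_hybrid(query: np.ndarray, k_q: np.ndarray, k_scale: np.ndarray) -> np.ndarray:
    """Hybrid INT8 -> FP32 dot products.

    query: [heads, head_dim] FP32, k_q: [tokens, heads, head_dim] INT8.
    Keys are widened to FP32 on the fly, and the scale is applied once
    per (token, head) after the dot product. Returns [tokens, heads].
    """
    raw = np.einsum("hd,thd->th", query, k_q.astype(np.float32))
    return raw * k_scale[..., 0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.standard_normal((16, 4, 64), dtype=np.float32)
    k_q, k_scale = quantize_kv(keys)
    print("FP32 bytes:", keys.nbytes, "INT8 + scales bytes:", k_q.nbytes + k_scale.nbytes)
    print("max abs error:", float(np.abs(dequantize_kv(k_q, k_scale) - keys).max()))
```

Because the paper reports no accuracy numbers for the INT8 cache, a check like the error printout above, extended to end-to-end output quality, is worth running on your own workloads.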
Key Findings
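SlimAttention cuts first-token attention-layer time on CPU from 61.57 ms (FlashAttention) to 16.02 ms, about 3.84× faster, for Llama2-7B at input length 1024 and batch size 1 (Table 3).
A distributed CPU setup using oneCCL reduced next-token latency for Llama2-70B from 249.7 ms on 2 sockets to 87.7 ms on 8 sockets, a 2.85× improvement (Table 2).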
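A one-dimensional attention split can be illustrated with the sketch below. The summary describes SlimAttention only as a 1-D decomposition, so this example assumes the split runs along query rows: each block computes its scores against all keys, applies a full softmax, and multiplies by the values, with no cross-block rescaling. Function names and the `block_rows` parameter are hypothetical, and causal masking is omitted for brevity.

```python
import numpy as np

def softmax_rows(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def reference_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Unsplit single-head attention, used as the correctness baseline."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax_rows(scores) @ v

def row_split_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                        block_rows: int = 64) -> np.ndarray:
    """Attention split in one dimension (query rows only).

    Each block holds a [block_rows, seq_len] score buffer, so the
    intermediate buffer grows with context length.
    """
    out = np.empty((q.shape[0], v.shape[1]), dtype=q.dtype)
    scale = 1.0 / np.sqrt(q.shape[-1])
    for start in range(0, q.shape[0], block_rows):
        stop = min(start + block_rows, q.shape[0])
        block_scores = (q[start:stop] @ k.T) * scale   # all keys for this row block
        out[start:stop] = softmax_rows(block_scores) @ v
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
    print(np.allclose(row_split_attention(q, k, v), reference_attention(q, k, v)))
```

In this sketch the row blocks are independent, so they could be distributed across cores; whether the paper parallelizes this way is not stated in the summary.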
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| next-token latency (Llama2-70B) | 2 sockets: 249.7 ms; 8 sockets: 87.7 ms | 2 sockets: 249.7 ms | 2.85× faster (8 sockets vs 2 sockets) | Llama2-70B, input=1024, output=128, batch=1 | Table 2 in paper | Table 2 |
| attention layer time (FlashAttention vs SlimAttention) | Input=1024: Flash 61.57 ms → Slim 16.02 ms | FlashAttention 61.57 ms | ≈3.84× faster (per attention layer, first token) | Llama2-7B, batch=1, measuring first-token attention time | Table 3 in paper | Table 3 |
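The Delta column can be rechecked from the reported timings with a few lines of arithmetic. The helper name `speedup` is illustrative, and the scaling-efficiency figure is derived here rather than reported in the paper.

```python
def speedup(baseline_ms: float, new_ms: float) -> float:
    """Ratio of baseline time to new time (values above 1 mean faster)."""
    return baseline_ms / new_ms

# Table 2: Llama2-70B next-token latency, 2 sockets vs 8 sockets.
socket_speedup = speedup(249.7, 87.7)
# Table 3: Llama2-7B first-token attention time, FlashAttention vs SlimAttention.
attention_speedup = speedup(61.57, 16.02)
# Derived: fraction of ideal linear scaling when going from 2 to 8 sockets (4x hardware).
scaling_efficiency = socket_speedup / (8 / 2)

if __name__ == "__main__":
    print(f"socket speedup: {socket_speedup:.2f}x")
    print(f"attention speedup: {attention_speedup:.2f}x")
    print(f"scaling efficiency: {scaling_efficiency:.0%}")
```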
What To Try In 7 Days
Integrate SlimAttention for CPU-backed attention layers and measure per-attention latency on your CPU fleet.
Prototype INT8 KV cache with per-token-per-head scales and the hybrid kernel to check memory savings and output quality.
Adopt the oneCCL pattern: broadcast token IDs, do per-worker top-k, then reduce; implement zero-copy writes for your comm path.
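The per-worker top-k followed by a reduce can be prototyped in a single process before touching any communication library. The sketch below makes no oneCCL calls. It assumes each worker holds the logits for one contiguous slice of the vocabulary (an assumption; the summary only names the steps), and all function names are hypothetical.

```python
import numpy as np

def local_topk(logits_shard: np.ndarray, offset: int, k: int):
    """Return up to k (logit, global_token_id) candidates from one worker's shard."""
    k = min(k, logits_shard.shape[0])
    idx = np.argpartition(logits_shard, -k)[-k:]       # unordered top-k within the shard
    return [(float(logits_shard[i]), int(i) + offset) for i in idx]

def reduce_topk(candidate_lists, k: int):
    """Merge per-worker candidates and keep the global top-k, best first."""
    merged = [c for cands in candidate_lists for c in cands]
    merged.sort(key=lambda c: c[0], reverse=True)
    return merged[:k]

def simulate_step(full_logits: np.ndarray, num_workers: int, k: int):
    """Simulate one decoding step with the vocabulary sharded across workers."""
    shards = np.array_split(full_logits, num_workers)
    offsets = np.cumsum([0] + [len(s) for s in shards[:-1]])
    candidates = [local_topk(s, int(o), k) for s, o in zip(shards, offsets)]
    # Only num_workers * k candidates cross the "network", not the full logits.
    return reduce_topk(candidates, k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.standard_normal(32000)
    top = simulate_step(logits, num_workers=8, k=5)
    print([token_id for _, token_id in top])
```

The point of the pattern is that each worker exchanges k candidates instead of its full logit slice, and the chosen token can then be broadcast as a single ID rather than an embedding.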
Risks & Boundaries
Limitations
Experiments were run on one CPU family (Intel Xeon 8563C); results may differ on other CPUs, especially those without AVX512.
INT8 KV-cache precision claims are not supported by explicit accuracy/quality numbers in the paper.
When Not To Use
If you have abundant GPU resources that already meet latency/throughput targets.
On CPU hardware without AVX512 or similar vector instructions required by the custom kernels.
Failure Modes
INT8 KV storage can reduce numeric fidelity if per-token/head scaling is miscomputed or underflow/overflow occurs.
SlimAttention requires larger intermediate buffers, which may cause memory pressure at very large context sizes.

