Overview
Production Readiness: 0.7
Novelty Score: 0.5
Cost Impact Score: 0.7
Citation Count: 0
Why It Matters For Business
This paper shows practical, deployable techniques for running large LLMs on commodity x86 servers. That reduces reliance on GPUs, lowers the memory barrier for long contexts and large batches, and can cut latency by scaling inference across multiple CPU sockets.
Summary TLDR
Practical engineering recipe to speed up LLM inference on x86 CPUs. Introduces SlimAttention (1-D attention split) for lower attention-layer latency, an INT8 KV-cache format with per-token-per-head scaling plus a hybrid INT8->FP32 kernel to cut KV memory, and a oneCCL-based distributed scheme (broadcast token IDs, reduce top-k, zero-copy) that gives multi-socket latency gains.
Problem Statement
GPUs are not always available or cost-effective. Running large LLMs on CPUs needs engineering changes to cut memory use and improve latency/throughput while preserving output quality.
Main Contribution
SlimAttention: a one-dimensional decomposition of attention that reduces per-layer latency on CPUs.
INT8 KV-cache with a unique scale per token and head plus a custom hybrid INT8->FP32 MatMul kernel using AVX512.
Distributed CPU inference design based on oneCCL: broadcast token IDs, reduce top-k, and a zero-copy communication path.
Open-source implementation (repository referenced) and tuned optimizations for common LLMs (Qwen, Llama, ChatGLM, Baichuan, OPT).
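To make the INT8 KV-cache idea concrete, here is a minimal pure-Python sketch of symmetric quantization with one scale per (token, head) vector, plus a dot product that reads INT8 keys and accumulates in floating point. It is an illustration under stated assumptions, not the paper's AVX512 kernel: the function names are hypothetical, and the symmetric max-abs/127 scale is an assumed scheme that the summary does not specify.

```python
from typing import List, Tuple


def quantize_head_vector(values: List[float]) -> Tuple[List[int], float]:
    """Quantize one (token, head) vector to INT8 with its own scale (assumed symmetric scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero vector so the scale is never zero.
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    # Clamp to the symmetric INT8 range after rounding.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_head_vector(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate FP32 values from the INT8 codes."""
    return [q * scale for q in quantized]


def hybrid_dot(query_fp32: List[float], key_int8: List[int], key_scale: float) -> float:
    """Dot product of an FP32 query with an INT8 key: accumulate in float, apply the scale once."""
    acc = 0.0
    for q, k in zip(query_fp32, key_int8):
        acc += q * float(k)
    return acc * key_scale


# Hypothetical example data: one key vector for a single token and head.
key = [0.5, -1.0, 0.25, 2.0]
query = [1.0, 0.0, -1.0, 0.5]
key_q, key_s = quantize_head_vector(key)
approx = hybrid_dot(query, key_q, key_s)
exact = sum(q * k for q, k in zip(query, key))
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Applying the scale once after the accumulation, rather than dequantizing every element first, is the kind of saving a hybrid INT8->FP32 kernel can exploit; whether the authors' kernel is organized this way is not stated in this summary.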
Key Findings
SlimAttention greatly lowers per-attention-layer time versus FlashAttention on CPU.
Distributed CPU setup (oneCCL) reduced next-token generation latency for Llama2-70B.
The KV cache can dominate memory use for large batches and long contexts; the authors store it in INT8 with per-token-per-head scaling and read it through a hybrid compute kernel.
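A short back-of-the-envelope calculation shows why the KV cache matters. The sketch below uses the published Llama2-7B shape (32 layers, 32 KV heads, head dimension 128); the batch size, context length, FP16 baseline, and 4-byte scale per token and head are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int,
                   scale_bytes_per_head: int = 0) -> int:
    """Bytes for keys and values across all layers, plus optional per-token-per-head scales."""
    # Factor 2 covers the separate key and value tensors.
    per_token_per_layer = 2 * num_kv_heads * (head_dim * bytes_per_value + scale_bytes_per_head)
    return num_layers * per_token_per_layer * seq_len * batch_size


GIB = 2 ** 30

# Llama2-7B shape; batch 8 and context 4096 are hypothetical workload values.
fp16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8, bytes_per_value=2)
int8 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8, bytes_per_value=1,
                      scale_bytes_per_head=4)  # assumed FP32 scale per token and head

print(f"FP16 KV cache: {fp16 / GIB:.2f} GiB")  # 16.00 GiB
print(f"INT8 KV cache: {int8 / GIB:.2f} GiB")  # 8.25 GiB
```

Under these assumptions the per-token-per-head scales add only about 3% on top of the INT8 payload, so the cache shrinks to roughly half its FP16 size.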
Results
next-token latency (Llama2-70B)
attention layer time (FlashAttention vs SlimAttention)
throughput (Llama2-7B, without first token)
KV cache size (example)
Who Should Care
What To Try In 7 Days
Integrate SlimAttention for CPU-backed attention layers and measure per-attention latency on your CPU fleet.
Prototype INT8 KV cache with per-token-per-head scales and the hybrid kernel to check memory savings and output quality.
Adopt the oneCCL pattern: broadcast token IDs rather than larger tensors, compute top-k on each worker, then reduce the candidates; implement zero-copy writes into the communication buffer.
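The top-k-then-reduce step from the last item can be prototyped without any communication library. The sketch below simulates the workers in a single process and assumes the output logits are sharded across workers by vocabulary range, which this summary does not confirm. It makes no oneCCL calls, and all function names are hypothetical.

```python
import heapq
from typing import List, Tuple

Candidate = Tuple[float, int]  # (logit, global token id)


def local_top_k(shard_logits: List[float], vocab_offset: int, k: int) -> List[Candidate]:
    """Each worker ranks only its own vocabulary shard and reports k candidates."""
    indexed = ((logit, vocab_offset + i) for i, logit in enumerate(shard_logits))
    return heapq.nlargest(k, indexed)


def reduce_top_k(per_worker: List[List[Candidate]], k: int) -> List[Candidate]:
    """Merge the small candidate lists instead of gathering full logit vectors."""
    merged = [c for candidates in per_worker for c in candidates]
    return heapq.nlargest(k, merged)


def next_token(shards: List[List[float]], k: int) -> int:
    """Greedy pick from the reduced candidates; the returned ID is what would be broadcast."""
    offsets = [0]
    for shard in shards[:-1]:
        offsets.append(offsets[-1] + len(shard))
    per_worker = [local_top_k(s, off, k) for s, off in zip(shards, offsets)]
    _best_logit, best_id = reduce_top_k(per_worker, k)[0]
    return best_id


# Hypothetical logits: three workers, each holding three vocabulary entries.
shards = [[0.1, 2.0, -1.0], [0.5, 3.5, 0.0], [1.0, 0.2, 3.0]]
print(next_token(shards, k=2))  # 4
```

With this pattern each worker contributes k (logit, ID) pairs per step instead of its whole logit shard, and only a single integer token ID has to travel back to the workers.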
Optimization Features
Token Efficiency
- reduce KV cache memory to enable longer contexts or larger batches
Infra Optimization
- multi-socket scaling across Xeon machines
- custom kernels tuned for x86 AVX512
Model Optimization
- model-specific layer optimizations for Qwen, Llama, ChatGLM, Baichuan, OPT
System Optimization
- oneCCL-based distributed inference
- AVX512 FMA usage
- zero-copy integration between compute and communication
Inference Optimization
- SlimAttention (1-D attention decomposition)
- FlashAttention used as the comparison baseline
- INT8 KV-cache with per-token-per-head scaling
- Hybrid INT8->FP32 MatMul kernel (AVX512 intrinsics)
- zero-copy communication buffer writes
- broadcast token IDs + reduce top-k pattern
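One way to read "1-D attention decomposition" is that the query rows are split into blocks and each block is scored against the full key set in one pass, with no tiling along the key dimension. The sketch below illustrates that reading in pure Python. It is an assumption about the idea, not the authors' kernel; it omits causal masking and multi-head handling, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one full score row."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_rows(q_block: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Attention for a block of query rows against the complete key/value set."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in q_block:
        # The whole score row (one entry per key) is materialised here,
        # which is the larger intermediate buffer this layout implies.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def one_dim_split_attention(queries: Matrix, keys: Matrix, values: Matrix,
                            block_rows: int) -> Matrix:
    """Split only along the query dimension; each block could run on its own thread."""
    out: Matrix = []
    for start in range(0, len(queries), block_rows):
        out.extend(attention_rows(queries[start:start + block_rows], keys, values))
    return out


# Hypothetical toy inputs: 3 queries, 2 keys/values, head dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [1.0, 4.0]]
print(one_dim_split_attention(queries, keys, values, block_rows=2))
```

Because each query row sees all keys at once, the softmax needs no running rescaling across key tiles; the cost is a score buffer that grows with context length.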
Reproducibility
Code Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Experiments were run on a single CPU model (Intel Xeon 8563C); results may differ on other CPUs, especially those without AVX512.
- INT8 KV-cache precision claims are not supported by explicit accuracy/quality numbers in the paper.
- Benchmarks focus on Llama2 variants; generality to other model families and MoE models is untested.
When Not To Use
- If you have abundant GPU resources that already meet latency/throughput targets.
- On CPU hardware without AVX512 or similar vector instructions required by the custom kernels.
- When you require formal guarantees on numeric parity; INT8 KV caching may change numerical outputs.
Failure Modes
- INT8 KV storage can reduce numeric fidelity if per-token/head scaling is miscomputed or underflow/overflow occurs.
- SlimAttention requires larger intermediate buffers, which may cause memory pressure at very large context sizes.
- Distributed gains may be limited by network latency or non-optimized communication stacks despite oneCCL patterns.
Core Entities
Models
- Llama2-7B
- Llama2-70B
- Qwen
- ChatGLM
- Baichuan
- OPT
Metrics
- next-token latency (ms)
- attention layer time (ms)
- throughput (tokens/s)
- KV cache size (GB)
Context Entities
Models
- Transformer decoder-only models

