158 papers found

Atom: 4-bit weight+activation quantization that boosts LLM serving throughput up to 7.7× with minimal accuracy loss

0.70
0.60
0.80
23

Atom can multiply token throughput per GPU and shrink KV-cache memory, lowering cloud GPU costs or increasing capacity without major task-accuracy loss.

Key finding

Atom increases end-to-end serving throughput up to 7.73× vs FP16 and 2.53× vs INT8 under similar latency targets.

Numbers: 7.73× vs FP16; 2.53× vs INT8 (paper Figure 10 / §5.3.2)

Train quantized LLMs without original data and quantize KV cache to reach practical 4-bit weights

0.70
0.60
0.80
15

LLM-QAT lets you reduce model memory and improve throughput by quantizing weights and KV cache to 4 bits while keeping quality close to full precision, which can lower hosting cost and enable longer contexts on the same hardware.

Key finding

Fine-tuning on data-free generated samples with logit distillation outperforms fine-tuning on real data for zero-shot tasks.

Numbers: Generated-data (hybrid sampling) avg zero-shot 63.1 vs C4 61.5 (Table 3)

Cut KV-cache memory up to 5× at test time by storing only the persistently important tokens

0.70
0.60
0.80
10

Shrinking KV cache at inference increases batch size and throughput on fixed-memory GPUs without retraining, cutting hosting cost or enabling longer contexts on the same hardware.

Key finding

KV cache can be several times larger than model weights and becomes the memory bottleneck.

Numbers: KV cache 2.5× larger than weights (Table 1; e.g., OPT-175B weights 325 GB vs KV 1152 GB at batch 128, seq 2048)
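
The OPT-175B figures can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the publicly documented OPT-175B shape (96 layers, hidden size 12288) and FP16 storage; the function name and parameters are illustrative and not taken from the paper.

```python
GIB = 2**30  # the "GB" figures quoted above line up with binary gibibytes

def kv_cache_bytes(n_layers, d_model, seq_len, batch, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every token in the batch."""
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * n_layers * d_model * seq_len * batch * bytes_per_elem

# Assumed OPT-175B shape: 96 layers, hidden size 12288, FP16 (2 bytes/element).
kv_gib = kv_cache_bytes(n_layers=96, d_model=12288, seq_len=2048, batch=128) / GIB
weights_gib = 175e9 * 2 / GIB  # roughly 175B parameters at 2 bytes each

print(f"KV cache: {kv_gib:.1f} GiB, weights: {weights_gib:.1f} GiB")
```

Because the cache grows linearly in both batch size and sequence length while the weights stay fixed, the cache overtakes the weights quickly at serving-scale batch sizes.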

Compress KV cache to sub-4-bit with <0.1 PPL loss and enable million‑to‑10M token inference

0.70
0.70
0.80
9

Cut KV cache memory 3–7× and preserve accuracy so you can serve much longer contexts on existing GPUs, reducing infrastructure cost or enabling new long-document features.

Key finding

3-bit KV cache with 1% sparse outliers keeps perplexity near fp16 on Wikitext-2

Numbers: LLaMA-7B PPL 5.75 vs fp16 5.68 (+0.07)
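
Why a small set of outliers matters can be illustrated with a toy experiment. The sketch below is not the paper's method: it uses plain uniform quantization on synthetic data, and every name in it is hypothetical. It only shows that keeping about 1% of the largest-magnitude values in full precision shrinks the range the remaining 3-bit levels must cover.

```python
import random

def quantize_uniform(xs, bits):
    """Round each value to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / ((1 << bits) - 1) or 1.0  # fall back if all values are equal
    return [lo + round((x - lo) / scale) * scale for x in xs]

def dense_and_sparse(xs, bits=3, outlier_frac=0.01):
    """Keep the largest-magnitude values exact; quantize the rest."""
    k = max(1, int(len(xs) * outlier_frac))
    by_magnitude = sorted(range(len(xs)), key=lambda i: abs(xs[i]), reverse=True)
    outliers = set(by_magnitude[:k])
    dense = quantize_uniform([x for i, x in enumerate(xs) if i not in outliers], bits)
    dense_iter = iter(dense)
    return [xs[i] if i in outliers else next(dense_iter) for i in range(len(xs))]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
values = [random.gauss(0.0, 1.0) for _ in range(1000)]
values[10], values[500] = 50.0, -40.0  # two synthetic outliers (assumption)

err_plain = mse(values, quantize_uniform(values, bits=3))
err_sparse = mse(values, dense_and_sparse(values, bits=3))
print(f"plain 3-bit MSE: {err_plain:.3f}, with 1% outliers kept: {err_sparse:.3f}")
```

The trade-off is that the exact values need a separate sparse store (index plus full-precision value), which is why the outlier fraction is kept small.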

Learned token-dropping that prunes up to 80% of context to speed and shrink autoregressive Transformers

0.70
0.60
0.75
8

You can cut inference memory and often double throughput on long prompts by fine-tuning a small module, reducing costs for batched long-context services.

Key finding

The fine-tuned model can remove ~80% of prior tokens with almost no perplexity loss.

Numbers: 80.35% sparsity → −0.085 avg perplexity (context=1000) vs dense

BurstGPT: 10.3M real-world LLM traces (Azure) to test and tune serving systems

0.70
0.40
0.60
7

Using real LLM traces uncovers burst-driven failures and KV-cache pressure that synthetic tests miss, letting teams fix reliability and reduce wasted GPU costs before production.

Key finding

Real LLM traces are highly bursty and differ from common cloud/function workloads.

Numbers: Mean RPS: MAF 1.64 vs LLM conv 0.019, LLM API 0.21 (ChatGPT)

Serve thousands of LoRA adapters from one machine by paging adapters and batching LoRA compute

0.80
0.60
0.80
7

If you sell many small fine-tuned models (per-user or per-task), S-LoRA lets one machine host thousands of adapters, cutting GPU costs and raising throughput compared to naive merging or swapping.

Key finding

S-LoRA scales to thousands of adapters on one machine.

Numbers: Served 2,000 adapters on a single A100 (80GB) in experiments

W4A8KV4 (4-bit weight, 8-bit activation, 4-bit KV) plus system kernels to double LLM serving throughput on common GPUs

0.80
0.60
0.80
6

QServe turns 4-bit quantization into real GPU speedups and memory savings, cutting serving cost per token (authors report ~3× dollar cost reduction by using L40S+QServe versus A100+TensorRT-LLM).

Key finding

Prior 4-bit methods incur large runtime dequantization overhead on GPUs.

Numbers: 20–90% runtime overhead reported for dequantization

Keyformer halves KV cache by keeping only 'key' tokens, doubling token throughput with no fine-tuning

0.80
0.65
0.80
6

Keyformer cuts memory traffic and latency for long-context generation without retraining, lowering inference cost and enabling higher throughput on existing GPU servers.

Key finding

Attention concentrates on a small subset of tokens ("key tokens").

Numbers: ≈90% of attention mass on ~40% of tokens (Fig.3b)
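
A concentration figure like this can be measured on any attention row. The helper below is a hypothetical sketch (its name and the example weights are assumptions, not from the paper): it sorts attention weights in descending order and reports the smallest fraction of tokens whose weights sum to the requested share of the total.

```python
def tokens_for_mass(attn, mass=0.9):
    """Smallest fraction of tokens whose attention weights cover `mass` of the total."""
    threshold = mass * sum(attn)
    covered = 0.0
    for count, weight in enumerate(sorted(attn, reverse=True), start=1):
        covered += weight
        if covered >= threshold:
            return count / len(attn)
    return 1.0

# Made-up attention weights for ten tokens (unnormalized is fine).
row = [40, 25, 15, 12, 3, 2, 1, 1, 0.5, 0.5]
print(tokens_for_mass(row, mass=0.9))
```

A cache-eviction policy built on this observation keeps the high-mass tokens and drops the rest; how the "key" tokens are actually scored is specific to each method.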

DeepSpeed-FastGen: up to 2.3x effective throughput and much lower tail latency for LLM serving

0.80
0.45
0.65
5

FastGen can double effective throughput and sharply cut tail latency on long-prompt, streaming chat workloads, which lowers GPU cost per successful request and improves user-perceived responsiveness.

Key finding

Effective throughput improved up to 2.3× versus vLLM under chat-style SLAs.

Numbers: up to 2.3x effective throughput (Section 4.3)

Compress LLM KV cache up to 10× with per-token, outlier-aware quantization and little accuracy loss

0.70
0.45
0.80
3

QAQ can cut the GPU memory used by KV caches by ~8–10×, enabling longer-context features or reducing GPU requirements and cost with little accuracy loss.

Key finding

Key vectors are more sensitive to quantization than value vectors; uniform 2-bit quantization harms keys far more than values.

Numbers: LLaMA-2-7B experiment: 2-bit value quantization retains high accuracy; 2-bit key quantization causes large accuracy drop

Make LLM inference fully 4-bit by rotating away activation outliers

0.80
0.70
0.80
3

QuaRot makes production LLM inference much cheaper and memory-light by enabling true end-to-end 4-bit execution and large KV cache compression, so hosting large models on cheaper GPUs or smaller clusters becomes practical.

Key finding

4-bit end-to-end quantization on LLAMA2-70B with small accuracy loss

Numbers: WikiText-2 PPL +0.47 (3.32 → 3.79); zero-shot avg drop ~1.09 pts
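
The idea of rotating outliers away rests on a simple identity: for an orthogonal matrix Q, (xQ)(QᵀW) = xW, so the layer output is unchanged while the activation entries are mixed together. The sketch below demonstrates this with a fixed 4×4 Hadamard matrix and made-up values; it is an illustration of the principle under those assumptions, not the paper's pipeline.

```python
# 4x4 Hadamard matrix scaled by 1/2 so that Q is orthogonal (Q @ Q.T == I).
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]
Q = [[h / 2 for h in row] for row in H4]

def vecmat(x, M):
    """Row vector times matrix."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def matmul(A, B):
    return [vecmat(row, B) for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

x = [10.0, 0.1, -0.2, 0.3]            # activation with one outlier channel
W = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0], [1.0, 1.0]]  # toy weights

x_rot = vecmat(x, Q)                  # rotated activation: outlier is spread out
W_rot = matmul(transpose(Q), W)       # weights absorb the inverse rotation

y = vecmat(x, W)
y_rot = vecmat(x_rot, W_rot)
print(max(abs(v) for v in x), max(abs(v) for v in x_rot))
```

With the peak magnitude roughly halved in this toy case, a low-bit quantizer applied to the rotated activation wastes fewer levels on a single extreme channel.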

MiniCache: merge adjacent layers' KV caches to cut memory and speed up LLM inference

0.80
0.60
0.80
3

MiniCache cuts KV cache memory by up to 41% and can raise throughput ~5× without retraining, enabling lower GPU costs, larger batches, and longer contexts for production LLM services.

Key finding

Up to 5.02× KV cache compression when combined with 4-bit KV quantization.

Numbers: 5.02× compression (Table 1, LongBench average)

MemServe: a MemPool that adds context caching to disaggregated LLM serving, cutting job times and first-token delays

0.70
0.70
0.80
2

MemServe lets operators combine caching and disaggregated inference to cut end-to-end latency and tail times for many chat and long-context workloads, lowering hardware cost per request and improving user-perceived responsiveness.

Key finding

Disaggregated inference plus MemPool reduces job completion time compared to colocated baseline.

Numbers: JCT improved up to 42% (P99) on ShareGPT

Quantize weights + KV cache (not all activations) to save large memory with much less accuracy loss

0.70
0.60
0.80
2

WKVQuant cuts decoding memory of 13B models from ~27GB to ~7GB while keeping accuracy near full-precision; this enables cheaper GPU options and larger batch/sequence support without retraining.

Key finding

WKVQuant (W4KV4) maintains long-input task performance close to full precision and weight-only quantization while far outperforming weight+activation (W4A4) on long-context tasks.

Numbers: LLaMA-2-13B Longtext avg: FP16 34.12, GPTQ W4 34.06, OmniQuant W4A4 16.35, WKVQuant W4KV4 32.52

Keep early 'pivot' tokens' KV cache full‑precision to cut quantization error and restore LLM accuracy

0.70
0.60
0.80
2

A tiny precomputed full‑precision KV prefix fixes a large fraction of quality loss from 3–4 bit quantization, enabling cheaper LLM serving while keeping near‑full performance.

Key finding

Pivot tokens produce very large activation peaks and create attention sinks that dominate attention scores.

Numbers: activation peaks >1e3 at pivot channels

FlashInfer: a JIT‑compiled, block‑sparse attention engine that cuts LLM inference latency and supports custom attention variants

0.80
0.60
0.75
2

FlashInfer can cut inference latency and increase throughput in production LLM services, lowering GPU costs per query and improving user responsiveness.

Key finding

FlashInfer reduces inter‑token latency versus a Triton backend in LLM serving

Numbers: 29–69% ITL reduction (Sec. 4, Abstract)

Compress long contexts into cached activations (beacons) to cut KV memory 8x and speed inference ~2x while keeping quality

0.70
0.60
0.70
2

Cuts serving memory by up to 8x and halves latency on long inputs while keeping task quality, letting teams process far larger documents at lower GPU cost.

Key finding

Compression preserves generation quality on evaluated long-context benchmarks.

Numbers: Single-Doc: Ours 34.9 vs Full-FT 34.8 (LongBench Table 1)

ChunkAttention: share KV cache by chunking prompt prefixes to speed self-attention 3.2–4.8×

0.85
0.70
0.80
1

If many requests reuse the same system prompt, ChunkAttention cuts attention latency and KV memory dramatically, letting you serve more users from the same GPUs or reduce cloud costs.

Key finding

Self-attention kernel becomes 3.2–4.8× faster on A100 when many requests share long prompt prefixes.

Numbers: kernel speedup 3.2–4.8× (ns=1024..4096)
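
Sharing KV cache across prompts that start with the same prefix can be pictured as a prefix tree whose nodes are fixed-size token chunks. The following is a minimal sketch under that assumption; the class and function names are hypothetical, and a real implementation would attach key/value tensors to each node and handle partially filled chunks more carefully.

```python
class ChunkNode:
    """One chunk of tokens in the prefix tree; KV tensors would live here."""
    def __init__(self):
        self.children = {}

def insert_prompt(root, tokens, chunk_size):
    """Insert a prompt and return how many new chunks had to be allocated."""
    node, allocated = root, 0
    for start in range(0, len(tokens), chunk_size):
        chunk = tuple(tokens[start:start + chunk_size])
        if chunk not in node.children:
            node.children[chunk] = ChunkNode()  # cache miss: new KV chunk
            allocated += 1
        node = node.children[chunk]             # cache hit: reuse existing chunk
    return allocated

root = ChunkNode()
system_prompt = list(range(8))                  # shared prefix (token ids)
first = insert_prompt(root, system_prompt + [100, 101], chunk_size=4)
second = insert_prompt(root, system_prompt + [200, 201], chunk_size=4)
print(first, second)
```

The second request only allocates the chunk that holds its own suffix, which is where the memory saving for shared system prompts comes from.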

Save and reuse attention KV caches across turns to cut LLM serving latency and cloud cost

0.80
0.60
0.85
1

If your product uses chat or multi-turn flows, caching KV states and overlapping cache IO with GPU work can cut latency and cloud GPU costs dramatically.

Key finding

Time-to-first-token (TTFT) drops dramatically when cached KV hits occur

Numbers: TTFT reduced by up to 87% (Figure 14)

Quest speeds long-context LLM decoding by loading only the KV cache pages likely relevant to the current query

0.70
0.70
0.80
1

Quest reduces memory bandwidth and decode latency for very long-context LLM calls, lowering GPU cost per request and improving responsiveness for document-heavy applications.

Key finding

Quest achieves large self-attention speedups by loading only top-K pages instead of the full KV cache.

Numbers: 7.03× self-attention speedup at 32K seq, token budget 2048
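
Page selection of this kind can be sketched as follows: store the per-channel minimum and maximum of the keys in each page, use them to compute an upper bound on the query-key dot product for that page, and load only the pages with the highest bounds. The code is an illustrative sketch with hypothetical names and toy 2-dimensional keys, not the paper's kernels.

```python
def page_metadata(keys, page_size):
    """Per-page (min, max) of each key channel."""
    pages = [keys[i:i + page_size] for i in range(0, len(keys), page_size)]
    return [([min(c) for c in zip(*p)], [max(c) for c in zip(*p)]) for p in pages]

def select_pages(query, metadata, top_k):
    """Indices of the top_k pages ranked by an upper bound on q . k."""
    def upper_bound(mins, maxs):
        # For each channel, the largest possible product given the page's range.
        return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))
    scores = [upper_bound(mins, maxs) for mins, maxs in metadata]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

keys = [[1.0, 0.0], [0.9, 0.1],     # page 0
        [-1.0, 0.0], [-0.8, 0.2],   # page 1
        [0.0, 1.0], [0.1, 0.9]]     # page 2
meta = page_metadata(keys, page_size=2)
chosen = select_pages([1.0, 0.0], meta, top_k=2)
print(chosen)
```

Only the metadata is read for every page; the full keys and values are fetched just for the selected pages, which is what reduces memory traffic at long sequence lengths.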

Reuse multimodal KV caches at any position to cut first-token latency and double serving throughput

0.60
0.60
0.70
1

For multimodal services that reuse the same images or files, MPIC can cut prefill latency roughly in half and double serving throughput, lowering per-request compute and improving capacity without changing model weights.

Key finding

MPIC-32 reduces Time-to-First-Token (TTFT) by up to 54.1% versus prefix caching.

Numbers: TTFT reduced up to 54.1% (Fig.9; §5.2)

Stream KV-cache to cut pipeline bubbles, reduce GPU memory, and recover fast for pipeline-parallel LLMs

0.75
0.60
0.70
1

DéjàVu cuts wasted GPU time, reduces memory needs, and shortens recovery after node failures—so you can serve larger LLMs cheaper and more reliably in pipeline-parallel clusters.

Key finding

Disaggregation increases throughput versus a pipeline-parallel baseline

Numbers: throughput improvement vs FasterTransformer on OPT-66B and BLOOM-176B

Compress KV cache by low-rank SVD on KV weight matrices with a layerwise progressive rule

0.80
0.60
0.80
1

LoRC halves KV cache memory in many LLaMA deployments with near-zero impact on accuracy, lowering GPU cost and enabling larger batches or longer contexts on the same hardware.

Key finding

LoRC reduces KV cache size by about 55–60% while keeping average performance drop under 1% on evaluated tasks.

Numbers: 55%–60% compression; avg perf drop <1%