Efficient Inference Papers — Parsed & Scored for Practitioners

Task‑agnostic structured pruning for LLMs that cuts params and memory, then recovers in hours with 50K samples

0.70

0.60

0.70

73

You can cut ~20% of an LLM's parameters to save memory and speed up inference without losing most zero‑shot ability, then re-tune it in hours on one GPU with modest public data, which is much cheaper than retraining or distillation.

Key finding

20% parameter pruning can be recovered to near-original zero-shot accuracy with limited tuning

Numbers: Pruned 20% → average accuracy 60.07; baseline 63.25 → 94.97% retained

Jamba: hybrid Transformer + Mamba + MoE that fits long contexts in one 80GB GPU

0.30

0.80

0.70

40

Jamba lets teams process much longer documents on standard GPUs while cutting memory needs and speeding up inference; that reduces infrastructure cost for long-document products and enables new features like huge-context summarization.

Key finding

Hybrid Jamba reduces KV cache for 256K tokens to 4GB.

Numbers: KV cache (256K, 16bit): Jamba 4GB vs Mixtral 32GB vs Llama‑2 128GB
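The three figures above can be reproduced with the usual KV-cache size formula (keys plus values, per attention layer, per KV head). This is a minimal sketch; the layer counts, KV-head counts, and head dimension below are assumptions about each model's architecture and are not stated in the listing.

```python
def kv_cache_gib(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """KV-cache size in GiB: one key and one value vector per token, head, and attention layer."""
    total_bytes = 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value * batch
    return total_bytes / 2**30

SEQ = 256 * 1024  # 256K tokens, 16-bit values

# Assumed architecture figures (hypothetical inputs, not taken from the listing).
llama2_7b = kv_cache_gib(attn_layers=32, kv_heads=32, head_dim=128, seq_len=SEQ)
mixtral = kv_cache_gib(attn_layers=32, kv_heads=8, head_dim=128, seq_len=SEQ)
# Assumption: only 4 of Jamba's layers are attention layers; the rest keep no KV cache.
jamba = kv_cache_gib(attn_layers=4, kv_heads=8, head_dim=128, seq_len=SEQ)

print(llama2_7b, mixtral, jamba)
```

Under these assumptions the saving comes from two independent factors: fewer KV heads per layer and far fewer layers that hold a cache at all.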

Near-lossless 3-bit LLM quantization with dense-and-sparse weights

0.80

0.70

0.80

23

SqueezeLLM cuts model storage and single‑request latency by ~2× while keeping near‑FP16 quality, enabling cheaper and faster on‑prem or cloud inference for generative LLMs.

Key finding

3‑bit dense SqueezeLLM on LLaMA‑7B achieves perplexity 7.75 on C4 versus FP16 7.08 and GPTQ 9.55.

Numbers: LLaMA‑7B (3‑bit): SqueezeLLM PPL 7.75, FP16 7.08, GPTQ 9.55
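The dense-and-sparse idea in the headline can be illustrated on a toy matrix: keep a small fraction of outlier weights in full precision and quantize the rest on a narrower range. This is a sketch only. It uses plain uniform round-to-nearest as a stand-in quantizer, and the 0.5% outlier fraction and the synthetic weights are assumptions, not the paper's settings.

```python
import numpy as np

def uniform_quant(x, bits):
    """Round-to-nearest on a uniform grid spanning [x.min(), x.max()]."""
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / scale) * scale + lo

def dense_and_sparse(w, outlier_frac=0.005, bits=3):
    """Keep the largest-magnitude weights exact (sparse part), quantize the rest (dense part)."""
    k = max(1, int(round(outlier_frac * w.size)))
    thresh = np.partition(np.abs(w).ravel(), -k)[-k]   # k-th largest magnitude
    mask = np.abs(w) >= thresh
    dense_q = uniform_quant(np.where(mask, 0.0, w), bits)
    return np.where(mask, w, dense_q), mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w[rng.integers(0, 64, 8), rng.integers(0, 64, 8)] = 20.0  # synthetic outliers (assumption)

mse_plain = float(np.mean((w - uniform_quant(w, 3)) ** 2))
w_hat, mask = dense_and_sparse(w)
mse_split = float(np.mean((w - w_hat) ** 2))
print(mse_plain, mse_split)
```

Removing the outliers shrinks the range the 3-bit grid has to cover, which is why the split version reconstructs the bulk of the weights more accurately.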

Atom: 4-bit weight+activation quantization that boosts LLM serving throughput up to 7.7× with minimal accuracy loss

0.70

0.60

0.80

23

Atom can multiply token throughput per GPU and shrink KV-cache memory, lowering cloud GPU costs or increasing capacity without major task-accuracy loss.

Key finding

Atom increases end-to-end serving throughput up to 7.73× vs FP16 and 2.53× vs INT8 under similar latency targets.

Numbers: 7.73× vs FP16; 2.53× vs INT8 (paper Figure 10 / §5.3.2)

Fine-tune LLMs directly in low-bit (INT4/INT3/INT2) and deploy the merged quantized model without accuracy loss

0.70

0.40

0.80

21

QA-LoRA lets teams fine-tune large models on far fewer GPUs, produce merged low-bit models, and deploy faster, cheaper INT4 inference without the accuracy loss from post-quantization.

Key finding

QA-LoRA improves few-shot MMLU accuracy vs QLoRA+GPTQ on LLaMA-7B (Alpaca, 4-bit) in the reported experiments.

Numbers: 5-shot avg: QA-LoRA 39.4% vs QLoRA+GPTQ 36.0% (Table 1)

Cluster activation channels by range, reorder them, then quantize—cuts LLM activation memory up to ~80% while keeping accuracy near FP16

0.70

0.80

20

RPTQ reduces the memory footprint of large language models by up to ~80%, enabling cheaper hosting, fewer GPUs per deployment, and longer context lengths for customers without retraining models.

Key finding

RPTQ enables low-bit activation quantization with small accuracy loss on large OPT models.

Numbers: OPT-175b: W4A8 perplexity loss <0.5; W4A4 loss <3 (evaluated datasets)

ShortGPT: remove low-impact layers to cut ~25% size while keeping ≈90% of performance

0.60

0.50

0.70

15

ShortGPT offers a low-effort way to cut model size and inference cost by removing low-impact layers; it preserves most classification-style performance and stacks with quantization to save more compute.

Key finding

Removing 10 out of 40 layers (25%) from LLaMA2-13B reduced MMLU from 55.0 to 52.2.

Numbers: MMLU 55.0 -> 52.2 (25% layers removed)
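One way to find "low-impact" layers is to score each layer by how much it changes its input. The sketch below assumes the score is one minus the mean cosine similarity between the hidden states entering and leaving a layer; the hidden states are synthetic and the helper names are hypothetical.

```python
import numpy as np

def block_influence(h_in, h_out):
    """1 - mean cosine similarity between a layer's input and output, shape (tokens, hidden)."""
    num = np.sum(h_in * h_out, axis=-1)
    den = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    return float(1.0 - np.mean(num / den))

def layers_to_remove(hidden_states, n_remove):
    """hidden_states[i] is the input to layer i; the last entry is the final output."""
    scores = [block_influence(hidden_states[i], hidden_states[i + 1])
              for i in range(len(hidden_states) - 1)]
    order = np.argsort(scores)  # lowest influence first
    return sorted(int(i) for i in order[:n_remove]), scores

# Synthetic 3-layer example: layer 1 barely changes its input (assumption for illustration).
rng = np.random.default_rng(0)
h0 = rng.normal(size=(16, 8))
h1 = h0 + rng.normal(size=(16, 8))
h2 = h1 + 1e-3 * rng.normal(size=(16, 8))
h3 = h2 + rng.normal(size=(16, 8))

removed, scores = layers_to_remove([h0, h1, h2, h3], n_remove=1)
print(removed, scores)
```

In practice the hidden states would come from running a calibration set through the real model, and removing layers would be followed by re-evaluating on held-out tasks.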

RRTF trains a 15B code LLM by ranking candidate outputs using unit-test and teacher-model feedback; PanGu-Coder2 hits ~62% pass@1 on HumanEval

0.70

0.60

0.70

13

RRTF provides a lower-cost, scalable way to improve code-generation correctness by using unit tests and stronger-model outputs as ranked supervision; this delivers higher-quality code models that are faster and cheaper to run after quantization.

Key finding

PanGu-Coder2 achieves state-of-the-art pass@1 among open-source models on HumanEval.

Numbers: pass@1=61.64% (n=200 sampling); greedy pass@1=62.20%

Measured energy and throughput trade-offs for multi‑GPU LLaMA inference on V100/A100

0.60

0.40

0.80

12

LLM inference can draw hundreds to thousands of watts per deployed model instance; choosing GPU type, shard layout, and power caps materially changes operating costs and latency.

Key finding

A100 gives higher throughput but draws more power than V100.

Numbers: 7B: ~2× throughput gain on A100; 13B: ~1.25× (Fig.2, Fig.3)

Cut KV-cache memory up to 5× at test time by storing only the persistently important tokens

0.70

0.60

0.80

10

Shrinking KV cache at inference increases batch size and throughput on fixed-memory GPUs without retraining, cutting hosting cost or enabling longer contexts on the same hardware.

Key finding

KV cache can be several times larger than model weights and becomes the memory bottleneck.

Numbers: KV cache 2.5–5× larger than weights (Table 1; e.g., OPT-175B weights 325GB vs KV 1152GB at batch128, seq2048)

Compress KV cache to sub-4-bit with <0.1 PPL loss and enable million‑to‑10M token inference

0.70

0.80

9

Cut KV cache memory 3–7× and preserve accuracy so you can serve much longer contexts on existing GPUs, reducing infrastructure cost or enabling new long-document features.

Key finding

3-bit KV cache with 1% sparse outliers keeps perplexity near fp16 on Wikitext-2

Numbers: LLaMA-7B PPL 5.75 vs fp16 5.68 (+0.07)

Keep a few sensitive weight columns in high precision, quantize the rest to reach ~3 bits with near-4-bit quality and tiny overhead

0.80

0.65

0.82

8

OWQ reduces model storage and keeps accuracy with only tiny runtime and storage overheads, enabling deployment of very large LLMs on fewer GPUs and cheaper hardware.

Key finding

Keeping a small set of sensitive columns in fp16 yields large quality gains over uniform 3-bit quantization.

Numbers: OPT-6.7B WikiText-2: OPTQ (3-bit) PPL 12.88 → OWQ (3.01-bit) PPL 11.21

Use FP8 activations and FP4 weights to keep LLM quality while cutting memory and using H100 FP8 support

0.70

0.60

0.70

8

Switching activations to FP8 and weights to FP4 can cut memory and exploit H100 FP8 hardware while keeping model quality—good for deploying large LLMs on constrained inference servers.

Key finding

FP8 activations beat INT8 activations in perplexity across models, with larger wins for larger models.

Numbers: LLaMA-7b W8A8: PPL 10.63 (INT) → 10.38 (FP); drop 0.25

Compress LLaMA-2 7B to 2.1GB (70% fewer params) with 25% faster inference and ~2–3% accuracy drop

0.60

0.70

0.80

8

CompactifAI can cut model storage and runtime costs, enabling on-prem or cheaper-cloud LLM deployment with modest accuracy trade-offs for many tasks.

Key finding

Memory reduced from 27.1 GB to 2.1 GB (93% reduction) on LLaMA‑2 7B using tensorization plus quantization.

Numbers: 27.1 GB → 2.1 GB (93% reduction)

VeRA shares frozen random matrices and learns tiny scaling vectors to cut finetuning params 10–100× with similar performance

0.80

0.70

0.90

8

VeRA slashes the bytes required per adapted model (10–100× less) so firms can store many personalized or task-specific adapters on the same GPU. That reduces serving costs, speeds model swap-in, and lowers storage and network bandwidth for model variants.

Key finding

On GLUE (RoBERTa-large), VeRA matches LoRA average dev performance while using ≈13× fewer trainable parameters.

Numbers: LoRA 0.8M params avg 87.8 vs VeRA 0.061M params avg 87.8
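The two parameter counts above follow from simple per-matrix formulas. This is a sketch under stated assumptions: RoBERTa-large has 24 layers with hidden size 1024, only the query and value projections are adapted, LoRA uses rank 8, and VeRA uses rank 256. None of these settings appear in the listing.

```python
def lora_params(d_in, d_out, rank):
    """LoRA trains A (rank x d_in) and B (d_out x rank) for each adapted matrix."""
    return rank * (d_in + d_out)

def vera_params(d_out, rank):
    """VeRA trains only two scaling vectors: one of length rank, one of length d_out."""
    return rank + d_out

# Assumed setup (hypothetical inputs): 24 layers x 2 adapted matrices, hidden size 1024.
n_matrices = 24 * 2
lora_total = n_matrices * lora_params(1024, 1024, rank=8)
vera_total = n_matrices * vera_params(1024, rank=256)

print(lora_total, vera_total, round(lora_total / vera_total, 1))
```

Because VeRA's count grows with `rank + d_out` rather than `rank * (d_in + d_out)`, it can afford a much larger rank while still training far fewer parameters.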

ZipLM: inference-aware structured pruning that gives runtime speedup guarantees across devices

0.80

0.40

0.80

7

ZipLM cuts inference cost and risk: it produces many valid speedup targets in one run and guarantees measured speedups on target hardware, reducing GPU/CPU time and deployment surprises.

Key finding

ZipLM beats CoFi and TinyBERT on SQuAD at the same speedup.

Numbers: ≈ +3 F1 points vs CoFi at same speedup (SQuAD dev)

Train a cheap router to send 'easy' queries to small models and save cloud cost while keeping quality

0.70

0.60

0.70

6

Route cheap queries to local or smaller models to cut cloud inference costs while keeping user-facing quality high; thresholds let operators trade cost vs quality on demand.

Key finding

The router can route many queries to the small model and keep quality nearly unchanged.

Numbers: 22% fewer large-model calls with <1% drop in BART score (Llama-2 13b vs GPT-3.5-turbo)
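The threshold mechanism can be sketched in a few lines: a router assigns each query an "easy" score, and queries above the threshold go to the small model. The scoring function here is a toy stand-in (shorter queries count as easier), not a trained router, and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Route:
    query: str
    model: str

def route(queries, easy_score, threshold):
    """Send a query to the small model when its predicted 'easy' score clears the threshold."""
    return [Route(q, "small" if easy_score(q) >= threshold else "large") for q in queries]

def toy_score(q):
    """Hypothetical stand-in for a trained router: fewer words means an easier query."""
    return 1.0 / (1.0 + len(q.split()))

queries = [
    "hi",
    "translate cat to French",
    "prove the central limit theorem with full rigor please",
]
cheap = [r.model for r in route(queries, toy_score, threshold=0.15)]
strict = [r.model for r in route(queries, toy_score, threshold=0.3)]
print(cheap, strict)
```

Raising the threshold sends more traffic to the large model, which is the cost-versus-quality dial the summary describes.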

Bonsai: prune large language models using only forward passes to cut memory needs and keep accuracy

0.75

0.70

0.85

6

Bonsai makes structured LLM compression feasible on commodity GPUs, cutting memory needs and producing faster models so teams can reduce inference cost and enable on-device fine-tuning without enterprise hardware.

Key finding

Bonsai cuts pruning memory requirements to inference-only levels, enabling pruning on ≈20GB devices instead of 80–160GB.

Numbers: pruning memory ≈20GB vs 80–160GB for gradient methods

Learn orthonormal rotations to remove outliers and make 4-bit LLMs accurate and fast

0.85

0.70

0.80

6

SpinQuant makes extreme low-bit LLM inference practical: big memory and latency savings with near-full accuracy, using a small calibration step and without changing model APIs.

Key finding

Learned rotations reduce the zero-shot accuracy gap to full precision to 2.9 points on LLaMA-2 7B in W4A4KV4.

Numbers: W4A4KV4 gap = 2.9 points (LLaMA-2 7B)
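Why a rotation helps can be shown with a fixed orthonormal matrix: rotating activations and counter-rotating weights leaves the layer output unchanged while spreading an outlier channel across all channels. The sketch below uses a Hadamard matrix as a stand-in; SpinQuant learns its rotations, which this example does not attempt.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

d = 64
rng = np.random.default_rng(0)
x = np.zeros((1, d))
x[0, 3] = 100.0                      # one synthetic outlier channel (assumption)
w = rng.normal(size=(d, 8))

r = hadamard(d)
x_rot, w_rot = x @ r, r.T @ w        # rotate activations, counter-rotate weights

same_output = bool(np.allclose(x @ w, x_rot @ w_rot))
max_before = float(np.abs(x).max())
max_after = float(np.abs(x_rot).max())
print(same_output, max_before, max_after)
```

A smaller maximum magnitude means a 4-bit grid can use a finer step for the same tensor, which is where the accuracy gain comes from.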

Cut serverless LLM cold-start latency 10–200× with local checkpoint formats, token-only live migration, and startup-aware scheduling

0.70

0.60

0.80

6

Cutting cold-start time from minutes to seconds reduces user-visible latency, lowers GPU idle costs, and increases successful request completion for large models, improving SLA and capacity efficiency.

Key finding

ServerlessLLM's checkpoint loader speeds cold loads 3.6–8.2× over existing loaders

Numbers: 3.6–8.2× faster loading (OPT-2.7B to LLaMA-2-70B)

Reorder table rows and fields to boost LLM prompt-cache reuse and cut latency/costs

0.80

0.60

0.80

6

If you run LLMs over tables in batches, reordering rows and fields can cut inference time and API bills materially by increasing prompt-cache reuse; it is a low-cost software change that often outperforms adding hardware.

Key finding

GGR reduces end-to-end LLM query latency by 1.5–3.4× vs. caching without reordering (Cache Original) on evaluated queries.

Numbers: 1.5–3.4× speedup (Sec 6.2; Fig 3/4)
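The effect of reordering on prefix reuse can be seen on a tiny table. The sketch below is a simplified heuristic, not the paper's GGR algorithm: it puts low-cardinality fields first, sorts rows, and counts how many leading fields each row shares with the previous one. The table and function names are hypothetical.

```python
def shared_prefix_fields(rows):
    """Total number of leading fields each row shares with the row before it."""
    total = 0
    for prev, cur in zip(rows, rows[1:]):
        for a, b in zip(prev, cur):
            if a != b:
                break
            total += 1
    return total

def reorder(rows):
    """Put low-cardinality fields first, then sort rows so equal prefixes sit together."""
    n_fields = len(rows[0])
    field_order = sorted(range(n_fields), key=lambda j: len({r[j] for r in rows}))
    permuted = [tuple(r[j] for j in field_order) for r in rows]
    return sorted(permuted), field_order

# Fields: (id, category, country)
table = [
    ("r1", "books", "US"),
    ("r2", "toys", "US"),
    ("r3", "books", "US"),
    ("r4", "toys", "US"),
]
before = shared_prefix_fields(table)
reordered, field_order = reorder(table)
after = shared_prefix_fields(reordered)
print(before, after, field_order)
```

Each shared leading field is prompt text a prefix cache can reuse, so a higher count translates into fewer tokens that need to be recomputed.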

Keyformer halves KV cache by keeping only 'key' tokens, doubling token throughput with no fine-tuning

0.80

0.65

0.80

6

Keyformer cuts memory traffic and latency for long-context generation without retraining, lowering inference cost and enabling higher throughput on existing GPU servers.

Key finding

Attention concentrates on a small subset of tokens ("key tokens").

Numbers: ≈90% of attention mass on ~40% of tokens (Fig.3b)
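A cache-eviction policy built on this observation keeps a recent window plus the highest-scoring older tokens. The sketch below ranks tokens by plain accumulated attention, which is a simpler stand-in for Keyformer's own scoring function; the budget, window size, and attention matrix are made up for illustration.

```python
import numpy as np

def keep_indices(attn, budget, recent):
    """Choose KV-cache positions to keep: the last `recent` tokens plus the
    highest-scoring earlier tokens, up to `budget` in total.
    attn: (queries, keys) attention weights."""
    n_keys = attn.shape[1]
    score = attn.sum(axis=0)                          # accumulated attention per key
    split = max(0, n_keys - recent)
    older, recent_idx = np.arange(0, split), np.arange(split, n_keys)
    n_top = max(0, budget - len(recent_idx))
    top_idx = older[np.argsort(score[older])[::-1][:n_top]]
    return np.sort(np.concatenate([top_idx, recent_idx]))

# Synthetic attention: keys 1 and 5 receive most of the mass (assumption).
attn = np.full((4, 10), 0.05)
attn[:, 1] = 0.3
attn[:, 5] = 0.3

kept = keep_indices(attn, budget=5, recent=3)         # keep half of the 10 positions
print(kept.tolist())
```

Keeping the recent window regardless of score protects tokens that have not yet had a chance to accumulate attention.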

Pick INT or FP per layer: mixing low-bit formats (MoFQ) improves LLM quantization and speed

0.70

0.60

0.80

5

Mixing low-bit INT and FP per layer can keep model accuracy while cutting model size and quantization time; it fits current hardware that supports both INT and FP low-bit ops and reduces deployment cost.

Key finding

No single format (INT or FP) dominates across layers and bit widths.

Numbers: Weight tensors: INT8 lower MSE than FP8; at 4-bit no consistent winner (figures 4,6).

FlexRound: use element-wise division to learn per-weight scales and a shared grid for better PTQ

0.80

0.60

0.70

5

FlexRound lowers precision without heavy retraining, letting you run large models with INT8/INT4 weights and small calibration sets while keeping near-original accuracy.

Key finding

FlexRound keeps ImageNet accuracy close to full precision when quantizing weights to 4 bits.

Numbers: ResNet-50 Top1 75.95% vs full 76.63% (4-bit weights, Table 2)
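The element-wise division in the headline can be sketched as follows: each weight is divided by a shared grid step times its own learnable divisor before rounding, so a divisor of 1 recovers round-to-nearest and other values can move a weight to a neighboring grid point. This is a simplified form under assumptions; the divisor values are hand-picked rather than learned, and the function name is hypothetical.

```python
import numpy as np

def flexround(w, s1, s, bits=4):
    """Quantize w with a shared grid step s1 and per-weight divisors s (same shape as w)."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / (s1 * s)), -qmax - 1, qmax)
    return s1 * q

w = np.array([0.26, -0.74, 1.10])
s1 = 0.5

rtn = flexround(w, s1, np.ones_like(w))   # s = 1 reduces to round-to-nearest
s = np.array([1.1, 1.0, 1.0])             # hypothetical "learned" divisors
flex = flexround(w, s1, s)
print(rtn.tolist(), flex.tolist())
```

In the actual method the divisors would be trained on a small calibration set to minimize layer output error, so a move that looks worse for a single weight can still lower the error of the layer as a whole.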