Compress LLaMA-2 7B to 2.1GB (70% fewer params) with 25% faster inference and ~2–3% accuracy drop

January 25, 2024 · 9 min

Overview

Decision Snapshot: Needs Validation

Results show large compression and speed gains on LLaMA‑2 7B with modest accuracy loss on many tasks, but math reasoning and hardware-specific quantization behavior need more testing.

Citations: 8

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 7/7

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 70%

Authors

Andrei Tomut, Saeed S. Jahromi, Abhijoy Sarkar, Uygar Kurt, Sukhbinder Singh, Faysal Ishtiaq, Cesar Muñoz, Prabdeep Singh Bajaj, Ali Elborady, Gianni del Bimbo, Mehrazin Alizadeh, David Montero, Pablo Martin-Ramiro, Muhammad Ibrahim, Oussama Tahiri Alaoui, John Malcolm, Samuel Mugel, Roman Orus

Links

Abstract / PDF

Why It Matters For Business

CompactifAI can cut model storage and runtime costs, enabling on-prem or cheaper-cloud LLM deployment with modest accuracy trade-offs for many tasks.

Who Should Care

Summary TLDR

CompactifAI replaces weight matrices in attention and MLP layers with quantum‑inspired tensor networks (matrix product operators, MPOs). The bond dimension acts as a knob that controls compression. After a short 'healing' retrain (<1 epoch on chat datasets), the authors compress LLaMA‑2 7B to 2.1 GB (93% memory reduction) and 2.1B parameters (≈70% fewer). Training runs about 2x faster and inference about 25% faster, and accuracy stays within 2–3% of the original on MMLU, HellaSwag, BoolQ and TriviaQA; math (GSM8K) shows a larger drop.

Problem Statement

Large LLMs are costly to store, train, and run. Existing compression cuts neurons or precision and gives limited control over which correlations are removed. The paper asks: can we compress the correlation space directly, control truncation precisely, and keep accuracy while cutting memory and compute?

Main Contribution

CompactifAI: apply tensor networks (MPOs) to decompose weight matrices in self-attention (SA) and MLP layers, with bond dimension χ as a compression knob.

Show that short retraining ('healing') recovers accuracy after tensorization, making compressed models practical.
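The decomposition idea can be illustrated with a minimal sketch. It assumes a simplified two-core MPO built from a single truncated SVD; the paper's actual factorization, number of cores, and index grouping are not given in this summary, and the names `mpo_two_site` and `mpo_to_matrix` are hypothetical. NumPy is assumed to be installed.

```python
import numpy as np


def mpo_two_site(weight, out_dims, in_dims, chi):
    """Split an (o1*o2, i1*i2) matrix into two MPO cores joined by a bond of size <= chi.

    Illustrative only: a two-core simplification, not the authors' implementation.
    """
    (o1, o2), (i1, i2) = out_dims, in_dims
    assert weight.shape == (o1 * o2, i1 * i2)
    # Group the (o1, i1) indices on the left core and (o2, i2) on the right core.
    t = weight.reshape(o1, o2, i1, i2).transpose(0, 2, 1, 3).reshape(o1 * i1, o2 * i2)
    u, s, vt = np.linalg.svd(t, full_matrices=False)
    k = min(chi, s.size)  # bond dimension actually used
    left = (u[:, :k] * s[:k]).reshape(o1, i1, k)
    right = vt[:k].reshape(k, o2, i2)
    return left, right


def mpo_to_matrix(left, right):
    """Contract the bond index and restore the original (o1*o2, i1*i2) layout."""
    o1, i1, _ = left.shape
    _, o2, i2 = right.shape
    t = np.einsum("aik,kbj->abij", left, right)
    return t.reshape(o1 * o2, i1 * i2)


# Toy example on a random matrix (an assumption; real weights come from the model).
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 16))
left, right = mpo_two_site(w, (4, 4), (4, 4), chi=4)
approx = mpo_to_matrix(left, right)
kept = (left.size + right.size) / w.size  # fraction of parameters kept
rel_err = np.linalg.norm(w - approx) / np.linalg.norm(w)
print(f"parameters kept: {kept:.2f}, relative error: {rel_err:.3f}")
```

A smaller χ keeps fewer parameters and gives a larger truncation error. A random matrix has no exploitable structure, so the error printed here says nothing about trained weights; the healing retrain described above is what the authors rely on to recover accuracy.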

Key Findings

Memory reduced from 27.1 GB to 2.1 GB (93% reduction) on LLaMA‑2 7B using tensorization plus quantization.

Numbers: 27.1 GB → 2.1 GB (93% reduction)

Practical Use: You can cut model storage by an order of magnitude and move models to much cheaper GPUs or on-prem hardware.

Evidence Ref: Table I; paper summary

Parameter count reduced from 7B to 2.1B (≈70% fewer parameters) after tensor network compression.

Numbers: 7B → 2.1B (≈70% fewer)

Practical Use: Fewer parameters reduce distributed-transfer overhead and memory for checkpoints and optimizer states.

Evidence Ref: Table I; multiple paragraphs
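The headline percentages can be re-derived from the quoted sizes. The sketch below uses only the figures stated above; the helper name `reduction_pct` is hypothetical, and the bytes-per-parameter estimate assumes 1 GB = 10^9 bytes.

```python
def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (1.0 - after / before)


memory_cut = reduction_pct(27.1, 2.1)    # GB, figures quoted from Table I
param_cut = reduction_pct(7.0e9, 2.1e9)  # parameter counts quoted from Table I

# Average storage per parameter in the compressed model (assumes 1 GB = 1e9 bytes).
bits_per_param = (2.1e9 * 8) / 2.1e9

print(f"memory reduction: {memory_cut:.1f}%")
print(f"parameter reduction: {param_cut:.1f}%")
print(f"average bits per parameter: {bits_per_param:.1f}")
```

The quoted sizes give a memory reduction slightly above 92%, close to the reported 93%; the gap may reflect rounding in the reported sizes. An average of about 8 bits per parameter is consistent with a mix of FP16 and int4 storage, though the exact split is not stated here.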

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
memory size | 2.1 GB (compressed 93%) | 27.1 GB (original) | −93% | (not given) | Table I shows original 27.1 GB and 93% compressed model at 2.1 GB | Table I
parameter count | 2.1B (compressed) | 7B (original) | ≈−70% | (not given) | Table I reports 2.1B parameters after tensorization vs 7B original | Table I

What To Try In 7 Days

Run layer sensitivity profiling on your model to spot compressible layers.

Tensorize middle-to-end attention and MLP layers using MPOs with small χ.

Perform a short healing retrain (<1 epoch) on a small finetune set and measure accuracy loss vs cost savings.
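The profiling step can be sketched as follows. This is a toy illustration, not the authors' procedure: it assumes a small random ReLU MLP in NumPy, uses plain rank truncation as a stand-in for MPO tensorization, and scores each layer by how much the network output changes when only that layer is compressed. The function names are hypothetical, and the healing retrain is not included.

```python
import numpy as np


def truncate_rank(w, rank):
    """Best rank-`rank` approximation of w via truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def forward(weights, x):
    """Toy MLP: ReLU after every layer except the last."""
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)
    return x @ weights[-1]


def layer_sensitivity(weights, x, rank):
    """Relative output change when each layer alone is compressed to `rank`."""
    ref = forward(weights, x)
    scores = []
    for i, w in enumerate(weights):
        trial = list(weights)
        trial[i] = truncate_rank(w, rank)
        out = forward(trial, x)
        scores.append(float(np.linalg.norm(out - ref) / np.linalg.norm(ref)))
    return scores


# Random weights and probe inputs (assumptions standing in for a real model and data).
rng = np.random.default_rng(0)
toy_weights = [rng.standard_normal((32, 32)) / np.sqrt(32) for _ in range(4)]
probe = rng.standard_normal((64, 32))

scores = layer_sensitivity(toy_weights, probe, rank=8)
# Layers with the lowest score are the first candidates for compression.
ranking = sorted(range(len(scores)), key=scores.__getitem__)
print("sensitivity per layer:", [round(s, 3) for s in scores])
print("most compressible first:", ranking)
```

On a real model the same loop would run over attention and MLP weights with a held-out evaluation metric in place of the output-difference score, followed by the short retrain on the compressed model.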

Optimization Features

Infra Optimization: benefits workloads using many GPUs (less network/transfer overhead)
Model Optimization: tensor network (MPO) decomposition of weight matrices; bond dimension χ controls compression level
System Optimization: compatible with model and data parallelism on multi‑GPU clusters
Training Optimization: reduced GPU↔CPU transfer thanks to far fewer parameters; faster distributed training from smaller parameter footprint
Inference Optimization: smaller forward-pass tensors reduce latency (~25% faster); mixed precision: FP16 for tensorized layers, int4 for others
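To make the int4 part of the mixed-precision setup concrete, here is a minimal sketch of symmetric round-to-nearest 4-bit quantization with one scale per row. The quantization scheme the authors used is not specified in this summary, so the scheme, the per-row scaling, and the function names are all assumptions.

```python
import numpy as np


def quantize_int4(w):
    """Symmetric per-row quantization to integers in [-8, 7].

    Values are held in an int8 array for simplicity; real int4 kernels
    pack two 4-bit values per byte.
    """
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(max_abs == 0, 1.0, max_abs / 7.0)  # avoid dividing by zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map the integer codes back to floating point."""
    return q.astype(np.float32) * scale


# Toy weight matrix (an assumption; real weights come from the non-tensorized layers).
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16))
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())
print(f"max absolute error: {max_err:.4f}")
```

With round-to-nearest, each reconstructed value lies within half a quantization step (`scale / 2`) of the original. Whether int4 storage also speeds up inference depends on kernel support on the target GPU, as noted under Failure Modes.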

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Reported results are for LLaMA‑2 7B; generalization to other models is not demonstrated.

Math benchmark (GSM8K) showed a larger accuracy drop for the most-compressed model.

When Not To Use

When precise numeric or complex reasoning (math) is critical and even small drops are unacceptable.

When you cannot afford any retraining or lack finetuning data.

Failure Modes

Over-compressing initial or last-block layers causes large accuracy loss.

Quantization may increase latency on GPUs without optimized int4 kernels.

Core Entities

Models

LLaMA-2 7B; CompactifAI (tensor network / MPO compressed models); 8-bit quantized LLaMA-2 7B; 4-bit quantized LLaMA-2 7B

Metrics

Accuracy; training time (minutes); inference time (ms); memory size (GB); parameter count

Datasets

Ultrachat; Alpaca; OpenHermes

Benchmarks

MMLU; HellaSwag; BoolQ; TriviaQA; GSM8K

Context Entities

Models

ChatGPT (mentioned); Meta LLaMA family (context)

Datasets

MMLU evaluation data