Compress LLaMA-2 7B to 2.1 GB (70% fewer params) with 25% faster inference and ~2–3% accuracy drop

January 25, 2024 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

8

Authors

Andrei Tomut, Saeed S. Jahromi, Abhijoy Sarkar, Uygar Kurt, Sukhbinder Singh, Faysal Ishtiaq, Cesar Muñoz, Prabdeep Singh Bajaj, Ali Elborady, Gianni del Bimbo, Mehrazin Alizadeh, David Montero, Pablo Martin-Ramiro, Muhammad Ibrahim, Oussama Tahiri Alaoui, John Malcolm, Samuel Mugel, Roman Orus

Links

Abstract / PDF

Why It Matters For Business

CompactifAI can cut model storage and runtime costs, enabling on-prem or cheaper-cloud LLM deployment with modest accuracy trade-offs for many tasks.

Summary TLDR

CompactifAI replaces weight matrices in attention and MLP layers with quantum‑inspired tensor networks (matrix product operators, MPOs). The bond dimension χ acts as the compression knob. After a short ‘healing’ retrain (<1 epoch on chat datasets), the authors compress LLaMA‑2 7B to 2.1 GB (93% memory reduction) and 2.1B parameters (≈70% fewer). Training runs ~2× faster and inference ~25% faster, while accuracy stays mostly within 2–3% of the original on MMLU, HellaSwag, BoolQ and TriviaQA; math (GSM8K) shows a larger drop.

Problem Statement

Large LLMs are costly to store, train, and run. Existing compression cuts neurons or precision and gives limited control over which correlations are removed. The paper asks: can we compress the correlation space directly, control truncation precisely, and keep accuracy while cutting memory and compute?

Main Contribution

CompactifAI: apply tensor networks (MPOs) to decompose weight matrices in self-attention (SA) and MLP layers, with bond dimension χ as a compression knob.

Show that short retraining ('healing') recovers accuracy after tensorization, making compressed models practical.

Combine tensorization with quantization (mixed FP16 and int4) to reach 93% memory reduction and 70% fewer parameters on LLaMA‑2 7B.

Provide layer sensitivity profiling showing middle-to-end layers tolerate much stronger compression than initial layers.
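To make the bond-dimension knob concrete, here is a minimal NumPy sketch that splits one weight matrix into two MPO cores with a truncated SVD. It is an illustration under stated assumptions, not the authors' implementation: the function names `mpo_two_core` and `mpo_to_matrix`, the two-core layout, the 8×8 index split and the random stand-in matrix are all hypothetical choices made for this example.

```python
import numpy as np

def mpo_two_core(W, out_dims, in_dims, chi):
    """Split W (out x in) into two MPO cores joined by a bond of size <= chi.

    Hypothetical helper for illustration; the paper may use more cores
    and a different index grouping.
    """
    (o1, o2), (i1, i2) = out_dims, in_dims
    assert W.shape == (o1 * o2, i1 * i2)
    # Put (o1, i1) on the left site and (o2, i2) on the right site.
    T = W.reshape(o1, o2, i1, i2).transpose(0, 2, 1, 3).reshape(o1 * i1, o2 * i2)
    U, S, Vt = np.linalg.svd(T, full_matrices=False)
    k = min(chi, S.size)  # truncate the bond to at most chi singular values
    A = (U[:, :k] * S[:k]).reshape(o1, i1, k)
    B = Vt[:k].reshape(k, o2, i2)
    return A, B

def mpo_to_matrix(A, B):
    """Contract the bond index and restore the dense (out x in) matrix."""
    o1, i1, _ = A.shape
    _, o2, i2 = B.shape
    T = np.einsum("aik,kbj->abij", A, B)  # axes: (o1, o2, i1, i2)
    return T.reshape(o1 * o2, i1 * i2)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))  # stand-in for a trained weight matrix
for chi in (64, 16, 4):
    A, B = mpo_two_core(W, (8, 8), (8, 8), chi)
    err = np.linalg.norm(W - mpo_to_matrix(A, B)) / np.linalg.norm(W)
    print(f"chi={chi:3d}  params={A.size + B.size:5d}/{W.size}  rel_err={err:.3f}")
```

In this two-core layout the parameter count is χ·(o1·i1 + o2·i2), so it scales linearly with χ, and a bond that is too large stores more numbers than the dense matrix. A random matrix has no structure to exploit, so the truncation error here is large; the sketch only shows the mechanics of the knob.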

Key Findings

Memory reduced from 27.1 GB to 2.1 GB (93% reduction) on LLaMA‑2 7B using tensorization plus quantization.

Numbers: 27.1 GB → 2.1 GB (93% reduction)

Parameter count reduced from 7B to 2.1B (≈70% fewer parameters) after tensor network compression.

Numbers: 7B → 2.1B (≈70% fewer)

Distributed training ran about 2× faster (training time roughly halved) on eight A10G GPUs for the healed tensorized models.

Numbers: training time halved (2× speedup)

Inference forward time improved ≈25% for tensorized models; 4-bit quantization alone can slow some GPUs.

Numbers: inference ~25% faster; int4 quantized model slowed by 13%

Accuracy on common benchmarks mostly within 2–3% of original, but math (GSM8K) dropped more for the most-compressed model.

Numbers: MMLU 46.41 → 44.16 (-2.25); GSM8K 23.05 → 17.74 (-5.31)

Layer sensitivity: early layers are fragile to compression; middle and late attention blocks tolerate aggressive tensorization.

Numbers: middle/end layers compress down to ≈10% with small loss; initial layers advised >50% retained

Results

memory size

Value: 2.1 GB (93% smaller)

Baseline: 27.1 GB (original)

parameter count

Value: 2.1B (compressed)

Baseline: 7B (original)

training time

Value: ≈2× speedup (training time roughly halved)

Baseline: original LLaMA‑2 7B

inference time

Value: ≈25% faster

Baseline: original LLaMA‑2 7B

Accuracy (MMLU)

Value: 44.16

Baseline: 46.41 (original)

Accuracy

Value: 76.54

Baseline: 80.55 (original)

Accuracy (GSM8K)

Value: 17.74

Baseline: 23.05 (original)

Who Should Care

What To Try In 7 Days

Run layer sensitivity profiling on your model to spot compressible layers.

Tensorize middle-to-end attention and MLP layers using MPOs with small χ.

Perform a short healing retrain (<1 epoch) on a small finetune set and measure accuracy loss vs cost savings.
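The profiling step above can be sketched on a toy model. This is a rough illustration only: it uses a truncated SVD as a stand-in for MPO truncation, a small random tanh network in place of an LLM, and relative output drift on a probe batch as a proxy for benchmark accuracy. The names `low_rank`, `forward` and `sensitivity_profile` are hypothetical.

```python
import numpy as np

def low_rank(W, rank):
    """Best rank-`rank` approximation of W (stand-in for MPO truncation)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

def forward(weights, x):
    """Toy network: a stack of linear layers with tanh activations."""
    for W in weights:
        x = np.tanh(x @ W.T)
    return x

def sensitivity_profile(weights, x, rank):
    """Compress one layer at a time and record the relative output drift."""
    ref = forward(weights, x)
    scores = []
    for idx, W in enumerate(weights):
        trial = list(weights)          # shallow copy; other layers untouched
        trial[idx] = low_rank(W, rank)
        out = forward(trial, x)
        scores.append(float(np.linalg.norm(out - ref) / np.linalg.norm(ref)))
    return scores

rng = np.random.default_rng(0)
weights = [rng.standard_normal((32, 32)) / np.sqrt(32) for _ in range(6)]
x = rng.standard_normal((128, 32))     # hypothetical probe batch
scores = sensitivity_profile(weights, x, rank=8)
for idx, s in enumerate(scores):
    print(f"layer {idx}: relative output drift {s:.3f}")
```

On a real model the drift metric would be replaced by a validation loss or benchmark score, and the layers with the smallest scores would be the first candidates for aggressive compression.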

Optimization Features

Infra Optimization

  • benefits workloads using many GPUs (less network/transfer overhead)

Model Optimization

  • tensor network (MPO) decomposition of weight matrices
  • bond-dimension χ controls compression level

System Optimization

  • compatible with model & data parallelism on multi‑GPU clusters

Training Optimization

  • reduced GPU↔CPU transfer thanks to far fewer parameters
  • faster distributed training from smaller parameter footprint

Inference Optimization

  • smaller forward pass tensors reduce latency (~25% faster)
  • mixed precision: FP16 for tensorized layers, int4 for others
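As a rough picture of the int4 half of that mixed-precision setup, the sketch below applies symmetric per-group quantization to a weight matrix. The summary does not say which quantization scheme the authors used, so the scheme, the group size of 32 and the names `quantize_int4` and `dequantize_int4` are assumptions for illustration. The 4-bit values are held in int8 containers here; a real kernel would pack two values per byte.

```python
import numpy as np

def quantize_int4(W, group_size=32):
    """Symmetric per-group quantization to integers in [-8, 7] (assumed scheme)."""
    assert W.size % group_size == 0
    flat = W.reshape(-1, group_size).astype(np.float32)
    # One FP16 scale per group, chosen so the largest magnitude maps to 7.
    scale = (np.abs(flat).max(axis=1, keepdims=True) / 7.0).astype(np.float16)
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(flat / scale.astype(np.float32)), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale, shape):
    """Map the stored integers back to approximate float weights."""
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)  # stand-in weight matrix
q, scale = quantize_int4(W)
W_hat = dequantize_int4(q, scale, W.shape)
print("max abs error:", float(np.abs(W - W_hat).max()))
```

Whether this lowers latency depends on the hardware: the dequantization step adds work, which is consistent with the note that int4 can slow inference on GPUs without optimized low-bit kernels.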

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Reported results are for LLaMA‑2 7B; generalization to other models not demonstrated.
  • Math benchmark (GSM8K) showed a larger accuracy drop for the most-compressed model.
  • Quantization speed depends on GPU generation; int4 can slow inference on some hardware.
  • Healing was brief (<1 epoch); gains may require dataset-specific finetuning in practice.

When Not To Use

  • When precise numeric or complex reasoning (math) is critical and even small drops are unacceptable.
  • When you cannot afford any retraining or lack finetuning data.
  • When deployment hardware poorly supports low-bit operations (int4).

Failure Modes

  • Over-compressing initial or last-block layers causes large accuracy loss.
  • Quantization may increase latency on GPUs without optimized int4 kernels.
  • Insufficient healing/finetuning leaves compressed model underperforming on specialized tasks.

Core Entities

Models

  • LLaMA-2 7B
  • CompactifAI (tensor network / MPO compressed models)
  • 8-bit quantized LLaMA-2 7B
  • 4-bit quantized LLaMA-2 7B

Metrics

  • Accuracy
  • training time (minutes)
  • inference time (ms)
  • memory size (GB)
  • parameter count

Datasets

  • Ultrachat
  • Alpaca
  • OpenHermes

Benchmarks

  • MMLU
  • HellaSwag
  • BoolQ
  • TriviaQA
  • GSM8K

Context Entities

Models

  • ChatGPT (mentioned)
  • Meta LLaMA family (context)

Datasets

  • MMLU evaluation data