Lillama: one‑shot, low‑rank feature distillation to shrink LLMs fast on one A100

December 21, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical: code exists, runs on one A100, and shows consistent memory/speed gains across models; evidence is empirical across several architectures but limited to the reported benchmarks and hardware.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 75%

Novelty: 60%

Authors

Yaya Sy, Christophe Cerisara, Irina Illina

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Lillama lets teams cut model size and GPU memory quickly with only millions of calibration tokens, enabling cheaper deployment and larger context windows without large retraining costs.

Who Should Care

Summary TLDR

Lillama is a fast one‑shot compression method that replaces selected weight matrices with low‑rank factors and locally distills layer activations. Key practical ingredients are SVD initialization, a joint Teacher+Student activation loss, and a bottom‑first layer selection that limits memory. Using ~13M calibration tokens, the authors remove ~10B parameters from Mixtral‑8x7B in minutes on a single A100 while keeping >95% of the base model's zero‑shot accuracy. The method also applies beyond standard dense Transformer LLMs (Mixture-of-Experts, Mamba, Whisper speech). Code is provided.

Problem Statement

Modern LLM compression often needs costly retraining on billions of tokens or complex kernels. The field lacks a simple, low‑data, compute‑efficient method that reduces parameters and memory without large accuracy drops.

Main Contribution

Lillama: a one‑shot low‑rank feature distillation algorithm that trains local low‑rank weight factors to match teacher activations.

Practical recipe: SVD initialization + Teacher+Student joint activation loss + local per‑layer optimization for fast convergence.
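The recipe above can be illustrated for a single linear layer. What follows is a minimal NumPy sketch, not the authors' implementation. It assumes the singular values are split evenly between the two factors, and that the joint loss is the sum of a mean absolute error on teacher inputs and on student inputs; the function names `svd_init` and `joint_activation_loss` are hypothetical.

```python
import numpy as np

def svd_init(W, rank):
    """Factor W (d_out x d_in) into A (d_out x rank) and B (rank x d_in).

    A @ B is the best rank-`rank` approximation of W in Frobenius norm.
    Splitting sqrt(S) across both factors is an assumption of this sketch.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:rank])
    A = U[:, :rank] * sqrt_s           # scale each kept column of U
    B = sqrt_s[:, None] * Vt[:rank]    # scale each kept row of V^T
    return A, B

def joint_activation_loss(W, A, B, x_teacher, x_student):
    """Assumed Teacher+Student loss for one layer computing y = x @ W.T.

    x_teacher: inputs produced by the uncompressed model.
    x_student: inputs produced by the already-compressed lower layers.
    Both terms target the teacher's output activations.
    """
    y_teacher = x_teacher @ W.T
    student_on_teacher_input = x_teacher @ B.T @ A.T
    student_on_student_input = x_student @ B.T @ A.T
    loss_teacher = np.abs(y_teacher - student_on_teacher_input).mean()
    loss_student = np.abs(y_teacher - student_on_student_input).mean()
    return loss_teacher + loss_student

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))                 # toy teacher weight
A, B = svd_init(W, rank=16)
x_t = rng.standard_normal((8, 48))                # toy teacher inputs
x_s = x_t + 0.01 * rng.standard_normal((8, 48))   # slightly drifted student inputs
loss = joint_activation_loss(W, A, B, x_t, x_s)
```

In an actual run, `A` and `B` would then be updated by a local optimizer to reduce this loss; the SVD starting point is what the summary credits with shortening that step.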

Key Findings

Large model compression with small calibration data retains most performance.

Numbers: Mixtral‑8x7B: 20% compression → average 96% of base; Phi‑3 14B: 20% compression → 97% of base

Practical Use: You can cut ~20% of parameters on large LLMs and keep >95% of base zero‑shot accuracy on evaluated tasks; use bottom‑first strategy when memory is tight.

Evidence Ref: Table 2, Section 6.2
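To see what a 20% cut means for one matrix, the arithmetic can be sketched as follows. A dense `d_out x d_in` matrix holds `d_out * d_in` parameters, while two rank-`r` factors hold `r * (d_out + d_in)`. The helper name `rank_for_compression` is hypothetical, and applying the same ratio to every matrix is an assumption of this sketch; the paper selects which matrices to compress.

```python
def rank_for_compression(d_out, d_in, compression):
    """Largest rank whose two factors fit in (1 - compression) of the dense size."""
    dense_params = d_out * d_in
    budget = (1.0 - compression) * dense_params
    # Each unit of rank costs one column of A (d_out) plus one row of B (d_in).
    return int(budget // (d_out + d_in))

d_out, d_in = 4096, 4096                            # illustrative square projection
rank = rank_for_compression(d_out, d_in, 0.20)      # rank that removes ~20%
low_rank_params = rank * (d_out + d_in)
kept_fraction = low_rank_params / (d_out * d_in)    # close to 0.80
break_even_rank = (d_out * d_in) // (d_out + d_in)  # above this, factors are larger than W
```

For a square 4096-wide matrix the break-even rank is 2048, so any useful factorization has to sit below half the full rank.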

Very small calibration datasets suffice for practical compression.

Numbers: Phi‑2 3B: 40% reduction using 13M tokens → 1.7B model competes with other 1.5–1.8B models

Practical Use: For small/medium models, try compressing with an order of 10M tokens instead of billions to get useful small models quickly.

Evidence Ref: Abstract, Table 4, Section 6.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 63.59 (avg) | 65.77 (0%) | -2.18 (≈ -3.3%) | lm-evaluation-harness average over 9 tasks | Table 2, Phi-3 14B (0%→20%) | Table 2
Accuracy | 60.19 (avg) | 63.02 (0%) | -2.83 (≈ -4.5%) | lm-evaluation-harness average over 9 tasks | Table 2, Mixtral-8x7B (0%→20%) | Table 2
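The delta and "percent of base" figures follow directly from the two averages in each row. A short check, using only the numbers reported above (the helper name `compare_to_baseline` is illustrative):

```python
def compare_to_baseline(compressed, baseline):
    """Return absolute delta, relative change, and retained share of the baseline."""
    delta = compressed - baseline
    return {
        "delta": round(delta, 2),
        "relative_pct": round(100 * delta / baseline, 1),
        "retained_pct": round(100 * compressed / baseline, 1),
    }

phi3 = compare_to_baseline(63.59, 65.77)     # Phi-3 14B at 20% compression
mixtral = compare_to_baseline(60.19, 63.02)  # Mixtral-8x7B at 20% compression
```

The retained shares come out at about 96.7% and 95.5%, which round to the 97% and 96% quoted under Key Findings.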

What To Try In 7 Days

Run bottom‑first 20% compression on a dev LLM with 10–20M in‑domain calibration tokens.

Use SVD initialization and Teacher+Student loss to speed convergence and reduce distillation steps.

Measure VRAM, tokens/sec, and a small zero‑shot task suite to validate tradeoffs before fine‑tuning further.
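For the measurement step, a small timing harness is often enough to compare a base and a compressed model on equal terms. This is a sketch under stated assumptions: `generate` is a hypothetical callable that returns the list of newly generated tokens, and `fake_generate` is a stand-in so the example runs without a model.

```python
import time

def tokens_per_second(generate, prompt, n_runs=3):
    """Time `generate(prompt)` several times and return the best tokens/sec."""
    best = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(tokens) / elapsed)
    return best

def fake_generate(prompt):
    """Stand-in for a real model call (assumption): sleeps, then returns tokens."""
    time.sleep(0.01)
    return prompt.split() * 4

rate = tokens_per_second(fake_generate, "a short calibration prompt")
```

With a real PyTorch model, peak GPU memory can be read alongside this with `torch.cuda.max_memory_allocated()`, ideally with the same prompts and generation length for both models.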

Optimization Features

Token Efficiency: calibration with ~13M tokens for Phi-2 3B experiments; SVD initialization allowed convergence with ≈8M tokens in some cases
Infra Optimization: enables fitting large models (Mixtral 47B) onto a single A100 after compression
Model Optimization: low-rank decomposition of weight matrices (A, B factors via SVD); local feature distillation per module (match activations)
System Optimization: no custom GPU kernel required; works with standard PyTorch/Hugging Face; bottom‑first strategy avoids loading all weights at once
Training Optimization: SVD initialization to reduce training steps; Teacher+Student joint activation loss; local per‑layer optimizers and gradient updates
Inference Optimization: fewer parameters → lower VRAM; smaller matrices → faster tokens/sec and ability to fit larger contexts
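A back-of-the-envelope estimate shows why a 20% cut can matter for fitting Mixtral on one card. The sketch below counts weights only and rests on assumptions not stated in this summary: 2 bytes per parameter (fp16/bf16) and an 80 GB A100. Activations, KV cache, and framework overhead are ignored.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

BASE_PARAMS = 47e9      # Mixtral-8x7B parameter count from the summary
GPU_MEMORY_GB = 80      # assumption: 80 GB A100 variant

base_gb = weight_memory_gb(BASE_PARAMS)                # about 94 GB
compressed_gb = weight_memory_gb(BASE_PARAMS * 0.80)   # about 75 GB after a 20% cut

fits_before = base_gb <= GPU_MEMORY_GB
fits_after = compressed_gb <= GPU_MEMORY_GB
```

Under these assumptions the uncompressed weights exceed the card while the compressed ones leave a few gigabytes of headroom, so real deployments would still need to budget for context length.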

Risks & Boundaries

Limitations

Compatibility with quantization is untested and may need a dedicated study before combined use (Section 10).

High compression (>30–40%) often requires extra fine‑tuning to recover accuracy.

When Not To Use

When you need provable parity with the base model on specific tasks without any fine‑tuning.

When you plan to quantize immediately and lack a plan for integrating both steps.

Failure Modes

Degraded performance on some reasoning benchmarks (ARC‑C) and commonsense benchmarks (WinoGrande) after compression (Section 6.2).

Generation quality can become more repetitive and verbose at high compression (Appendix A.7).

Core Entities

Models

Mixtral-8x7B-v0.1 47B, Phi-3 14B, Phi-2 3B, Mistral-v0.1 7B, Mamba 3B, Falcon-Mamba 7B, Whisper-medium.en, InkubaLM (422M)

Metrics

Accuracy, perplexity, WER, tokens/sec, VRAM

Datasets

Slim-Orca, Alpaca (calib), RedPajama, Wikitext2, Librispeech, Fleurs, Inkuba-Mono

Benchmarks

TruthfulQA, SocialIQA, LogiQA, WinoGrande, ARC-E, ARC-C, BoolQ, PIQA, OpenBookQA, afrimmlu