Lillama: one‑shot, low‑rank feature distillation to shrink LLMs fast on one A100

December 21, 2024 · 7 min

Overview

Production Readiness

0.75

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Yaya Sy, Christophe Cerisara, Irina Illina

Why It Matters For Business

Lillama lets teams cut model size and GPU memory quickly with only about 13M calibration tokens, which enables cheaper deployment and larger context windows without large retraining costs.

Summary TLDR

Lillama is a fast one‑shot compression method that replaces selected weight matrices with low‑rank factors and distills each layer's activations locally. Its key practical ingredients are SVD initialization, a joint Teacher+Student activation loss, and a bottom‑first layer selection that limits memory use. Using ~13M calibration tokens, the authors remove ~10B parameters from Mixtral‑8x7B in minutes on a single A100 while keeping >95% of its zero‑shot accuracy. The method also applies beyond dense Transformer LLMs, to Mixture‑of‑Experts, Mamba, and Whisper speech models. Code is provided.

Problem Statement

Modern LLM compression often requires costly retraining on billions of tokens or custom GPU kernels. The field lacks a simple, low‑data, compute‑efficient method that reduces parameters and memory without large accuracy drops.

Main Contribution

Lillama: a one‑shot low‑rank feature distillation algorithm that trains local low‑rank weight factors to match teacher activations.

Practical recipe: SVD initialization + Teacher+Student joint activation loss + local per‑layer optimization for fast convergence.

Memory‑aware layer selection (bottom‑first, top‑first, uniform) to limit GPU footprint and scale to very large models.

Demonstrated on Transformers, MoE, Mamba, and Whisper; open‑source code released.
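The SVD initialization named in the recipe above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `svd_init`, the matrix sizes, and the choice to split the singular values evenly between the two factors are all assumptions made for the example.

```python
import numpy as np

def svd_init(W, k):
    """Return A (d_out x k) and B (k x d_in) so that A @ B is the best rank-k
    approximation of W in Frobenius norm (truncated SVD)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:k])
    A = U[:, :k] * sqrt_s            # scale each kept column of U
    B = sqrt_s[:, None] * Vt[:k]     # scale each kept row of V^T
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))    # stand-in for a dense weight matrix
A, B = svd_init(W, k=8)

rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(A.shape, B.shape, f"relative error {rel_err:.3f}")
print("dense params:", W.size, "low-rank params:", A.size + B.size)
```

Starting from the truncated SVD means the student begins at the closest rank-k point to the teacher's weights, which is presumably why the summary reports fewer distillation tokens with this initialization. A random Gaussian matrix, as used here, is close to a worst case for low-rank approximation, so the printed error is only illustrative.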

Key Findings

Large model compression with small calibration data retains most performance.

Numbers: Mixtral‑8x7B at 20% compression retains on average 96% of base performance; Phi‑3 14B at 20% retains 97%

Very small calibration datasets suffice for practical compression.

Numbers: Phi‑2 3B: 40% reduction using 13M tokens → 1.7B model competes with other 1.5–1.8B models

SVD initialization and a joint Teacher+Student loss speed convergence.

Numbers: with SVD init, 8M tokens suffice for fast convergence; the Teacher+Student (Tea+Stu) loss reaches an average score of 59.11 vs 57.51 for the Teacher loss (Table 8)

Compression reduces memory and often improves throughput.

Numbers: Phi‑2 3B VRAM 6.8 → 5.7 GB and tokens/s 29.9k → 32.4k (s=512); Mixtral fits on a single A100 after 20% compression

High compression needs fine‑tuning to recover capability.

Numbers: Mistral 7B at 40% compression kept 91% of base performance after 191M tokens of fine‑tuning (Table 5)

Results

Accuracy

Value: 63.59 (avg)

Baseline: 65.77 (0% compression)

Accuracy

Value: 60.19 (avg)

Baseline: 63.02 (0% compression)

Small model created (Phi-2 3B → 1.7B at 40% reduction)

Value: average 52.82

Baseline: Phi-2 3B base, average 61.38

VRAM and throughput improvement (Phi-2 3B, s=512)

Value: VRAM 6.8 GB → 5.7 GB; tokens/s 29.9k → 32.4k

Baseline: uncompressed Phi-2 3B
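To put the figures above on a common scale, the short calculation below converts each value and baseline pair into a retained percentage or a relative change. It only does arithmetic on the numbers listed in this section; the dictionary labels are placeholders, since the two accuracy entries are not tied to a named model here.

```python
def retained_pct(value, baseline):
    """Compressed score as a percentage of the uncompressed baseline."""
    return 100.0 * value / baseline

def relative_change_pct(before, after):
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before

# (compressed average, baseline average) pairs copied from the Results entries
accuracy_pairs = {
    "first accuracy entry": (63.59, 65.77),
    "second accuracy entry": (60.19, 63.02),
    "Phi-2 3B -> 1.7B (40% reduction)": (52.82, 61.38),
}

for label, (value, baseline) in accuracy_pairs.items():
    print(f"{label}: {retained_pct(value, baseline):.1f}% of baseline")

vram_change = relative_change_pct(6.8, 5.7)         # GB, Phi-2 3B
throughput_change = relative_change_pct(29.9, 32.4)  # thousand tokens/s
print(f"VRAM change: {vram_change:.1f}%")
print(f"Throughput change: {throughput_change:.1f}%")
```

Read this way, the 40% reduction of Phi-2 keeps noticeably less of the baseline than the first two entries, which is consistent with the later note that high compression needs fine-tuning.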

What To Try In 7 Days

Run bottom‑first 20% compression on a dev LLM with 10–20M in‑domain calibration tokens.

Use SVD initialization and Teacher+Student loss to speed convergence and reduce distillation steps.

Measure VRAM, tokens/sec, and a small zero‑shot task suite to validate tradeoffs before fine‑tuning further.
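For the measurement step, a small timing harness is enough to compare a base and a compressed model under the same conditions. The sketch below uses only the Python standard library; `run_batch` is a hypothetical callable that would wrap one forward or generation pass, and the dummy workload stands in for a real model.

```python
import time

def tokens_per_second(run_batch, tokens_per_batch, n_batches=5):
    """Time `n_batches` calls of `run_batch` and return processed tokens per second."""
    start = time.perf_counter()
    for _ in range(n_batches):
        run_batch()
    elapsed = time.perf_counter() - start
    return n_batches * tokens_per_batch / elapsed

def dummy_batch():
    # Placeholder workload; replace with a real model call on a fixed batch.
    sum(i * i for i in range(10_000))

rate = tokens_per_second(dummy_batch, tokens_per_batch=512)
print(f"{rate:,.0f} tokens/s")
```

In a PyTorch setup, peak GPU memory for the same run can be read with `torch.cuda.max_memory_allocated()` after calling `torch.cuda.reset_peak_memory_stats()`. Running a few warm-up batches before timing usually gives steadier numbers.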

Optimization Features

Token Efficiency

  • calibration with ~13M tokens for Phi-2 3B experiments
  • SVD initialization allowed convergence with ≈8M tokens in some cases

Infra Optimization

  • enables fitting large models (Mixtral 47B) onto a single A100 after compression

Model Optimization

  • low-rank decomposition of weight matrices (factors A and B, initialized via SVD)
  • local feature distillation per module (match activations)
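The link between rank and parameter savings for a single matrix is simple arithmetic: a dense d_out × d_in matrix has d_out·d_in parameters, while rank-k factors have k·(d_out + d_in). The sketch below computes the largest rank that meets a target reduction. The 4096 × 4096 shape is an illustrative assumption, and whole-model reductions also depend on which matrices are selected.

```python
def low_rank_params(d_out, d_in, k):
    """Parameter count of factors A (d_out x k) and B (k x d_in)."""
    return k * (d_out + d_in)

def max_rank_for_reduction(d_out, d_in, reduction):
    """Largest rank whose factors use at most (1 - reduction) of the dense parameters."""
    budget = (1.0 - reduction) * d_out * d_in
    return int(budget // (d_out + d_in))

d_out, d_in = 4096, 4096                        # illustrative matrix shape
dense = d_out * d_in
break_even = max_rank_for_reduction(d_out, d_in, 0.0)
k20 = max_rank_for_reduction(d_out, d_in, 0.20)

print("break-even rank:", break_even)
print("rank for a 20% reduction:", k20)
print("kept fraction:", low_rank_params(d_out, d_in, k20) / dense)
```

Any rank above the break-even value makes the factorized layer larger than the dense one, so a low-rank replacement only saves memory below that rank.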

System Optimization

  • no custom GPU kernel required; works with standard PyTorch and Hugging Face tooling
  • bottom‑first strategy avoids loading all weights at once

Training Optimization

  • SVD initialization to reduce training steps
  • Teacher+Student joint activation loss
  • local per‑layer optimizers and gradient updates
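One plausible reading of the Teacher+Student joint activation loss is sketched below for a single linear layer: the low-rank student is scored against the teacher's output twice, once when fed the teacher's input activations and once when fed the activations produced by the already compressed layers before it. The exact formulation in the paper may differ; the L1 distance, the function names, and the simulated drift in `X_student` are assumptions for illustration, and the block computes the loss only, without an optimizer.

```python
import numpy as np

def student_forward(X, A, B):
    """Low-rank stand-in for the dense map X @ W.T, with W approximated by A @ B."""
    return (X @ B.T) @ A.T

def joint_activation_loss(Y_teacher, X_teacher, X_student, A, B):
    """Sum of two L1 activation errors measured against the same teacher target."""
    from_teacher_input = np.abs(Y_teacher - student_forward(X_teacher, A, B)).mean()
    from_student_input = np.abs(Y_teacher - student_forward(X_student, A, B)).mean()
    return from_teacher_input + from_student_input

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 16))                 # dense teacher weight (d_out x d_in)
X_teacher = rng.standard_normal((10, 16))         # inputs from the uncompressed model
X_student = X_teacher + 0.05 * rng.standard_normal((10, 16))  # simulated drifted inputs
Y_teacher = X_teacher @ W.T                       # target activations

U, S, Vt = np.linalg.svd(W, full_matrices=False)

def factors(k):
    """Rank-k factors from the truncated SVD of W."""
    return U[:, :k] * S[:k], Vt[:k]

loss_rank4 = joint_activation_loss(Y_teacher, X_teacher, X_student, *factors(4))
loss_rank16 = joint_activation_loss(Y_teacher, X_teacher, X_student, *factors(16))
print(f"rank 4: {loss_rank4:.3f}   rank 16: {loss_rank16:.3f}")
```

At full rank the teacher-input term vanishes and only the drift term remains, which shows what the second term adds: it penalizes errors that accumulate from earlier compressed layers, not just the error of the current layer in isolation.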

Inference Optimization

  • fewer parameters → lower VRAM
  • smaller matrices → higher tokens/sec and room to fit larger contexts

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Compatibility with quantization is untested and may need a dedicated study before combined use (Section 10).
  • High compression (above roughly 30–40%) often requires extra fine‑tuning to recover accuracy.
  • Bottom‑first rank choices can trade memory against recoverable knowledge; small ranks may destroy features.
  • Evaluations focus on zero‑shot task suites and specific hardware (A100); other tasks or chips may differ.

When Not To Use

  • When you need provable parity with the base model on specific tasks without any fine‑tuning.
  • When you plan to quantize immediately and lack a plan for integrating both steps.
  • When calibration data matching the deployment domain is unavailable.

Failure Modes

  • Degraded performance on some reasoning (ARC‑C) and commonsense (WinoGrande) benchmarks after compression (Section 6.2).
  • Generation quality can become more repetitive and verbose at high compression (Appendix A.7).
  • Overly aggressive bottom‑layer compression (small rank k) can remove core knowledge and hurt recovery.

Core Entities

Models

  • Mixtral-8x7B-v0.1 47B
  • Phi-3 14B
  • Phi-2 3B
  • Mistral-v0.1 7B
  • Mamba 3B
  • Falcon-Mamba 7B
  • Whisper-medium.en
  • InkubaLM (422M)

Metrics

  • Accuracy
  • perplexity
  • WER
  • tokens/sec
  • VRAM

Datasets

  • Slim-Orca
  • Alpaca (calibration)
  • RedPajama
  • Wikitext2
  • Librispeech
  • Fleurs
  • Inkuba-Mono

Benchmarks

  • TruthfulQA
  • SocialIQA
  • LogiQA
  • WinoGrande
  • ARC-E
  • ARC-C
  • BoolQ
  • PIQA
  • OpenBookQA
  • AfriMMLU