Lillama: one‑shot, low‑rank feature distillation to shrink LLMs fast on one A100

Overview

Decision Snapshot: Ready For Pilot

The method is practical: code exists, runs on one A100, and shows consistent memory/speed gains across models; evidence is empirical across several architectures but limited to the reported benchmarks and hardware.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 75%

Novelty: 60%

Authors

Yaya Sy, Christophe Cerisara, Irina Illina

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Lillama lets teams cut model size and GPU memory quickly with only millions of calibration tokens, enabling cheaper deployment and larger context windows without large retraining costs.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Data Scientist

Summary TLDR

Lillama is a fast one‑shot compression method that replaces selected weight matrices with low‑rank factors and locally distills layer activations. Key practical ingredients are SVD initialization, a joint Teacher+Student activation loss, and a bottom‑first layer selection that limits memory. Using ~13M calibration tokens, the authors remove ~10B parameters from Mixtral‑8x7B in minutes on a single A100 while keeping >95% zero‑shot accuracy. The method also applies beyond dense Transformer LLMs: the authors report results on Mixture-of-Experts, Mamba, and Whisper speech models. Code is provided.

Problem Statement

Modern LLM compression often needs costly retraining on billions of tokens or complex kernels. The field lacks a simple, low‑data, compute‑efficient method that reduces parameters and memory without large accuracy drops.

Main Contribution

Lillama: a one‑shot low‑rank feature distillation algorithm that trains local low‑rank weight factors to match teacher activations.

Practical recipe: SVD initialization + Teacher+Student joint activation loss + local per‑layer optimization for fast convergence.
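The SVD-initialization ingredient can be illustrated with a minimal NumPy sketch. It assumes a single dense weight matrix `W` and splits the singular values evenly between the two factors; the function name `svd_low_rank_init`, the matrix sizes, and the even split are illustrative assumptions, not details taken from the Lillama code.

```python
import numpy as np

def svd_low_rank_init(W, rank):
    """Return A (m x r) and B (r x n) so that A @ B is the best rank-r
    approximation of W in Frobenius norm (truncated SVD)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:rank])
    A = U[:, :rank] * sqrt_s           # scale the kept left singular vectors
    B = sqrt_s[:, None] * Vt[:rank]    # scale the kept right singular vectors
    return A, B

# Hypothetical 64 x 48 weight matrix, compressed to rank 16.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
A, B = svd_low_rank_init(W, rank=16)

# The approximation error equals the norm of the discarded singular values.
err = np.linalg.norm(W - A @ B)
S_all = np.linalg.svd(W, compute_uv=False)
expected_err = np.sqrt(np.sum(S_all[16:] ** 2))

params_before = W.size             # 64 * 48
params_after = A.size + B.size     # 64 * 16 + 16 * 48
```

In this reading, the factors start from the closest low-rank point to the teacher weights, and the local distillation step then only has to correct the remaining activation mismatch.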

Key Findings

Large model compression with small calibration data retains most performance.

Numbers: Mixtral‑8x7B at 20% compression keeps on average about 96% of base accuracy; Phi‑3 14B at 20% compression keeps about 97%.

Practical Use: You can cut ~20% of parameters on large LLMs and keep >95% zero‑shot accuracy on evaluated tasks; use the bottom‑first strategy when memory is tight.

Evidence Ref: Table 2, Section 6.2

Very small calibration datasets suffice for practical compression.

Numbers: Phi‑2 3B, 40% reduction using 13M tokens → the resulting 1.7B model competes with other 1.5–1.8B models.

Practical Use: For small/medium models, try compressing with on the order of 10M tokens instead of billions to get useful small models quickly.

Evidence Ref: Abstract, Table 4, Section 6.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	63.59 (avg)	65.77 (0%)	-2.18 (≈ -3.3%)	lm-evaluation-harness average over 9 tasks	Table 2 Phi-3 14B (0%→20%)	Table 2
Accuracy	60.19 (avg)	63.02 (0%)	-2.83 (≈ -4.5%)	lm-evaluation-harness average over 9 tasks	Table 2 Mixtral-8x7B (0%→20%)	Table 2
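The deltas in the table can be re-derived from the reported averages. The short check below uses only the four accuracy values from the rows above; the dictionary layout and the helper name `summarize` are illustrative.

```python
# (baseline average, compressed average) at 0% and 20% compression, from Table 2.
rows = {
    "Phi-3 14B": (65.77, 63.59),
    "Mixtral-8x7B": (63.02, 60.19),
}

def summarize(base, compressed):
    """Return (absolute delta, relative delta in %, retained accuracy in %)."""
    delta = compressed - base
    return (
        round(delta, 2),
        round(100 * delta / base, 1),
        round(100 * compressed / base, 1),
    )

summary = {name: summarize(*values) for name, values in rows.items()}
for name, (delta, rel, retained) in summary.items():
    print(f"{name}: delta={delta}, relative={rel}%, retained={retained}%")
```

The retained-accuracy column is where the rounded 97% and 96% figures in Key Findings come from.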

What To Try In 7 Days

Run bottom‑first 20% compression on a dev LLM with 10–20M in‑domain calibration tokens.

Use SVD initialization and Teacher+Student loss to speed convergence and reduce distillation steps.

Measure VRAM, tokens/sec, and a small zero‑shot task suite to validate tradeoffs before fine‑tuning further.
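For the measurement step, a small timing harness is usually enough to compare the base and compressed models. The sketch below is framework-agnostic: `tokens_per_second` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that should be replaced by a call into the real model.

```python
import time

def tokens_per_second(generate, n_tokens, warmup=1, repeats=3):
    """Time `generate(n_tokens)` and return the average throughput."""
    for _ in range(warmup):
        generate(n_tokens)              # warm-up runs are not timed
    start = time.perf_counter()
    for _ in range(repeats):
        generate(n_tokens)
    elapsed = time.perf_counter() - start
    return repeats * n_tokens / elapsed

def fake_generate(n_tokens):
    # Placeholder workload; swap in the model's generation call.
    return [sum(range(200)) for _ in range(n_tokens)]

rate = tokens_per_second(fake_generate, n_tokens=256)
print(f"{rate:.0f} tokens/sec")
```

If the model runs under PyTorch on a GPU, peak VRAM for the same run can be read with `torch.cuda.max_memory_allocated()`; run the harness once on the base model and once on the compressed one, with identical prompts and generation lengths.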

Optimization Features

Token Efficiency

calibration with ~13M tokens for Phi-2 3B experiments

SVD initialization allowed convergence with ≈8M tokens in some cases

Infra Optimization

enables fitting large models (Mixtral 47B) onto a single A100 after compression
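A back-of-the-envelope estimate shows why removing about 20% of the weights matters here. The sketch assumes 16-bit weights (2 bytes per parameter) and the 80 GB A100 variant, and it counts weights only, ignoring activations and the KV cache; these are assumptions for illustration, not figures from the paper.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

base_params = 47e9          # Mixtral-8x7B parameter count as listed above
removed_fraction = 0.20     # 20% compression, as in the reported results
a100_gb = 80                # assumed 80 GB A100

base_gb = weight_memory_gb(base_params)
compressed_gb = weight_memory_gb(base_params * (1 - removed_fraction))

print(f"base: {base_gb:.1f} GB, compressed: {compressed_gb:.1f} GB")
print("base fits:", base_gb <= a100_gb, "| compressed fits:", compressed_gb <= a100_gb)
```

Under these assumptions the uncompressed weights exceed the card's memory while the compressed ones leave a few GB of headroom, which is a tight margin for real workloads.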

Model Optimization

low-rank decomposition of weight matrices (A,B factors via SVD)

local feature distillation per module (match activations)
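Replacing an m × n matrix with factors A (m × r) and B (r × n) stores r(m + n) values instead of mn, so it only saves parameters when r < mn / (m + n). The sketch below picks the largest rank that meets a per-matrix reduction target; the function name and the 4096 × 14336 shape are illustrative assumptions, and the source does not say how Lillama assigns ranks.

```python
def rank_for_reduction(m, n, reduction):
    """Largest rank r such that r * (m + n) <= (1 - reduction) * m * n."""
    budget = (1 - reduction) * m * n
    return int(budget // (m + n))

m, n = 4096, 14336                  # hypothetical MLP projection shape
r = rank_for_reduction(m, n, 0.20)

full_params = m * n
low_rank_params = r * (m + n)
achieved = 1 - low_rank_params / full_params
print(r, f"{achieved:.3f}")
```

The break-even rank for this shape is about 3185, so a 20% target leaves the rank fairly high; deeper compression lowers it quickly.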

System Optimization

no custom GPU kernel required; works with standard PyTorch/Huggingface

bottom‑first strategy avoids loading all weights at once
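One way to read the bottom-first strategy is as a layer-selection plan: compress layers starting from the bottom of the stack and stop once the overall parameter target is reached, so the layers above never need to be loaded for distillation. The sketch below only plans which layers to touch; the function name, the uniform per-layer reduction, and the layer sizes are assumptions for illustration and may differ from the authors' implementation.

```python
def bottom_first_plan(layer_params, target_fraction, per_layer_reduction):
    """Select layers from index 0 upward until the removed parameters
    reach target_fraction of the whole model."""
    total = sum(layer_params)
    goal = target_fraction * total
    removed = 0.0
    chosen = []
    for idx, n_params in enumerate(layer_params):
        if removed >= goal:
            break                        # target met; upper layers stay untouched
        chosen.append(idx)
        removed += per_layer_reduction * n_params
    return chosen, removed / total

# Hypothetical 32-layer model with equal-sized layers.
layers = [100] * 32
chosen, achieved = bottom_first_plan(layers, target_fraction=0.25,
                                     per_layer_reduction=0.5)
print(len(chosen), achieved)
```

With these toy numbers, halving the bottom 16 layers reaches a 25% overall reduction and leaves the top 16 layers unmodified.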

Training Optimization

SVD initialization to reduce training steps

Teacher+Student joint activation loss

local per‑layer optimizers and gradient updates
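A plausible reading of the Teacher+Student joint activation loss is sketched below for a single linear module: the low-rank student is asked to reproduce the teacher's output both when fed the teacher's input activations and when fed its own, possibly drifted, input activations. The mean-squared error, the unweighted sum of the two terms, and every name here are assumptions; the paper and repository define the exact form.

```python
import numpy as np

def joint_activation_loss(W, A, B, x_teacher, x_student):
    """Sum of two local distillation terms for one module (assumed MSE)."""
    y_target = x_teacher @ W                 # teacher output on teacher input
    y_from_teacher = x_teacher @ A @ B       # student module, teacher input
    y_from_student = x_student @ A @ B       # student module, student input
    loss_teacher = np.mean((y_from_teacher - y_target) ** 2)
    loss_student = np.mean((y_from_student - y_target) ** 2)
    return loss_teacher + loss_student

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 4))
B = rng.standard_normal((4, 24))
W = A @ B                                    # toy teacher weight, exactly rank 4
x_t = rng.standard_normal((8, 32))
x_s = x_t + 0.1 * rng.standard_normal((8, 32))   # drift from earlier compressed layers

loss_matched = joint_activation_loss(W, A, B, x_t, x_t)
loss_drifted = joint_activation_loss(W, A, B, x_t, x_s)
```

In this toy setup the loss is numerically zero when the student sees the teacher's inputs and the factors reproduce `W`, and it becomes positive once the student's inputs drift; the second term is what lets later layers compensate for errors introduced below them.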

Inference Optimization

fewer parameters → lower VRAM

smaller matrices → faster tokens/sec and ability to fit larger contexts

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/yaya-sy/lillama

Data URLs

https://huggingface.co/datasets/Open-Orca/SlimOrca

https://huggingface.co/microsoft/phi-2

https://huggingface.co/microsoft/Phi-3-medium-4k-instruct

Risks & Boundaries

Limitations

Compatibility with quantization is untested and may need a dedicated study before combined use (Section 10).

High compression (>30–40%) often requires extra fine‑tuning to recover accuracy.

When Not To Use

When you need provable parity with the base model on specific tasks without any fine‑tuning.

When you plan to quantize immediately and lack a plan for integrating both steps.

Failure Modes

Performance degrades after compression on some reasoning (ARC‑C) and commonsense (WinoGrande) benchmarks (Section 6.2).

Generation quality can become more repetitive and verbose at high compression (Appendix A.7).

Core Entities

Models

Mixtral-8x7B-v0.1 47B, Phi-3 14B, Phi-2 3B, Mistral-v0.1 7B, Mamba 3B, Falcon-Mamba 7B, Whisper-medium.en, InkubaLM (422M)

Metrics

Accuracy, perplexity, WER, tokens/sec, VRAM

Datasets

Slim-Orca, Alpaca (calib), RedPajama, Wikitext2, Librispeech, Fleurs, Inkuba-Mono

Benchmarks

TruthfulQA, SocialIQA, LogiQA, WinoGrande, ARC-E, ARC-C, BoolQ, PIQA, OpenBookQA, afrimmlu
