Overview
Production Readiness
0.75
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
Lillama lets teams cut model size and GPU memory quickly with only millions of calibration tokens, enabling cheaper deployment and larger context windows without large retraining costs.
Summary TLDR
Lillama is a fast one‑shot compression method that replaces selected weight matrices with low‑rank factors and locally distills layer activations. Key practical ingredients are SVD initialization, a joint Teacher+Student activation loss, and a bottom‑first layer selection strategy that limits memory use. Using ~13M calibration tokens, the authors remove ~10B parameters from Mixtral‑8x7B in minutes on a single A100 while retaining more than 95% of the original zero‑shot accuracy. The method also applies beyond dense Transformer LLMs, to Mixture‑of‑Experts, Mamba, and Whisper speech models. Code is provided.
Problem Statement
Modern LLM compression often needs costly retraining on billions of tokens or complex kernels. The field lacks a simple, low‑data, compute‑efficient method that reduces parameters and memory without large accuracy drops.
Main Contribution
Lillama: a one‑shot low‑rank feature distillation algorithm that trains local low‑rank weight factors to match teacher activations.
Practical recipe: SVD initialization + Teacher+Student joint activation loss + local per‑layer optimization for fast convergence.
Memory‑aware layer selection (bottom‑first, top‑first, uniform) to limit GPU footprint and scale to very large models.
Demonstrated on Transformers, MoE, Mamba, and Whisper; open‑source code released.
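A minimal sketch of the SVD initialization step, assuming NumPy and a random stand-in weight matrix. The helper name `svd_init`, the shapes, and the choice to split the singular values evenly between the two factors are illustrative assumptions, not taken from the Lillama code.

```python
import numpy as np

def svd_init(W, rank):
    """Split W (d_out x d_in) into B (d_out x rank) and A (rank x d_in), B @ A ~= W.

    Hypothetical helper: the square-root split of singular values is an assumption.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:rank])
    B = U[:, :rank] * sqrt_s           # scale the kept left singular vectors
    A = sqrt_s[:, None] * Vt[:rank]    # scale the kept right singular vectors
    return A, B, S

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))      # stand-in for one weight matrix
A, B, S = svd_init(W, rank=8)

# Truncated SVD is the best rank-8 approximation in Frobenius norm, so the
# reconstruction error equals the norm of the discarded singular values.
err = np.linalg.norm(W - B @ A)
tail = np.sqrt(np.sum(S[8:] ** 2))
```

The two factors hold `A.size + B.size` parameters instead of `W.size`, and local distillation then starts from this approximation rather than from random factors.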
Key Findings
Large model compression with small calibration data retains most performance.
Very small calibration datasets suffice for practical compression.
SVD initialization and a joint Teacher+Student loss speed convergence.
Compression reduces memory and often improves throughput.
High compression needs fine‑tuning to recover capability.
Results
Accuracy
Small model created (Phi-2 3B → 1.7B at 40% reduction)
VRAM and throughput improvement (Phi-2 3B, s=512)
Who Should Care
What To Try In 7 Days
Run bottom‑first 20% compression on a dev LLM with 10–20M in‑domain calibration tokens.
Use SVD initialization and Teacher+Student loss to speed convergence and reduce distillation steps.
Measure VRAM, tokens/sec, and a small zero‑shot task suite to validate tradeoffs before fine‑tuning further.
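To plan a first experiment, it helps to know which rank corresponds to a given parameter reduction. The sketch below works through that arithmetic for a single matrix. The function names and the 4096 x 4096 shape are hypothetical, and the summary's compression percentages apply to the whole model, so per-matrix ranks in practice depend on which layers are selected.

```python
def low_rank_params(d_out, d_in, rank):
    """Parameters in the two factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def rank_for_reduction(d_out, d_in, reduction):
    """Largest rank whose factors fit in (1 - reduction) of the dense budget."""
    full = d_out * d_in
    budget = (1.0 - reduction) * full
    return int(budget // (d_out + d_in))

# Hypothetical 4096 x 4096 projection, 20% fewer parameters in this matrix.
r = rank_for_reduction(4096, 4096, 0.20)
saved = 4096 * 4096 - low_rank_params(4096, 4096, r)
```

For a square matrix the break-even rank is half the dimension: any rank above it stores more parameters than the dense matrix.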
Optimization Features
Token Efficiency
- calibration with ~13M tokens for Phi-2 3B experiments
- SVD initialization allowed convergence with ≈8M tokens in some cases
Infra Optimization
- lets a large model (Mixtral‑8x7B, 47B parameters) fit on a single A100 after compression
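A back-of-the-envelope check of the single-GPU claim, under stated assumptions: 16-bit weights (2 bytes per parameter), the 80 GB A100 variant, and weights only, ignoring activations and the KV cache. The parameter counts are the rounded figures quoted in this summary.

```python
def weights_gb(n_params, bytes_per_param=2):
    """Approximate weight memory in GB (10**9 bytes), weights only."""
    return n_params * bytes_per_param / 1e9

A100_GB = 80                              # assumption: 80 GB A100 variant

before = weights_gb(47e9)                 # Mixtral-8x7B, ~47B parameters
after = weights_gb(47e9 - 10e9)           # after removing ~10B parameters

fits_before = before <= A100_GB
fits_after = after <= A100_GB
```

Under these assumptions the uncompressed weights need about 94 GB and the compressed weights about 74 GB, which leaves little headroom for activations and long contexts.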
Model Optimization
- low-rank decomposition of weight matrices (A, B factors initialized via SVD)
- local feature distillation per module (match activations)
System Optimization
- no custom GPU kernel required; works with standard PyTorch and Hugging Face tooling
- bottom‑first strategy avoids loading all weights at once
Training Optimization
- SVD initialization to reduce training steps
- Teacher+Student joint activation loss
- local per‑layer optimizers and gradient updates
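The training ingredients above can be sketched for a single linear layer in NumPy. This is one plausible reading of the Teacher+Student loss, in which the low-rank student must reproduce the teacher's output both from the teacher's input and from its own drifted input. The exact formulation in the paper may differ, and all names, shapes, the noise model, and the plain gradient-descent optimizer are assumptions.

```python
import numpy as np

def joint_loss_and_grads(A, B, X_t, X_s, Y):
    """MSE to teacher output Y from teacher and student inputs, with gradients."""
    loss, gA, gB = 0.0, np.zeros_like(A), np.zeros_like(B)
    for X in (X_t, X_s):
        H = X @ A.T                      # (n, r) low-rank hidden features
        R = H @ B.T - Y                  # residual against teacher activations
        loss += np.mean(R ** 2)
        gB += 2.0 * R.T @ H / R.size
        gA += 2.0 * (R @ B).T @ X / R.size
    return loss, gA, gB

rng = np.random.default_rng(0)
n, d_in, d_out, r = 256, 16, 16, 4
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen teacher weight
X_t = rng.standard_normal((n, d_in))                    # teacher-side inputs
X_s = X_t + 0.1 * rng.standard_normal((n, d_in))        # stand-in for drifted student inputs
Y = X_t @ W.T                                           # teacher activations to match

# SVD initialization of the factors, B @ A ~= W.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
B = U[:, :r] * np.sqrt(S[:r])
A = np.sqrt(S[:r])[:, None] * Vt[:r]

initial_loss, _, _ = joint_loss_and_grads(A, B, X_t, X_s, Y)
lr = 0.05
for _ in range(200):                                    # local updates for this layer only
    _, gA, gB = joint_loss_and_grads(A, B, X_t, X_s, Y)
    A -= lr * gA
    B -= lr * gB
final_loss, _, _ = joint_loss_and_grads(A, B, X_t, X_s, Y)
```

Because the loss involves only this layer's inputs and outputs, no gradient flows through the rest of the network, which is what keeps the per-layer updates cheap.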
Inference Optimization
- fewer parameters → lower VRAM
- smaller matrices → faster tokens/sec and ability to fit larger contexts
Reproducibility
Code URLs
Data URLs
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Compatibility with quantization is untested and may need a dedicated study before combined use (Section 10).
- High compression (>30–40%) often requires extra fine‑tuning to recover accuracy.
- Bottom‑first rank choices can trade memory against recoverable knowledge; small ranks may destroy features.
- Evaluations focus on zero‑shot task suites and specific hardware (A100); other tasks or chips may differ.
When Not To Use
- When you need provable parity with the base model on specific tasks without any fine‑tuning.
- When you plan to quantize immediately and lack a plan for integrating both steps.
- When calibration data matching the deployment domain is unavailable.
Failure Modes
- Degraded performance on some reasoning (ARC‑C) and commonsense (WinoGrande) benchmarks after compression (Section 6.2).
- Generation quality can become more repetitive and verbose at high compression (Appendix A.7).
- Too aggressive bottom‑layer compression (small k) can remove core knowledge and hurt recovery.
Core Entities
Models
- Mixtral-8x7B-v0.1 47B
- Phi-3 14B
- Phi-2 3B
- Mistral-v0.1 7B
- Mamba 3B
- Falcon-Mamba 7B
- Whisper-medium.en
- InkubaLM (422M)
Metrics
- Accuracy
- Perplexity
- WER
- tokens/sec
- VRAM
Datasets
- Slim-Orca
- Alpaca (calibration)
- RedPajama
- Wikitext2
- Librispeech
- Fleurs
- Inkuba-Mono
Benchmarks
- TruthfulQA
- SocialIQA
- LogiQA
- WinoGrande
- ARC-E
- ARC-C
- BoolQ
- PIQA
- OpenBookQA
- AfriMMLU

