Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
FlexiGPT cuts stored parameters by roughly 30% while retaining far more task accuracy than naive pruning, which makes on-device deployment practical where storage is the binding constraint. Expect slightly higher inference cost and a short recovery fine-tune.
Summary TLDR
FlexiGPT prunes whole transformer blocks and replaces each one with a similar unpruned block (shared weights) plus small low-rank adapters (LoRA) initialized from an SVD of the weight difference. At common pruning rates (30–40%) it keeps much of the original accuracy while cutting stored parameters by about 30%. The method needs a short recovery fine-tune (1B tokens) and slightly increases inference compute. It also supports cheaply extending small models by repeating blocks with unique adapters and a small amount of continued pretraining (~10B tokens, about 0.3% of the original pretraining tokens).
Problem Statement
Large language models are too large for many devices. Existing pruning methods often degrade accuracy and do little to recover the lost capacity. The goal is a pruning approach that reduces parameter storage while restoring accuracy with minimal extra parameters and modest fine-tuning.
Main Contribution
A block-level pruning pipeline that selects blocks with a Block Influence score and replaces them using weight sharing plus LoRA adapters.
A low-rank SVD-based metric to choose which unpruned block to share as a replacement.
Adapter initialization using the low-rank SVD difference and output feature normalization to stabilize recovery.
A method to extend small models by repeating blocks with unique adapters and normalization parameters.
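The base-selection and adapter-initialization steps listed above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: it treats a block as a single weight matrix, scores each candidate base by the energy a rank-r adapter could not capture, and splits the truncated SVD evenly between the two LoRA factors. The function names (`low_rank_residual`, `pick_base`, `svd_lora_init`) and the toy matrices are hypothetical.

```python
import numpy as np

def low_rank_residual(w_pruned, w_base, rank):
    """Frobenius norm of the part of (w_pruned - w_base) a rank-`rank` adapter cannot capture."""
    s = np.linalg.svd(w_pruned - w_base, compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))

def pick_base(w_pruned, candidates, rank):
    """Index of the candidate whose difference from w_pruned is best approximated at this rank."""
    return int(np.argmin([low_rank_residual(w_pruned, c, rank) for c in candidates]))

def svd_lora_init(w_pruned, w_base, rank):
    """Return (B, A) such that B @ A is the best rank-`rank` approximation of the difference."""
    u, s, vt = np.linalg.svd(w_pruned - w_base, full_matrices=False)
    root = np.sqrt(s[:rank])
    b = u[:, :rank] * root           # shape (out_dim, rank)
    a = root[:, None] * vt[:rank]    # shape (rank, in_dim)
    return b, a

# Toy data (assumption): candidate 0 differs from the pruned weight by a rank-2 matrix,
# candidate 1 is an unrelated random matrix.
rng = np.random.default_rng(0)
w_pruned = rng.normal(size=(16, 16))
low_rank_shift = rng.normal(size=(16, 2)) @ rng.normal(size=(2, 16))
candidates = [w_pruned + low_rank_shift, rng.normal(size=(16, 16))]

best = pick_base(w_pruned, candidates, rank=2)
lora_b, lora_a = svd_lora_init(w_pruned, candidates[best], rank=2)
reconstructed = candidates[best] + lora_b @ lora_a
```

With this initialization the shared block plus adapter starts close to the pruned block's original weights, which is one plausible reading of why the SVD init helps recovery start from a low perplexity. Output feature normalization is not modeled here.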
Key Findings
FlexiGPT gives much lower perplexity than ShortGPT after pruning.
FlexiGPT outperforms the compared pruning baselines on common zero-shot tasks at 30–40% compression.
Pruned models save stored parameters but slightly increase inference compute.
Output feature normalization and SVD-based LoRA init are essential for stable recovery.
Tiny models can be extended cheaply with repeated blocks and adapters.
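The storage-versus-compute finding above can be made concrete with simple parameter accounting. The sketch below is illustrative only: the layer shapes, block count, and adapter rank are hypothetical toy values, not figures from the paper, and embeddings and normalization parameters are ignored.

```python
def dense_params(shapes):
    """Parameters in the dense weight matrices of one block."""
    return sum(out_dim * in_dim for out_dim, in_dim in shapes)

def lora_params(shapes, rank):
    """Parameters in one rank-`rank` adapter pair (B, A) per weight matrix."""
    return sum(rank * (out_dim + in_dim) for out_dim, in_dim in shapes)

def stored_fraction(n_blocks, n_pruned, shapes, rank):
    """Stored parameters after pruning, relative to the original model.

    Each pruned block keeps only its adapters; its dense weights are
    borrowed from an unpruned block and therefore stored once.
    """
    original = n_blocks * dense_params(shapes)
    kept = (n_blocks - n_pruned) * dense_params(shapes) + n_pruned * lora_params(shapes, rank)
    return kept / original

# Hypothetical toy block: four 64x64 attention projections and a 4x MLP.
toy_shapes = [(64, 64)] * 4 + [(256, 64), (64, 256)]
fraction = stored_fraction(n_blocks=10, n_pruned=3, shapes=toy_shapes, rank=4)
adapter_size = lora_params(toy_shapes, rank=4)
```

In this toy setting, pruning 3 of 10 blocks leaves a little under three quarters of the parameters on disk. The forward pass still runs all 10 blocks plus the adapters, which is consistent with storage shrinking while inference compute does not.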
Results
PPL (LLaMA-2 7B, 30% prune, FlexiGPT)
Average zero-shot (LLaMA-2 7B, 30% prune, FlexiGPT)
PPL (TinyLLaMA extend 22→36 layers, FlexiGPT Block)
Normalized compute time (inference, LLaMA-2 7B)
Who Should Care
What To Try In 7 Days
Run block-level pruning at 30% on a 7B model, then run a 1B-token recovery fine-tune with LoRA adapters initialized by SVD.
Enable output feature normalization exactly as described, and check the post-prune starting PPL before committing to a longer fine-tune.
Compare storage saved vs latency impact on your target device; try self-speculative decoding to recoup throughput.
Optimization Features
Token Efficiency
- Model extension achieved with ≈0.3% extra tokens relative to original pretraining for TinyLLaMA
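The ≈0.3% figure is consistent with a quick calculation, assuming TinyLLaMA's original pretraining budget was roughly 3 trillion tokens (that budget is not stated in this summary and is taken here as an assumption):

```latex
\frac{10 \times 10^{9}\ \text{tokens}}{3 \times 10^{12}\ \text{tokens}}
  = \frac{1}{300}
  \approx 0.33\%
```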
Model Optimization
- Block-level pruning guided by Block Influence score
- Low-rank weight-sharing replacement using similar unpruned blocks
- LoRA adapters on the shared replacement blocks
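A block-selection score of this kind can be sketched as follows. The sketch assumes the Block Influence definition used by ShortGPT, one minus the mean cosine similarity between a block's input and output hidden states, so that blocks which barely change their input score lowest and are pruned first. The function names and the toy hidden states are hypothetical.

```python
import numpy as np

def block_influence(h_in, h_out):
    """1 - mean cosine similarity between per-token hidden states before and after a block."""
    dot = np.sum(h_in * h_out, axis=-1)
    norms = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    return float(1.0 - np.mean(dot / norms))

def blocks_to_prune(hidden_states, n_prune):
    """hidden_states[i] is the input to block i; hidden_states[-1] is the final output."""
    scores = [block_influence(hidden_states[i], hidden_states[i + 1])
              for i in range(len(hidden_states) - 1)]
    lowest = np.argsort(scores)[:n_prune]   # least influential blocks first
    return sorted(int(i) for i in lowest), scores

# Toy hidden states (assumption): blocks 0 and 2 only rescale their input,
# block 1 maps every token to an orthogonal direction.
rng = np.random.default_rng(0)
h0 = rng.normal(size=(8, 4))
h1 = 2.0 * h0
h2 = h1[:, [1, 0, 3, 2]] * np.array([1.0, -1.0, 1.0, -1.0])
h3 = 0.5 * h2

pruned, scores = blocks_to_prune([h0, h1, h2, h3], n_prune=2)
```

In practice the hidden states would come from running calibration text through the model; the toy arrays only make the ranking easy to check by hand.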
System Optimization
- FSDP + FP16 mixed precision used in experiments
Training Optimization
- SVD-based initialization of adapters (low-rank difference)
- Short post-prune recovery fine-tune (1B tokens typical)
- Continued pretraining for extension experiments (10B tokens)
Inference Optimization
- Self-speculative decoding: draft with pruned model, verify with full FlexiGPT
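The draft-and-verify loop can be sketched in plain Python with greedy decoding. This is a simplified illustration, not the paper's implementation: `draft_next` and `verify_next` are hypothetical stand-ins for the cheap draft model and the full model, and verification is done one token at a time for clarity, whereas a real system would check all drafted positions in a single batched forward pass.

```python
def speculative_decode(draft_next, verify_next, prompt, n_new, k=4):
    """Greedy draft-and-verify; the output equals greedy decoding with verify_next alone."""
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    accepted = proposed = 0
    while len(tokens) < target_len:
        # Draft up to k tokens with the cheap model.
        draft = []
        for _ in range(min(k, target_len - len(tokens))):
            draft.append(draft_next(tokens + draft))
        proposed += len(draft)
        # Verify in order: keep agreeing tokens, and at the first mismatch
        # take the verifier's token and discard the rest of the draft.
        for tok in draft:
            expected = verify_next(tokens)
            tokens.append(expected)
            if expected != tok:
                break
            accepted += 1
    rate = accepted / proposed if proposed else 0.0
    return tokens, rate

# Toy next-token functions (assumption): the draft disagrees only after token 5.
def toy_verify(tokens):
    return (tokens[-1] * 3 + 1) % 7

def toy_draft(tokens):
    return 0 if tokens[-1] == 5 else toy_verify(tokens)

output, acceptance_rate = speculative_decode(toy_draft, toy_verify, [1], n_new=8)
```

The acceptance rate is the quantity to watch on a target device: the more drafted tokens the full model accepts, the more of the added inference cost can be recovered.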
Reproducibility
Data Urls
- SlimPajama
- MiniPile (validation)
- ARC-e/ARC-c
- PIQA
- WinoGrande
- HellaSwag
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Does not reduce inference compute; normalized inference runtime rose to 105.1% of the unpruned model's in experiments.
- Requires a post-prune recovery fine-tune (1B tokens) which adds compute and time.
- Evaluated on a limited set of models and benchmarks; generality to all architectures not proven.
When Not To Use
- If inference latency or throughput is the top constraint and you cannot add decoding techniques such as self-speculative decoding.
- If you cannot afford any post-prune fine-tuning compute.
- For architectures very different from evaluated transformer blocks without validating the weight-selection metric.
Failure Modes
- Removing output feature normalization or the SVD-based adapter init causes a catastrophically high starting PPL and unstable recovery.
- Weight-sharing base selection may pick poor matches if model layers have different functions, hurting recovery.
- Higher inference compute can make deployment infeasible on very low-power devices.
Core Entities
Models
- LLaMA-2 7B
- LLaMA-3 8B
- OPT 6.7B
- OPT 1.3B
- TinyLLaMA 1.1B
Metrics
- Perplexity (PPL)
- Accuracy
- Normalized compute time
- Throughput
Datasets
- SlimPajama
- MiniPile (validation subset)
- ARC-e
- ARC-c
- PIQA
- WinoGrande
- HellaSwag
Benchmarks
- ARC
- PIQA
- WinoGrande
- HellaSwag

