Overview
The method is practical for on-device storage reduction and small-model extension. Experiments cover several public models and tasks, but the method requires recovery training and modestly raises inference cost.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 40%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
FlexiGPT cuts stored parameters by ~30% while keeping task accuracy far higher than naive pruning, enabling on-device deployment where storage matters; expect slightly higher inference cost and a brief recovery training step.
Summary TLDR
FlexiGPT prunes whole transformer blocks and replaces each with a similar unpruned block plus small low-rank adapters (LoRA) initialized from an SVD of the weight difference. It retains much of the original accuracy while cutting stored parameters by roughly 30% at common pruning rates (30–40%). The method needs a short recovery fine-tune (1B tokens) and slightly increases inference compute. It also supports cheaply extending small models by repeating blocks with unique adapters, followed by a small amount of continued pretraining (~10B tokens ≈ 0.3% extra).
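The SVD-based adapter initialization described above can be sketched as follows. This is a minimal illustration, assuming a single weight matrix per block and the standard LoRA form `w_base + B @ A`; the helper name `svd_lora_init`, the rank, and the even split of singular values between the two factors are assumptions for illustration, not details confirmed by the source.

```python
import numpy as np

def svd_lora_init(w_pruned, w_base, rank):
    """Return LoRA factors (B, A) so that w_base + B @ A approximates w_pruned.

    Hypothetical helper: the factors come from a truncated SVD of the
    weight difference, with singular values split evenly between B and A.
    """
    diff = w_pruned - w_base
    u, s, vt = np.linalg.svd(diff, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    b = u[:, :rank] * sqrt_s            # shape (d_out, rank)
    a = sqrt_s[:, None] * vt[:rank, :]  # shape (rank, d_in)
    return b, a

rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 64))
# Toy "pruned" block: the base block plus a rank-4 perturbation.
w_pruned = w_base + rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))

b, a = svd_lora_init(w_pruned, w_base, rank=4)
rel_err = np.linalg.norm(w_pruned - (w_base + b @ a)) / np.linalg.norm(w_pruned)
print(f"relative reconstruction error: {rel_err:.2e}")
```

In this toy case the difference is exactly rank 4, so a rank-4 adapter recovers the pruned weights almost exactly. Real weight differences are not exactly low rank, so the adapter only approximates the removed block and the recovery fine-tune has to close the remaining gap.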
Problem Statement
Large language models are too large for many devices. Existing pruning methods often degrade performance and do little to recover lost capacity. What is needed is a pruning approach that reduces parameter storage while restoring accuracy with minimal extra parameters and modest fine-tuning.
Main Contribution
A block-level pruning pipeline that selects blocks with a Block Influence score and replaces them using weight sharing plus LoRA adapters.
A low-rank SVD-based metric to choose which unpruned block to share as a replacement.
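A minimal sketch of block selection by influence score follows. It assumes the Block Influence score follows the ShortGPT definition (one minus the mean cosine similarity between a block's input and output hidden states); the source does not spell out the formula, and the function names and toy hidden states are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def block_influence(inputs, outputs):
    """1 - mean cosine similarity between per-token input and output states."""
    sims = [cosine(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)

def blocks_to_prune(scores, prune_ratio):
    """Indices of the lowest-influence blocks, up to prune_ratio of all blocks."""
    n_prune = int(len(scores) * prune_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_prune])

# Toy hidden states: two tokens, three blocks (assumed data, not from the source).
hidden = [
    [[1.0, 0.0], [0.0, 1.0]],  # input to block 0
    [[0.0, 1.0], [1.0, 0.0]],  # output of block 0: orthogonal, high influence
    [[0.0, 2.0], [2.0, 0.0]],  # output of block 1: same direction, low influence
    [[1.0, 1.0], [1.0, 1.0]],  # output of block 2: partial change
]
scores = [block_influence(hidden[i], hidden[i + 1]) for i in range(3)]
pruned = blocks_to_prune(scores, prune_ratio=0.34)
print(scores, pruned)
```

Block 1 only rescales its input, so its score is zero and it is the first candidate for removal. In practice the hidden states would be collected from a calibration set rather than written by hand.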
Key Findings
FlexiGPT gives much lower perplexity than ShortGPT after pruning (6.55 vs. 22.76 on LLaMA-2 7B at 30% pruning).
FlexiGPT outperforms the compared pruning baselines on common zero-shot tasks at 30–40% compression.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| PPL (LLaMA-2 7B, 30% prune, FlexiGPT) | 6.55 | ShortGPT 22.76 | -16.21 | MiniPile validation (PPL) / SlimPajama recovery | Table 1: PPL comparison | Table 1 |
| Average zero-shot (LLaMA-2 7B, 30% prune, FlexiGPT) | 62.68% | Unpruned 69.02% | -6.34 pp | ARC-e, ARC-c, PIQA, WinoGrande, HellaSwag | Table 1: zero-shot averages | Table 1 |
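
The Delta column can be recomputed from the Value and Baseline columns. The short check below does so and also derives relative changes; the variable names are illustrative, and the relative figures are derived here rather than reported in the source.

```python
rows = {
    "ppl": {"value": 6.55, "baseline": 22.76},
    "zero_shot_avg": {"value": 62.68, "baseline": 69.02},
}
for name, r in rows.items():
    # Absolute change against the baseline, as in the Delta column.
    delta = round(r["value"] - r["baseline"], 2)
    # Change relative to the baseline, in percent.
    relative = round(100 * delta / r["baseline"], 1)
    r["delta"], r["relative_pct"] = delta, relative
    print(f"{name}: delta={delta}, relative={relative}%")
```

In relative terms, FlexiGPT's perplexity is about 71% below ShortGPT's, and its zero-shot average is about 9% below the unpruned model's.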
What To Try In 7 Days
Run block-level pruning at 30% on a 7B model, then perform a 1B-token recovery fine-tune using LoRA adapters initialized by SVD.
Enable output feature normalization exactly as described, and validate the starting post-prune PPL before committing to a longer fine-tune.
Compare storage saved against latency impact on your target device; try self-speculative decoding to recoup throughput.
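Before running the experiment, a back-of-the-envelope estimate of storage savings can help set expectations. The sketch below assumes each pruned block is stored only as LoRA adapters on every weight matrix of the shared block. The function names, the adapter rank of 64, the choice of 10 pruned blocks, and the layer shapes (chosen to resemble a LLaMA-2 7B style model with 32 blocks, hidden size 4096, and MLP size 11008) are assumptions for illustration.

```python
def lora_params(d_out, d_in, rank):
    """Parameters in one LoRA adapter: B is (d_out, rank), A is (rank, d_in)."""
    return rank * (d_out + d_in)

def storage_saved(n_blocks, n_pruned, shapes, rank, other_params=0):
    """Fraction of stored parameters removed when n_pruned blocks are replaced
    by adapters on a shared block (norm layers are ignored)."""
    block = sum(o * i for o, i in shapes)
    adapters = sum(lora_params(o, i, rank) for o, i in shapes)
    original = n_blocks * block + other_params
    pruned = (n_blocks - n_pruned) * block + n_pruned * adapters + other_params
    return 1.0 - pruned / original

# Assumed shapes: four attention projections and three MLP projections.
shapes = [(4096, 4096)] * 4 + [(11008, 4096)] * 2 + [(4096, 11008)]
embeddings = 2 * 32000 * 4096  # input embeddings plus output head (assumed untied)
saved = storage_saved(32, 10, shapes, rank=64, other_params=embeddings)
print(f"estimated stored-parameter reduction: {saved:.1%}")
```

The estimate covers storage only. Because the shared block still runs in place of each pruned block, compute does not drop, so latency has to be measured separately on the target device.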
Risks & Boundaries
Limitations
Does not reduce inference compute; normalized runtime rose to 105.1% of the unpruned model's in experiments.
Requires a post-prune recovery fine-tune (1B tokens), which adds compute and time.
When Not To Use
If inference latency or throughput is the top constraint and you cannot add decoding tricks such as self-speculative decoding.
If you cannot afford any post-prune fine-tuning compute.
Failure Modes
Removing output normalization or SVD initialization causes a catastrophically high starting PPL and unstable recovery.
Weight-sharing base selection may pick poor matches if model layers serve different functions, hurting recovery.
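To make the base-selection failure mode concrete, here is one plausible form of a low-rank matching score. The source does not give the exact metric, so this is an assumption: each candidate base block is scored by the Frobenius error that remains after the best rank-limited correction, and the candidate with the smallest error is chosen. The function names and toy matrices are hypothetical.

```python
import numpy as np

def lowrank_residual(w_pruned, w_base, rank):
    """Frobenius error left after the best rank-`rank` correction to w_base."""
    s = np.linalg.svd(w_pruned - w_base, compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))

def pick_base_block(w_pruned, candidates, rank):
    """Index of the candidate whose low-rank-corrected weights best match w_pruned."""
    errors = [lowrank_residual(w_pruned, w, rank) for w in candidates]
    return int(np.argmin(errors)), errors

rng = np.random.default_rng(0)
w_pruned = rng.normal(size=(32, 32))
low_rank_shift = rng.normal(size=(32, 2)) @ rng.normal(size=(2, 32))
candidates = [
    rng.normal(size=(32, 32)),   # unrelated block: poor match
    w_pruned - low_rank_shift,   # differs by a rank-2 matrix: good match
]
best, errors = pick_base_block(w_pruned, candidates, rank=2)
print(best, errors)
```

If every candidate leaves a large residual, no unpruned block is a good stand-in at the chosen rank, and the adapter starts far from the removed block's weights.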

