FlexiGPT: prune or extend LLMs by replacing blocks with low-rank weight-sharing and LoRA adapters

January 24, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical for on-device storage reduction and small-model extension. Experiments cover several public models and tasks, but the method requires recovery training and modestly raises inference cost.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 70%

Novelty: 50%

Authors

James Seale Smith, Chi-Heng Lin, Shikhar Tuli, Haris Jeelani, Shangqian Gao, Yilin Shen, Hongxia Jin, Yen-Chang Hsu

Why It Matters For Business

FlexiGPT cuts stored parameters by ~30% while keeping task accuracy far higher than naive pruning, enabling on-device deployment where storage matters; expect slightly higher inference cost and a brief recovery training step.

Who Should Care

Summary TLDR

FlexiGPT prunes whole transformer blocks and replaces each one with a similar unpruned block plus small low-rank adapters (LoRA) initialized from an SVD of the weight difference. At common pruning rates (30–40%) it retains much of the original accuracy while cutting stored parameters by roughly 30%. The method needs a short recovery fine-tune (1B tokens) and slightly increases inference compute (normalized runtime of 105.1% vs the unpruned model in the reported experiments). It also supports cheaply extending small models by repeating blocks with unique adapters and a small amount of continued pretraining (~10B tokens, ≈ 0.3% extra relative to the original pretraining).
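The SVD-based adapter initialization described above can be sketched as follows. This is a minimal illustration with NumPy, not the authors' code: the function name `svd_init_lora`, the matrix shapes, the rank, and the even split of singular values between the two factors are all assumptions made for the example.

```python
import numpy as np

def svd_init_lora(w_pruned, w_shared, rank):
    """Return LoRA factors (b, a) so that b @ a is the best rank-`rank`
    approximation of the weight difference w_pruned - w_shared."""
    delta = w_pruned - w_shared
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    b = u[:, :rank] * sqrt_s            # shape (out_features, rank)
    a = sqrt_s[:, None] * vt[:rank]     # shape (rank, in_features)
    return b, a

rng = np.random.default_rng(0)
w_shared = rng.normal(size=(64, 48))    # weight of the block that is kept and shared
# Hypothetical pruned-block weight: the shared weight plus an exactly rank-4 change.
w_pruned = w_shared + rng.normal(size=(64, 4)) @ rng.normal(size=(4, 48))

b, a = svd_init_lora(w_pruned, w_shared, rank=4)
err_without = np.linalg.norm(w_pruned - w_shared)
err_with_adapter = np.linalg.norm(w_pruned - (w_shared + b @ a))
print(err_without, err_with_adapter)
```

In this toy case the difference is exactly rank 4, so the adapter recovers the pruned weight almost perfectly. Real block differences are not exactly low rank, which is why a recovery fine-tune still follows.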

Problem Statement

Large language models are too large for many devices. Existing pruning methods often drop performance and do little to recover capacity. We need a pruning approach that reduces parameter storage while restoring accuracy with minimal extra parameters and modest fine-tuning.

Main Contribution

A block-level pruning pipeline that selects blocks with a Block Influence score and replaces them using weight sharing plus LoRA adapters.

A low-rank SVD-based metric to choose which unpruned block to share as a replacement.
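The two selection steps can be sketched as below. Both definitions are assumptions: this summary does not define the Block Influence score, so the sketch uses the definition from the ShortGPT work (one minus the mean cosine similarity between a block's input and output hidden states), and the base-selection metric is one plausible reading (the part of the weight difference that a rank-`rank` adapter cannot capture). All function names are hypothetical.

```python
import numpy as np

def block_influence(h_in, h_out):
    """1 - mean cosine similarity between block input and output states.
    h_in, h_out: arrays of shape (tokens, hidden_dim). Low score = prune candidate."""
    num = np.sum(h_in * h_out, axis=-1)
    den = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    return float(1.0 - np.mean(num / den))

def low_rank_residual(w_pruned, w_candidate, rank):
    """Frobenius norm of the difference left over after a rank-`rank` fit."""
    s = np.linalg.svd(w_pruned - w_candidate, compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))

def pick_share_base(w_pruned, candidates, rank):
    """Index of the unpruned block whose weights are closest up to a low-rank term."""
    residuals = [low_rank_residual(w_pruned, w, rank) for w in candidates]
    return int(np.argmin(residuals))

rng = np.random.default_rng(0)
h_in = rng.normal(size=(128, 32))
noise = rng.normal(size=(128, 32))
bi_small = block_influence(h_in, h_in + 0.01 * noise)  # block barely changes its input
bi_large = block_influence(h_in, h_in + 1.0 * noise)   # block changes its input a lot

unrelated = rng.normal(size=(32, 32))
similar = rng.normal(size=(32, 32))
w_pruned = similar + rng.normal(size=(32, 2)) @ rng.normal(size=(2, 32))
chosen = pick_share_base(w_pruned, [unrelated, similar], rank=2)
print(bi_small, bi_large, chosen)
```

Under these assumptions, blocks with the lowest influence are pruned first, and each pruned block borrows weights from the candidate with the smallest residual.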

Key Findings

FlexiGPT gives much lower perplexity than ShortGPT after pruning.

Numbers: LLaMA-2 7B, 30% prune: PPL 6.55 vs ShortGPT 22.76

Practical Use: If you need to prune a 7B model, replace-block + LoRA recovery keeps PPL near unpruned levels and avoids the heavy PPL cost of naive layer removal.

Evidence Ref: Table 1

FlexiGPT is the strongest pruning method on common zero-shot tasks at 30–40% compression.

Numbers: LLaMA-2 7B, 30% prune: average zero-shot accuracy 62.68% (best among pruning methods); at 40%, best on all listed tasks

Practical Use: Use FlexiGPT when you must reduce stored parameters by ~30–40% but still want the best task accuracy among pruning options.

Evidence Ref: Table 1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| PPL (LLaMA-2 7B, 30% prune, FlexiGPT) | 6.55 | ShortGPT 22.76 | -16.21 | MiniPile validation (PPL) / SlimPajama recovery | Table 1: PPL comparison | Table 1 |
| Average zero-shot (LLaMA-2 7B, 30% prune, FlexiGPT) | 62.68% | Unpruned 69.02% | -6.34% | ARC-e, ARC-c, PIQA, WinoGrande, HellaSwag | Table 1: zero-shot averages | Table 1 |

What To Try In 7 Days

Run block-level pruning at 30% on a 7B model and perform 1B-token recovery fine-tune using LoRA adapters initialized by SVD.

Enable output feature normalization exactly as described; validate post-prune start PPL before longer fine-tune.

Compare storage saved vs latency impact on your target device; try self-speculative decoding to recoup throughput.
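Before running the storage comparison, a rough parameter count helps set expectations. The sketch below is back-of-the-envelope arithmetic, not the paper's accounting. It assumes LLaMA-2 7B block dimensions (32 blocks, hidden size 4096, MLP size 11008), one adapter on every linear layer, and a hypothetical LoRA rank of 64. It ignores embeddings, norms, and biases.

```python
def block_params(hidden, intermediate):
    # 4 attention projections (q, k, v, o) plus a 3-matrix gated MLP.
    return 4 * hidden * hidden + 3 * hidden * intermediate

def lora_params(hidden, intermediate, rank):
    # One rank-`rank` adapter per linear layer: rank * (in_features + out_features).
    attn = 4 * rank * (hidden + hidden)
    mlp = 3 * rank * (hidden + intermediate)
    return attn + mlp

def stored_block_fraction(n_blocks, n_pruned, hidden, intermediate, rank):
    """Fraction of the original block parameters still stored after pruning
    n_pruned blocks and giving each one its own set of adapters."""
    per_block = block_params(hidden, intermediate)
    kept = (n_blocks - n_pruned) * per_block
    adapters = n_pruned * lora_params(hidden, intermediate, rank)
    return (kept + adapters) / (n_blocks * per_block)

fraction = stored_block_fraction(
    n_blocks=32, n_pruned=10, hidden=4096, intermediate=11008, rank=64
)
print(f"stored fraction of block parameters: {fraction:.4f}")
```

Under these assumptions, pruning 10 of 32 blocks leaves a little under 70% of the block parameters in storage, so the adapters cost well under one percentage point of the savings.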

Optimization Features

Token Efficiency
Model extension achieved with ≈0.3% extra tokens relative to original pretraining for TinyLLaMA
Model Optimization
Block-level pruning guided by Block Influence score; low-rank weight-sharing replacement using similar unpruned blocks; LoRA
System Optimization
FSDP + FP16 mixed precision used in experiments
Training Optimization
SVD-based initialization of adapters (low-rank difference); short post-prune recovery fine-tune (1B tokens typical); continued pretraining for extension experiments (10B tokens)
Inference Optimization
Self-speculative decoding: draft with pruned model, verify with full FlexiGPT
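The self-speculative decoding item above can be illustrated with a toy greedy version. This is a sketch of the general draft-then-verify idea, not the paper's implementation: `draft_next` stands in for the cheap pruned model and `verify_next` for the full FlexiGPT model, both hypothetical callables that map a token list to the next token. A real system would verify all drafted tokens in one batched forward pass rather than one call per token.

```python
def greedy_decode(next_token, prompt, n_new):
    """Plain greedy decoding with a single model, used as the reference."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens

def speculative_decode(draft_next, verify_next, prompt, n_new, k=4):
    """Draft k tokens with the cheap model, keep the prefix the verifier agrees with."""
    tokens = list(prompt)
    target_len = len(tokens) + n_new
    while len(tokens) < target_len:
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        accepted = []
        for tok in draft:
            expected = verify_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every drafted token was accepted, so the verifier adds one more.
            accepted.append(verify_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len]

def verify_next(ctx):
    # Toy "full model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    # Toy "pruned model": agrees with the full model except after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

out = speculative_decode(draft_next, verify_next, prompt=[0], n_new=12, k=3)
ref = greedy_decode(verify_next, prompt=[0], n_new=12)
print(out == ref)
```

With greedy verification the output matches what the full model would produce on its own. Any throughput gain comes from how often the draft is accepted.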

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

SlimPajama, MiniPile (validation), ARC-e/ARC-c, PIQA, WinoGrande, HellaSwag

Risks & Boundaries

Limitations

Does not reduce inference compute; normalized runtime increased to 105.1% vs unpruned in experiments.

Requires a post-prune recovery fine-tune (1B tokens) which adds compute and time.

When Not To Use

If inference latency or throughput is the top constraint and you cannot add decoding tricks.

If you cannot afford any post-prune fine-tuning compute.

Failure Modes

Removing output normalization or SVD init causes catastrophic start PPL and unstable recovery.

Weight-sharing base selection may pick poor matches if model layers have different functions, hurting recovery.

Core Entities

Models

LLaMA-2 7B, LLaMA-3 8B, OPT 6.7B, OPT 1.3B, TinyLLaMA 1.1B

Metrics

Perplexity (PPL), Accuracy, Normalized compute time, Throughput

Datasets

SlimPajama, MiniPile (validation subset), ARC-e, ARC-c, PIQA, WinoGrande, HellaSwag

Benchmarks

ARC, PIQA, WinoGrande, HellaSwag