FlexiGPT: prune or extend LLMs by replacing blocks with low-rank weight-sharing and LoRA adapters

Overview

Decision Snapshot: Ready For Pilot

Method is practical for on-device storage reduction and small-model extension; experiments cover several public models and tasks but require recovery training and modestly raise inference cost.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 70%

Novelty: 50%

Authors

James Seale Smith, Chi-Heng Lin, Shikhar Tuli, Haris Jeelani, Shangqian Gao, Yilin Shen, Hongxia Jin, Yen-Chang Hsu

Why It Matters For Business

FlexiGPT cuts stored parameters by ~30% while keeping task accuracy far higher than naive pruning, which enables on-device deployment where storage matters. Expect slightly higher inference cost (normalized runtime of 105.1% vs unpruned in the reported experiments) and a recovery fine-tune of about 1B tokens.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Founder

Summary TLDR

FlexiGPT prunes whole transformer blocks and replaces each one with a similar unpruned block plus small low-rank adapters (LoRA) initialized from an SVD of the weight difference. At common pruning rates (30–40%) it cuts stored parameters by roughly 30% while keeping much of the original accuracy. The method needs a short recovery fine-tune (1B tokens) and increases inference compute slightly. It also supports cheaply extending small models by repeating blocks with unique adapters and a small amount of continued pretraining (~10B tokens, ≈ 0.3% of the original pretraining tokens).
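To make the replacement step concrete, here is a minimal NumPy sketch of one linear layer inside a replaced block: it reuses a kept block's weight matrix by reference and adds its own low-rank update. The class name `SharedLinear`, the shapes, and the random values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

class SharedLinear:
    """Linear layer that borrows another block's weight and adds a low-rank update."""

    def __init__(self, shared_weight, A, B):
        self.weight = shared_weight  # same array object as the kept block: no extra storage
        self.A = A                   # (rank, d_in), unique to this replaced block
        self.B = B                   # (d_out, rank), unique to this replaced block

    def __call__(self, x):
        # Equivalent to x @ (W + B @ A).T without materializing the full sum.
        return x @ self.weight.T + (x @ self.A.T) @ self.B.T

    def extra_params(self):
        # Only the adapter matrices count toward added storage.
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
d, r = 64, 4                                   # hypothetical width and adapter rank
w_base = rng.normal(size=(d, d))               # weight of the kept (shared) block
layer = SharedLinear(w_base, rng.normal(size=(r, d)), rng.normal(size=(d, r)))
x = rng.normal(size=(3, d))
y = layer(x)
print(layer.extra_params(), "adapter params vs", w_base.size, "in the shared weight")
```

With these toy sizes the adapter stores 2·d·r values instead of d·d, which is where the storage saving comes from; the extra matrix products are also why inference compute does not drop.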

Problem Statement

Large language models are too large for many devices. Existing pruning methods often drop performance and do little to recover capacity. We need a pruning approach that reduces parameter storage while restoring accuracy with minimal extra parameters and modest fine-tuning.

Main Contribution

A block-level pruning pipeline that selects blocks with a Block Influence score and replaces them using weight sharing plus LoRA adapters.

A low-rank SVD-based metric to choose which unpruned block to share as a replacement.
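The exact form of the selection metric is not spelled out in this summary. The sketch below assumes one plausible reading: score each kept block by how much of the weight difference a rank-r adapter could not capture (the Frobenius norm of the discarded singular values), and share the block with the smallest residual. The function names and toy matrices are hypothetical.

```python
import numpy as np

def lowrank_residual(w_pruned, w_candidate, rank):
    """Frobenius norm of the part of (w_pruned - w_candidate) left after a rank-`rank` fit."""
    s = np.linalg.svd(w_pruned - w_candidate, compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))

def pick_base_block(w_pruned, candidates, rank):
    """Return the index of the candidate with the smallest residual, plus all residuals."""
    residuals = [lowrank_residual(w_pruned, w, rank) for w in candidates]
    return int(np.argmin(residuals)), residuals

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 8))
other = rng.normal(size=(8, 8))
# Toy pruned weight: exactly `base` plus a rank-2 perturbation.
w_pruned = base + rng.normal(size=(8, 2)) @ rng.normal(size=(2, 8))
idx, residuals = pick_base_block(w_pruned, [other, base], rank=2)
print("chosen candidate:", idx)
```

In this toy case the second candidate wins because its difference from the pruned weight is exactly rank 2, so a rank-2 adapter can absorb all of it.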

Key Findings

FlexiGPT gives much lower perplexity than ShortGPT after pruning.

Numbers: LLaMA-2 7B, 30% prune: PPL 6.55 vs ShortGPT 22.76

Practical Use: If you need to prune a 7B model, replace-block+LoRA recovery keeps PPL near unpruned levels and avoids the heavy PPL cost of naïve layer removal.

Evidence Ref: Table 1

FlexiGPT is the strongest of the compared pruning methods on common zero-shot tasks at 30–40% compression.

Numbers: LLaMA-2 7B, 30% average zero-shot 62.68% (best among pruning methods); 40% best on all listed tasks

Practical Use: Use FlexiGPT when you must reduce stored parameters by ~30–40% but still want the best task accuracy among pruning options.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
PPL (LLaMA-2 7B, 30% prune, FlexiGPT)	6.55	ShortGPT 22.76	-16.21	MiniPile validation (PPL) / SlimPajama recovery	Table 1: PPL comparison	Table 1
Average zero-shot (LLaMA-2 7B, 30% prune, FlexiGPT)	62.68%	Unpruned 69.02%	-6.34 pp	ARC-e, ARC-c, PIQA, WinoGrande, HellaSwag	Table 1: zero-shot averages	Table 1

What To Try In 7 Days

Run block-level pruning at 30% on a 7B model and perform a 1B-token recovery fine-tune using LoRA adapters initialized by SVD.

Enable output feature normalization exactly as described; validate the post-prune starting PPL before a longer fine-tune.

Compare storage saved vs latency impact on your target device; try self-speculative decoding to recoup throughput.

Optimization Features

Token Efficiency

Model extension achieved with ≈0.3% extra tokens relative to original pretraining for TinyLLaMA
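As a sanity check on the ≈0.3% figure, the ratio below assumes TinyLLaMA's original pretraining budget was roughly 3 trillion tokens; that budget is an assumption not stated in this summary, while the 10B-token continued-pretraining figure is.

```latex
\frac{10 \times 10^{9}\ \text{tokens}}{3 \times 10^{12}\ \text{tokens}} \approx 0.0033 \approx 0.3\%
```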

Model Optimization

Block-level pruning guided by Block Influence score

Low-rank weight-sharing replacement using similar unpruned blocks

LoRA
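The Block Influence score is named here but not defined. The sketch below assumes the definition used by ShortGPT: one minus the mean cosine similarity between a block's input and output hidden states, so a low score marks a block that barely changes its input. The helper names and the rule of pruning the lowest-scoring blocks are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def block_influence(hidden_in, hidden_out):
    """1 - mean cosine similarity over tokens (assumed ShortGPT-style definition)."""
    sims = [cosine(x, y) for x, y in zip(hidden_in, hidden_out)]
    return 1.0 - sum(sims) / len(sims)

def blocks_to_prune(scores, prune_ratio):
    """Indices of the lowest-influence blocks, for a given fraction of blocks."""
    k = int(len(scores) * prune_ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])

# Toy example: per-block scores for a 4-block model, pruning half the blocks.
print(blocks_to_prune([0.5, 0.1, 0.9, 0.2], prune_ratio=0.5))
```

In practice the hidden states would be collected on a small calibration set; here they are plain Python lists to keep the sketch dependency-free.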

System Optimization

FSDP + FP16 mixed precision used in experiments

Training Optimization

SVD-based initialization of adapters (low-rank difference)

Short post-prune recovery fine-tune (1B tokens typical)

Continued pretraining for extension experiments (10B tokens)
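A minimal sketch of SVD-based adapter initialization, assuming the adapter is set to the best rank-r approximation of the difference between the pruned block's weight and the shared block's weight. Splitting the singular values evenly between the two factors is a choice made here for illustration, and `svd_init_adapter` is a hypothetical name.

```python
import numpy as np

def svd_init_adapter(w_pruned, w_base, rank):
    """Return (B, A) such that B @ A is the best rank-`rank` fit to w_pruned - w_base."""
    u, s, vt = np.linalg.svd(w_pruned - w_base, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    B = u[:, :rank] * sqrt_s          # (d_out, rank): scale each kept column
    A = sqrt_s[:, None] * vt[:rank]   # (rank, d_in): scale each kept row
    return B, A

rng = np.random.default_rng(0)
w_base = rng.normal(size=(16, 16))
delta = rng.normal(size=(16, 3)) @ rng.normal(size=(3, 16))   # toy rank-3 difference
w_pruned = w_base + delta

B, A = svd_init_adapter(w_pruned, w_base, rank=3)
start_error = np.linalg.norm(w_pruned - (w_base + B @ A))
print("reconstruction error at init:", start_error)
```

With this kind of initialization the replaced block starts near the weights it replaces, which fits the reported failure mode that dropping the SVD init gives a very poor starting PPL.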

Inference Optimization

Self-speculative decoding: draft with pruned model, verify with full FlexiGPT
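The draft-then-verify loop can be sketched as greedy speculative decoding. The toy functions below stand in for the pruned draft model and the full FlexiGPT verifier; they, the draft length `k`, and the token arithmetic are all hypothetical. A real implementation would verify all drafted tokens in one batched forward pass rather than one call per token.

```python
def greedy(next_token, prompt, n_new):
    """Plain greedy decoding with a single model."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens

def speculative_greedy(draft_next, verify_next, prompt, n_new, k=4):
    """Draft k tokens cheaply, keep the prefix the verifier agrees with."""
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    while len(tokens) < target_len:
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        for i in range(k):
            expected = verify_next(tokens + draft[:i])
            if draft[i] != expected:
                # First disagreement: keep the agreed prefix plus the verifier's token.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(draft)  # every drafted token was accepted
    return tokens[:target_len]

def full_model(tokens):
    """Toy stand-in for the full model (the verifier)."""
    return (tokens[-1] * 3 + 1) % 7

def pruned_model(tokens):
    """Toy stand-in for the cheaper draft model; it disagrees after token 5."""
    return 0 if tokens[-1] == 5 else full_model(tokens)

out = speculative_greedy(pruned_model, full_model, [1], n_new=10)
print(out == greedy(full_model, [1], 10))
```

Because a drafted token is kept only when it equals the verifier's own greedy choice, the output matches decoding with the verifier alone; any speedup depends on how often the draft agrees.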

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

SlimPajama, MiniPile (validation), ARC-e/ARC-c, PIQA, WinoGrande, HellaSwag

Risks & Boundaries

Limitations

Does not reduce inference compute; normalized runtime increased to 105.1% vs unpruned in experiments.

Requires a post-prune recovery fine-tune (1B tokens) which adds compute and time.

When Not To Use

If inference latency or throughput is the top constraint and you cannot add decoding tricks.

If you cannot afford any post-prune fine-tuning compute.

Failure Modes

Removing output normalization or SVD init causes catastrophic start PPL and unstable recovery.

Weight-sharing base selection may pick poor matches if model layers have different functions, hurting recovery.

Core Entities

Models

LLaMA-2 7B, LLaMA-3 8B, OPT 6.7B, OPT 1.3B, TinyLLaMA 1.1B

Metrics

Perplexity (PPL), Accuracy, Normalized compute time, Throughput

Datasets

SlimPajama, MiniPile (validation subset), ARC-e, ARC-c, PIQA, WinoGrande, HellaSwag

Benchmarks

ARC, PIQA, WinoGrande, HellaSwag
