FlexiGPT: prune or extend LLMs by replacing blocks with low-rank weight-sharing and LoRA adapters

January 24, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

0

Authors

James Seale Smith, Chi-Heng Lin, Shikhar Tuli, Haris Jeelani, Shangqian Gao, Yilin Shen, Hongxia Jin, Yen-Chang Hsu

Links

Abstract / PDF

Why It Matters For Business

FlexiGPT cuts stored parameters by ~30% while keeping task accuracy far higher than naive pruning, which makes on-device deployment feasible where storage is the constraint. Expect slightly higher inference cost (about 5% more runtime in the reported experiments) and a short recovery training step.

Summary TLDR

FlexiGPT prunes whole transformer blocks and replaces each one with the weights of a similar unpruned block plus small low-rank adapters (LoRA) initialized from an SVD of the weight difference. At common pruning rates (30–40%) it keeps much of the original accuracy while cutting stored parameters by roughly 30%. The method needs a short recovery fine-tune (1B tokens) and slightly increases inference compute. It also supports cheaply extending small models by repeating blocks with unique adapters and a small amount of continued pretraining (~10B tokens, about 0.3% of the original pretraining budget).

Problem Statement

Large language models are too large for many devices. Existing pruning methods often drop performance and do little to recover capacity. We need a pruning approach that reduces parameter storage while restoring accuracy with minimal extra parameters and modest fine-tuning.

Main Contribution

A block-level pruning pipeline that selects blocks with a Block Influence score and replaces them using weight sharing plus LoRA adapters.

A low-rank SVD-based metric to choose which unpruned block to share as a replacement.

Adapter initialization using the low-rank SVD difference and output feature normalization to stabilize recovery.

A method to extend small models by repeating blocks with unique adapters and normalization parameters.
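The base-block selection step above can be sketched as follows. This is a minimal illustration, assuming the criterion is the relative error that remains after the weight difference is approximated by a rank-r matrix; the paper's exact metric is not reproduced in this summary. The function names, the rank, and the toy weights are hypothetical.

```python
import numpy as np

def low_rank_residual(w_pruned, w_base, rank):
    """Relative Frobenius error left after the best rank-`rank` fit of the difference."""
    diff = w_pruned - w_base
    s = np.linalg.svd(diff, compute_uv=False)  # singular values, descending
    # Eckart-Young: the error of the best rank-r approximation is the energy
    # in the singular values beyond the first r.
    tail = np.sqrt(np.sum(s[rank:] ** 2))
    return float(tail / (np.linalg.norm(w_pruned) + 1e-12))

def pick_base_block(w_pruned, candidates, rank=4):
    """Pick the candidate whose difference from the pruned weights is most nearly low-rank."""
    scores = {name: low_rank_residual(w_pruned, w, rank) for name, w in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Toy data (assumption): one candidate differs from the target by a rank-2 update,
# the other is unrelated.
rng = np.random.default_rng(0)
w_target = rng.normal(size=(64, 64))
low_rank_delta = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64))
candidates = {
    "block_10": w_target - low_rank_delta,
    "block_20": rng.normal(size=(64, 64)),
}
best, scores = pick_base_block(w_target, candidates, rank=4)
print(best, scores)
```

A candidate that scores near zero can be recovered almost exactly by an adapter of that rank, which is the intuition behind sharing its weights.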

Key Findings

FlexiGPT gives much lower perplexity than ShortGPT after pruning.

Numbers: LLaMA-2 7B, 30% prune: PPL 6.55 vs ShortGPT 22.76

FlexiGPT is the strongest of the compared pruning methods on common zero-shot tasks at 30–40% compression.

Numbers: LLaMA-2 7B, 30%: average zero-shot 62.68% (best among pruning methods); 40%: best on all listed tasks

Pruned models save stored parameters but slightly increase inference compute.

Numbers: Stored params reduced ≈30%; normalized runtime 105.1% vs unpruned 100%

Output feature normalization and SVD-based LoRA init are essential for stable recovery.

Numbers: PPL at the start of recovery: 8648.94 with output normalization ablated vs 21.82 for the full method

Tiny models can be extended cheaply with repeated blocks and adapters.

Numbers: TinyLLaMA 22→36 layers: PPL 6.84→6.73; average zero-shot 55.41%→56.13% after 10B tokens (~0.3% extended training)
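The 22→36 layer extension can be pictured as a layer plan in which some base blocks appear twice and every repeat carries its own adapter. The sketch below is illustrative only: which blocks the paper repeats is not stated in this summary, so the evenly spread schedule, the function name, and the adapter naming are assumptions. The token-fraction line assumes TinyLLaMA's roughly 3T-token pretraining budget.

```python
def build_extension_plan(num_base_blocks, num_target_blocks):
    """Return one (base_block_index, adapter_id) pair per layer of the extended model.

    adapter_id is None for an original block; each repeated block gets a unique id.
    """
    num_extra = num_target_blocks - num_base_blocks
    if not 0 <= num_extra <= num_base_blocks:
        raise ValueError("this sketch repeats each base block at most once")
    # Hypothetical schedule: spread the repeated blocks evenly over depth.
    repeated = {(i * num_base_blocks) // num_extra for i in range(num_extra)}
    plan = []
    for block in range(num_base_blocks):
        plan.append((block, None))  # original block, no adapter
        if block in repeated:
            # Same shared weights, but a unique adapter for this repeat.
            plan.append((block, f"adapter_{len(plan)}"))
    return plan

plan = build_extension_plan(num_base_blocks=22, num_target_blocks=36)
num_repeats = sum(adapter is not None for _, adapter in plan)

# Extra training relative to pretraining, assuming a ~3T-token pretraining run.
extra_fraction = 10e9 / 3e12
print(len(plan), num_repeats, f"{extra_fraction:.2%}")
```

Only the adapters and normalization parameters of the repeated layers add stored weights, since each repeat reuses the weights of its base block.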

Results

PPL (LLaMA-2 7B, 30% prune, FlexiGPT)

Value: 6.55

Baseline: ShortGPT 22.76

Average zero-shot (LLaMA-2 7B, 30% prune, FlexiGPT)

Value: 62.68%

Baseline: Unpruned 69.02%

PPL (TinyLLaMA extend 22→36 layers, FlexiGPT Block)

Value: 6.73

Baseline: Base 6.84

Normalized compute time (inference, LLaMA-2 7B)

Value: 105.1%

Baseline: Unpruned 100.0%

Who Should Care

What To Try In 7 Days

Run block-level pruning at 30% on a 7B model and perform 1B-token recovery fine-tune using LoRA adapters initialized by SVD.

Enable output feature normalization exactly as described; validate post-prune start PPL before longer fine-tune.

Compare storage saved vs latency impact on your target device; try self-speculative decoding to recoup throughput.
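Before running anything, the storage side of that comparison can be estimated with simple parameter accounting. The sketch below uses the public LLaMA-2 7B shapes and assumes that 10 of the 32 blocks are replaced by shared weights and that each of the seven linear layers in a replaced block gets a rank-32 LoRA adapter. The block count and the rank are illustrative assumptions, not values taken from the paper.

```python
# LLaMA-2 7B shape constants (public model configuration).
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_BLOCKS = 32
VOCAB = 32000

def block_linear_params():
    """Weights in one block's linear layers: q, k, v, o plus gate, up, down."""
    attn = 4 * HIDDEN * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE
    return attn + mlp

def lora_params_per_block(rank):
    """LoRA adds rank * (in_dim + out_dim) parameters per adapted linear layer."""
    attn = 4 * rank * (HIDDEN + HIDDEN)
    mlp = 3 * rank * (HIDDEN + INTERMEDIATE)
    return attn + mlp

def total_params():
    norms = 2 * HIDDEN * NUM_BLOCKS + HIDDEN  # two RMSNorms per block + final norm
    embeddings = 2 * VOCAB * HIDDEN           # input embedding + output head
    return NUM_BLOCKS * block_linear_params() + norms + embeddings

def stored_reduction(num_shared_blocks, rank):
    """Fraction of stored parameters saved when shared blocks keep only adapters."""
    saved = num_shared_blocks * (block_linear_params() - lora_params_per_block(rank))
    return saved / total_params()

# Assumptions: 10 shared blocks, rank-32 adapters.
reduction = stored_reduction(num_shared_blocks=10, rank=32)
print(f"stored-parameter reduction: {reduction:.1%}")
```

Under these assumptions the adapters cost only about 1% of what the removed blocks stored, so the saving stays close to the fraction of blocks replaced.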

Optimization Features

Token Efficiency

  • Model extension achieved with ≈0.3% extra tokens relative to original pretraining for TinyLLaMA

Model Optimization

  • Block-level pruning guided by Block Influence score
  • Low-rank weight-sharing replacement using similar unpruned blocks
  • LoRA adapters that restore capacity in the replaced blocks
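Block Influence, as defined in the ShortGPT paper, scores a block by how much it changes its input: one minus the average cosine similarity between the hidden states entering and leaving the block. Low-scoring blocks are the pruning candidates. A minimal NumPy sketch follows; the function names and toy hidden states are hypothetical, and a real run would collect hidden states from a calibration set.

```python
import numpy as np

def block_influence(hidden_in, hidden_out, eps=1e-8):
    """1 - mean cosine similarity between a block's input and output.

    hidden_in, hidden_out: arrays of shape (num_tokens, hidden_dim).
    """
    dot = np.sum(hidden_in * hidden_out, axis=-1)
    norms = np.linalg.norm(hidden_in, axis=-1) * np.linalg.norm(hidden_out, axis=-1)
    return float(1.0 - np.mean(dot / (norms + eps)))

def select_blocks_to_prune(hidden_states, num_to_prune):
    """hidden_states[i] is the input to block i; hidden_states[i + 1] is its output."""
    scores = [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    order = np.argsort(scores)  # lowest influence first
    return sorted(int(i) for i in order[:num_to_prune]), scores

# Toy hidden states (assumption): block 1 barely changes its input, block 2 flips it.
rng = np.random.default_rng(0)
h0 = rng.normal(size=(16, 8))
h1 = rng.normal(size=(16, 8))
h2 = h1 + 0.01 * rng.normal(size=(16, 8))
h3 = -h2
pruned, scores = select_blocks_to_prune([h0, h1, h2, h3], num_to_prune=1)
print(pruned, scores)
```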

System Optimization

  • FSDP + FP16 mixed precision used in experiments

Training Optimization

  • SVD-based initialization of adapters (low-rank difference)
  • Short post-prune recovery fine-tune (1B tokens typical)
  • Continued pretraining for extension experiments (10B tokens)
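The SVD-based adapter initialization can be sketched as follows: truncate the SVD of the difference between the pruned block's weights and the shared base block's weights, then split the singular values between the two LoRA factors. The even square-root split and the function names are assumptions. Output feature normalization is left out because its exact form is not given in this summary.

```python
import numpy as np

def svd_lora_init(w_pruned, w_base, rank):
    """Return LoRA factors (B, A) with B @ A equal to the rank-`rank` truncation
    of (w_pruned - w_base)."""
    u, s, vt = np.linalg.svd(w_pruned - w_base, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    lora_b = u[:, :rank] * sqrt_s         # shape (out_dim, rank)
    lora_a = sqrt_s[:, None] * vt[:rank]  # shape (rank, in_dim)
    return lora_b, lora_a

def shared_linear(x, w_base, lora_b, lora_a):
    """Apply the shared weights plus the adapter: y = x @ (w_base + B @ A).T"""
    return x @ w_base.T + (x @ lora_a.T) @ lora_b.T

# Toy weights (assumption): the pruned block equals the base block plus a rank-3 update.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(32, 32))
delta = rng.normal(size=(32, 3)) @ rng.normal(size=(3, 32))
w_pruned = w_base + delta

lora_b, lora_a = svd_lora_init(w_pruned, w_base, rank=3)
x = rng.normal(size=(5, 32))
max_err = np.max(np.abs(shared_linear(x, w_base, lora_b, lora_a) - x @ w_pruned.T))
print(max_err)
```

With this initialization the replaced block starts recovery as close to the original block as a rank-r correction allows, instead of starting from the unmodified base block.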

Inference Optimization

  • Self-speculative decoding: draft with pruned model, verify with full FlexiGPT
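A toy sketch of the greedy draft-and-verify loop behind this idea. The two "models" are plain functions mapping a token list to the next token; in the setting above, the draft would be the cheaper pruned model and the verifier the full FlexiGPT model. The function names and toy models are hypothetical, and a real implementation would verify the whole draft in one batched forward pass rather than token by token.

```python
def speculative_greedy_decode(draft_next, verify_next, prompt, max_new_tokens, draft_len=4):
    """Greedy speculative decoding; the output matches greedy decoding with verify_next."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    accepted_drafts = 0
    while len(tokens) < target_len:
        # 1) Draft a short continuation with the cheap model.
        draft = []
        for _ in range(min(draft_len, target_len - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2) Verify: keep draft tokens for as long as the full model agrees.
        for tok in draft:
            expected = verify_next(tokens)
            if tok == expected:
                tokens.append(tok)
                accepted_drafts += 1
            else:
                tokens.append(expected)  # take the verifier's token and re-draft
                break
    return tokens, accepted_drafts

# Toy models (assumption): the verifier counts upward modulo 10;
# the draft agrees except right after token 4.
def verify_next(tokens):
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10

output, accepted = speculative_greedy_decode(draft_next, verify_next, [0], max_new_tokens=8)
print(output, accepted)
```

Throughput improves only when the draft is accepted often enough to offset the cost of drafting, so the acceptance count is worth measuring on the target workload.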

Reproducibility

Data URLs

  • SlimPajama
  • MiniPile (validation)
  • ARC-e/ARC-c
  • PIQA
  • WinoGrande
  • HellaSwag

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Does not reduce inference compute; normalized runtime increased to 105.1% vs unpruned in experiments.
  • Requires a post-prune recovery fine-tune (1B tokens) which adds compute and time.
  • Evaluated on a limited set of models and benchmarks; generality to all architectures not proven.

When Not To Use

  • If inference latency or throughput is the top constraint and you cannot add decoding tricks.
  • If you cannot afford any post-prune fine-tuning compute.
  • For architectures very different from evaluated transformer blocks without validating the weight-selection metric.

Failure Modes

  • Removing output normalization or SVD init causes catastrophic start PPL and unstable recovery.
  • Weight-sharing base selection may pick poor matches if model layers have different functions, hurting recovery.
  • Higher inference compute can make deployment infeasible on very low-power devices.

Core Entities

Models

  • LLaMA-2 7B
  • LLaMA-3 8B
  • OPT 6.7B
  • OPT 1.3B
  • TinyLLaMA 1.1B

Metrics

  • Perplexity (PPL)
  • Accuracy
  • Normalized compute time
  • Throughput

Datasets

  • SlimPajama
  • MiniPile (validation subset)
  • ARC-e
  • ARC-c
  • PIQA
  • WinoGrande
  • HellaSwag

Benchmarks

  • ARC
  • PIQA
  • WinoGrande
  • HellaSwag