Cut LLM size by pruning neurons used outside your domain, keep performance after short fine-tuning

December 20, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

3

Authors

Tim Valicenti, Justice Vidal, Ritik Patnaik

Links

Abstract / PDF

Why It Matters For Business

Contextual pruning can cut domain-model size quickly (often ~10%) and enable cheaper on-prem or lower-latency deployments while keeping task performance after short fine-tuning.

Summary TLDR

The paper introduces "contextual pruning": remove neurons and embedding tokens that are rarely used in a specific domain, then fine-tune the trimmed model. At mild pruning thresholds (≈10^-3) the authors shrink models by ~10% and keep or improve perplexity and multiple-choice accuracy after one epoch of fine-tuning. Aggressive pruning (≈10^-1) can cut size by up to ~42% but causes huge temporary perplexity spikes, needs many fine-tuning epochs, and can overfit. Experiments use Phi-1.5, Opt-1.3, and Llama-1.3 on domain datasets (medical, legal, gaming, translation, economics). Code is available.

Problem Statement

Large open LLMs carry weights and token parameters irrelevant to a specific business domain. That unused capacity raises cost, latency, and on-prem barriers. The paper asks: can we identify and remove domain-unused neurons and embeddings (contextual pruning) to produce smaller domain models while preserving task performance?

Main Contribution

Define a practical contextual pruning pipeline that measures per-neuron L1 activation per domain and prunes low-usage rows/columns in linear and activation layers.

Extend pruning to embedding tokens by token-frequency comparison across domains.

Empirically evaluate pruning on three 1–1.4B parameter models (Phi-1.5, Opt-1.3, Llama-1.3) across five domain datasets and report perplexity, MCQ accuracy, recovery epochs, and relative size.

Show mild contextual pruning (~10^-3 threshold) yields ~10% size reduction with preserved or improved perplexity after short fine-tuning, and analyze limits when pruning aggressively (~10^-1).
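
The scoring step behind the first contribution can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the per-neuron score is the mean absolute (L1) activation over a calibration set, and the function names and toy activation values are hypothetical. Only the 1e-3 threshold comes from the summary above.

```python
from typing import List


def mean_abs_activation(activations: List[List[float]]) -> List[float]:
    """Average |activation| per neuron.

    activations[i][j] is the output of neuron j on calibration token i.
    Averaging over tokens is an assumption made for this sketch.
    """
    n_tokens = len(activations)
    n_neurons = len(activations[0])
    totals = [0.0] * n_neurons
    for row in activations:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    return [total / n_tokens for total in totals]


def keep_mask(scores: List[float], threshold: float) -> List[bool]:
    """True for neurons whose score reaches the threshold (they are kept)."""
    return [score >= threshold for score in scores]


# Toy calibration activations: 3 tokens x 4 neurons (made-up values).
acts = [
    [0.50, 0.0002, -1.20, 0.0],
    [0.30, -0.0004, 0.90, 0.0],
    [-0.10, 0.0003, 0.40, 0.0],
]
scores = mean_abs_activation(acts)
mask = keep_mask(scores, threshold=1e-3)  # mild threshold from the summary
print(mask)
```

In this toy case the second and fourth neurons fall below the threshold and would be candidates for removal; the same scores computed on a different domain corpus could select a different set.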

Key Findings

Mild contextual pruning (linear+activation threshold 1e-3, embed prune <=0) reduced usable model size by ~10% while preserving or improving perplexity after brief fine-tuning.

Numbers: Phi-1.5 Medical: relative size 90.134%, perplexity 4.64→4.579→2.722 (base→post-prune→fine-tune) (Table 3)

Aggressive pruning (threshold 1e-1) can reduce model size by up to ~41.9% but often breaks performance and requires many fine-tune epochs to recover.

Numbers: Phi-1.5 relative size 58.116% with post-prune perplexity spike 4.64→35417.9 and 25 recovery epochs (Table 5)

Task-level accuracy (100-domain MCQs) is mostly preserved, with only slight changes, under mild pruning; aggressive pruning often leaves MCQ accuracy reduced even after fine-tuning, suggesting overfitting.

Numbers: Phi-1.5 Skyrim MCQ: 62%→63%→63% at mild prune; at aggressive prune: 62%→28%→32% (Table 4 and Table 6)
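
The "relative size" and "size reduction" figures quoted in these findings are two views of the same number. The short sketch below converts between them and expresses the aggressive-prune perplexity spike as a ratio. The helper names are hypothetical; the inputs are the Phi-1.5 values reported above.

```python
def size_reduction_pct(relative_size_pct: float) -> float:
    """Percentage of parameters removed, given the percentage retained."""
    return 100.0 - relative_size_pct


def perplexity_ratio(before: float, after: float) -> float:
    """How many times larger (or smaller) perplexity became."""
    return after / before


mild_cut = size_reduction_pct(90.134)        # mild prune, threshold 1e-3
aggressive_cut = size_reduction_pct(58.116)  # aggressive prune, threshold 1e-1
spike = perplexity_ratio(4.64, 35417.9)      # base -> post-prune, aggressive

print(f"mild cut: {mild_cut:.3f}%")
print(f"aggressive cut: {aggressive_cut:.3f}%")
print(f"post-prune perplexity is {spike:.0f}x the base value")
```

The mild setting therefore removes just under 10% of parameters and the aggressive setting just under 42%, at the price of a perplexity spike of several thousand times the base value before fine-tuning.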

Results

Perplexity (Phi-1.5, Medical, mild prune)

Value: Base 4.64 → Post-prune 4.579 → Fine-tune 2.722

Baseline: Base 4.64

Relative size (Phi-1.5, mild prune)

Value: 90.134% (usable parameters retained)

Baseline: 100%

Perplexity spike (Phi-1.5, Medical, aggressive prune)

Value: Post-prune 35417.9 → Fine-tune 4.312 after 25 epochs

Baseline: Base 4.64

Accuracy (Phi-1.5, Skyrim MCQ, mild prune)

Value: Base 62% → Post-prune 63% → Fine-tune 63%

Baseline: 62%

Accuracy (Phi-1.5, Skyrim MCQ, aggressive prune)

Value: Base 62% → Post-prune 28% → Fine-tune 32%

Baseline: 62%

Who Should Care

What To Try In 7 Days

Gather a small representative domain corpus (thousands of examples).

Compute per-neuron L1 activation across that corpus and apply a mild threshold (e.g., 1e-3).

Prune the matching rows/columns in linear and activation layers, and remove low-frequency tokens from the embeddings with caution (this needs a large calibration set).

Fine-tune for 1–5 epochs and measur
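
The row/column pruning step in the list above can be sketched on a toy two-layer block. This is an illustration under stated assumptions, not the paper's implementation: weights are plain nested lists, the first layer stores one row per hidden neuron, the second layer stores one column per hidden neuron, and all names and values are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def prune_hidden_neurons(
    w1: Matrix, b1: List[float], w2: Matrix, keep: List[bool]
) -> Tuple[Matrix, List[float], Matrix]:
    """Drop hidden neurons from two consecutive linear layers.

    Removing hidden neuron j deletes row j of w1, entry j of b1,
    and column j of w2, so the two layers stay shape-compatible.
    """
    kept = [j for j, flag in enumerate(keep) if flag]
    new_w1 = [w1[j] for j in kept]
    new_b1 = [b1[j] for j in kept]
    new_w2 = [[row[j] for j in kept] for row in w2]
    return new_w1, new_b1, new_w2


def param_count(w1: Matrix, b1: List[float], w2: Matrix) -> int:
    """Total number of stored weights and biases."""
    return sum(len(r) for r in w1) + len(b1) + sum(len(r) for r in w2)


# Toy block: 2 inputs -> 4 hidden neurons -> 3 outputs (made-up weights).
w1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
b1 = [0.0, 0.1, 0.2, 0.3]
w2 = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
keep = [True, False, True, False]  # e.g. from an activation-based mask

before = param_count(w1, b1, w2)
p_w1, p_b1, p_w2 = prune_hidden_neurons(w1, b1, w2, keep)
after = param_count(p_w1, p_b1, p_w2)
print(f"relative size: {100.0 * after / before:.1f}%")
```

Because whole rows and columns are removed rather than zeroed, the pruned block is genuinely smaller in memory, which is what makes the relative-size metric meaningful for deployment.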

Optimization Features

Token Efficiency

  • embed pruning by token frequency (needs large calibration sets)

Infra Optimization

  • enables domain models with smaller memory footprint for deployment

Model Optimization

  • contextual pruning (per-neuron L1 norm)
  • linear layer pruning
  • activation layer pruning (GeLU/ReLU outputs)
  • embedding token pruning (by dataset token frequency)
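
Embedding token pruning by dataset frequency, the last item above, might look roughly like the sketch below. It assumes tokens are dropped when their count in the calibration corpus is at or below a cutoff (0 here, which is one reading of the "embed prune <=0" setting) and that a few special tokens are always kept. The function name, the `always_keep` parameter, and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def prune_embeddings(
    embedding: List[List[float]],
    token_ids: Iterable[int],
    min_count: int = 0,
    always_keep: Iterable[int] = (),
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Keep embedding rows for tokens seen more than min_count times.

    Returns the smaller embedding table and an old-id -> new-id map
    that the tokenizer output must be passed through afterwards.
    """
    counts = Counter(token_ids)
    protected = frozenset(always_keep)
    kept = [
        t for t in range(len(embedding))
        if counts[t] > min_count or t in protected
    ]
    remap = {old: new for new, old in enumerate(kept)}
    return [embedding[t] for t in kept], remap


# Toy vocabulary of 6 tokens with 1-dimensional embeddings (made-up values).
table = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5]]
corpus_ids = [1, 1, 3, 4, 3, 1]  # token ids observed in the domain corpus
new_table, remap = prune_embeddings(table, corpus_ids, always_keep=(0,))
print(len(new_table), remap)
```

A small calibration corpus will miss rare but legitimate domain tokens, so a sketch like this deletes them; that is the reason the summary repeatedly asks for a large calibration set before pruning embeddings.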

System Optimization

  • pruning applied post-pretraining then fine-tune

Training Optimization

  • short fine-tune to recover perplexity
  • measure recovery epochs as stopping criterion
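
One way to turn "recovery epochs" into a stopping rule is sketched below. The definition used here, the first fine-tuning epoch whose perplexity is at or below the pre-prune value, is an assumption for illustration, and the per-epoch perplexity curve is invented rather than taken from the paper.

```python
from typing import Optional, Sequence


def recovery_epochs(
    base_perplexity: float, perplexity_per_epoch: Sequence[float]
) -> Optional[int]:
    """First 1-indexed epoch whose perplexity is <= the base value.

    Returns None when the model never recovers within the given epochs.
    """
    for epoch, ppl in enumerate(perplexity_per_epoch, start=1):
        if ppl <= base_perplexity:
            return epoch
    return None


# Hypothetical fine-tuning curve after pruning; base perplexity 4.64.
curve = [9.8, 6.1, 4.9, 4.5, 4.3]
recovered_at = recovery_epochs(4.64, curve)
never = recovery_epochs(4.64, [9.8, 8.7, 8.1])
print(recovered_at, never)
```

Stopping at the recovery epoch rather than training further is one simple guard against the overfitting seen after aggressive pruning, though task accuracy should still be checked separately.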

Inference Optimization

  • reduced usable parameters → lower memory/possible latency gains
  • smaller domain models easier for on-prem use

Reproducibility

Data Urls

  • wikitext-2-raw-v1 (HuggingFace)
  • lexlms (listed in paper)
  • Laurent1/MedQuad-MedicalQnADataset (HuggingFace name given)
  • zetavg/coct-en-zh-tw-translations-twp-300k (listed)
  • sentiment-lexicon-skyrim (listed)
  • tinymlFP (economics_text) (listed)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Embedding pruning needs large calibration sets to avoid deleting rare but important tokens.
  • Aggressive pruning can cause massive perplexity spikes and long recovery; results are sensitive to threshold choice.
  • Experiments limited to three older open models; modern larger architectures not evaluated.
  • Quantitative deploy metrics (latency, memory usage at inference) are not reported in detail.

When Not To Use

  • When you need a broadly general model across many domains without per-domain fine-tuning.
  • When you lack representative domain fine-tune data or compute for recovery epochs.
  • When token-level coverage is critical and you cannot collect large calibration corpora.

Failure Modes

  • Immediate large perplexity increases after aggressive pruning.
  • Long fine-tune required or inability to recover baseline performance.
  • Downstream task accuracy drops due to overfitting during recovery fine-tune.
  • Removing rare but vital tokens via embedding pruning.

Core Entities

Models

  • Phi-1.5
  • Opt-1.3
  • Llama-1.3

Metrics

  • perplexity
  • Accuracy
  • relative size (%)
  • recovery epochs

Datasets

  • wikitext-2-raw-v1 (general test)
  • lexlms (US law)
  • Laurent1/MedQuad-MedicalQnADataset (medical Q&A)
  • zetavg/coct-en-zh-tw-translations-twp-300k (English-Taiwanese translation)
  • sentiment-lexicon-skyrim (Skyrim transcript)
  • tinymlFP (economics_text)

Context Entities

Models

  • Phi-1.5
  • Opt-1.3
  • Llama-1.3

Metrics

  • same as core metrics

Datasets

  • same as core datasets above