Cut off top layers: keep or improve classification accuracy while cutting model size by >80%

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

Authors

Shuzhou Yuan, Ercong Nie, Bolei Ma, Michael Färber

Links

Abstract / PDF

Why It Matters For Business

Dropping the top decoder layers of an LLM can shrink it by more than 80%, lowering hosting and fine-tuning costs while keeping or improving classification accuracy on many few-shot tasks.

Summary TLDR

Cutting off top decoder layers and then fine-tuning with prompts lets popular decoder-only LLMs (GPT-2 XL, OPT-1.3B) lose most parameters while keeping or slightly improving accuracy on few-shot text classification. The paper shows 48→1-layer GPT-2 XL (1.6B→112M, −93% params) and 24→1-layer OPT (1.3B→157M, −88% params) achieve comparable or higher average accuracy on AGNews, EmoC, SST-2, and TREC. Topic tasks tolerate aggressive pruning; sentiment needs slightly deeper models.

Problem Statement

Large decoder-only LLMs are expensive to store and fine-tune because of many stacked layers. The paper asks whether you can drop top decoder layers and still adapt models for few-shot classification using prompt-based fine-tuning, thereby reducing memory and compute without large accuracy loss.

Main Contribution

Propose top-layer dropping: remove the highest k decoder layers and fine-tune remaining layers with prompt-style training.

Run systematic experiments on few-shot text classification (AGNews, EmoC, SST-2, TREC) using GPT-2 XL and OPT-1.3B with multiple retained-layer counts.

Show that extreme layer reduction (down to 1–2 layers) often keeps or improves accuracy, giving large parameter, memory, and compute savings.
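
As a rough illustration of top-layer dropping, the sketch below treats a decoder as an ordered list of layers and keeps only the bottom ones. The function names and the toy layers are hypothetical and not taken from the paper. In Hugging Face Transformers the GPT-2 blocks live in `model.transformer.h` and the OPT blocks in `model.model.decoder.layers`, so the same slicing idea would likely apply there, assuming the layer count in the model config is updated to match.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]  # toy stand-in for a decoder block


def keep_bottom_layers(layers: Sequence[Layer], num_keep: int) -> List[Layer]:
    """Return the lowest `num_keep` layers, dropping everything above them."""
    if not 1 <= num_keep <= len(layers):
        raise ValueError(f"num_keep must be in [1, {len(layers)}], got {num_keep}")
    return list(layers[:num_keep])


def forward(layers: Sequence[Layer], x: float) -> float:
    """Run the input through the layers from bottom to top."""
    for layer in layers:
        x = layer(x)
    return x


# Toy 48-layer "decoder": each layer adds 1 so the effective depth is visible.
full_stack = [lambda h: h + 1.0 for _ in range(48)]
pruned_stack = keep_bottom_layers(full_stack, num_keep=2)

print(len(full_stack), len(pruned_stack))  # 48 2
print(forward(pruned_stack, 0.0))          # 2.0
```

The embeddings and output head sit outside this list, which is why they survive pruning and why the remaining layers still need fine-tuning to feed the head directly.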

Key Findings

GPT-2 XL (48→2 layers) improves average accuracy compared to full model under prompt-based fine-tuning

Numbers: 48-layer avg 77.04% → 2-layer avg 80.23% (Table II)

OPT-1.3B (24→1 layer) increases average accuracy in prompt-based fine-tuning

Numbers: 24-layer avg 73.00% → 1-layer avg 77.51% (Table II)

Huge parameter reductions are possible with small accuracy changes

Numbers: GPT-2 XL: 1.6B→112M (≈93% fewer params); OPT-1.3B: 1.3B→157M (≈88% fewer) (Discussion)

Layer-dropping effects hold across fine-tuning styles and head types

Numbers: Vanilla and prompt-based fine-tuning both show small accuracy changes across layer counts (Tables II–IV)

Task sensitivity: topic classification tolerates aggressive pruning; sentiment analysis often needs deeper layers

Numbers: AGNews/TREC best with 1–2 layers; SST-2 peaks at 2–6 layers depending on model (Results §V-A)
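
The parameter reductions quoted above can be sanity-checked with a back-of-the-envelope estimate. This is a sketch, not the paper's accounting: it ignores biases and layer norms, assumes a 4x feed-forward width and tied input/output embeddings, and takes the architecture values (hidden size, vocabulary size, position count) from the public model configs rather than from this summary.

```python
from typing import Dict


def estimate_params(d_model: int, vocab_size: int, n_positions: int, n_layers: int) -> int:
    """Rough decoder-only parameter count, ignoring biases and layer norms.

    Each block holds about 4*d^2 attention weights and 8*d^2 MLP weights
    (assuming a 4x feed-forward width), so about 12*d^2 in total.
    """
    embeddings = (vocab_size + n_positions) * d_model
    per_layer = 12 * d_model * d_model
    return embeddings + n_layers * per_layer


def reduction_pct(full: int, pruned: int) -> float:
    """Percentage of parameters removed when going from `full` to `pruned`."""
    return 100.0 * (1.0 - pruned / full)


# Assumed architecture values (from public configs, not from the paper summary).
CONFIGS: Dict[str, Dict[str, int]] = {
    "GPT-2 XL": {"d_model": 1600, "vocab_size": 50257, "n_positions": 1024, "full_layers": 48},
    "OPT-1.3B": {"d_model": 2048, "vocab_size": 50272, "n_positions": 2050, "full_layers": 24},
}

for name, cfg in CONFIGS.items():
    arch = {k: v for k, v in cfg.items() if k != "full_layers"}
    full = estimate_params(n_layers=cfg["full_layers"], **arch)
    one = estimate_params(n_layers=1, **arch)
    print(f"{name}: full ~{full / 1e6:.0f}M, 1 layer ~{one / 1e6:.0f}M, "
          f"reduction ~{reduction_pct(full, one):.1f}%")
```

Under these assumptions the one-layer estimates land close to the 112M and 157M figures reported in the Discussion, which suggests that most of what remains after pruning is the embedding table.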

Results

Accuracy (GPT-2 XL, average)

Value: 48-layer 77.04% → 2-layer 80.23% → 1-layer 79.53%

Baseline: 48-layer 77.04%

Accuracy (OPT-1.3B, average)

Value: 24-layer 73.00% → 1-layer 77.51%

Baseline: 24-layer 73.00%

Parameter count reduction

Value: GPT-2 XL 1.6B→112M (≈93%); OPT-1.3B 1.3B→157M (≈88%)

Baseline: Full models (1.6B, 1.3B)

Behavior across fine-tuning heads

Value: Performance trends persist with classification heads and vanilla fine-tuning

Baseline: Prompt-based LM head results

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

What To Try In 7 Days

Take a decoder-only model (e.g., GPT-2 XL), drop its top layers (try keeping 1, 2, or 6), and run prompt-based few-shot fine-tuning on your classification task.

Measure memory, disk size, latency, and validation accuracy; pick the shallowest model that meets your accuracy and latency targets.

If accuracy drops, try keeping more layers for sentiment-like tasks; compare LM head vs classification head performance.
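
The selection step above can be sketched as a small helper that picks the shallowest variant meeting both targets. The `Candidate` record, the function name, and the measurements in the usage lines are hypothetical placeholders for illustration, not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    layers_kept: int
    val_accuracy: float  # fraction in [0, 1], measured on your validation set
    latency_ms: float    # measured per-request latency on your hardware


def pick_shallowest(
    candidates: Iterable[Candidate],
    min_accuracy: float,
    max_latency_ms: float,
) -> Optional[Candidate]:
    """Return the candidate with the fewest layers that meets both targets."""
    eligible = [
        c for c in candidates
        if c.val_accuracy >= min_accuracy and c.latency_ms <= max_latency_ms
    ]
    return min(eligible, key=lambda c: c.layers_kept, default=None)


# Placeholder measurements; replace with numbers from your own runs.
runs = [
    Candidate(layers_kept=1, val_accuracy=0.84, latency_ms=4.0),
    Candidate(layers_kept=2, val_accuracy=0.88, latency_ms=6.0),
    Candidate(layers_kept=6, val_accuracy=0.89, latency_ms=15.0),
]
choice = pick_shallowest(runs, min_accuracy=0.87, max_latency_ms=10.0)
print(choice.layers_kept if choice else "no variant meets the targets")
```

Returning `None` when nothing qualifies makes the "keep more layers" fallback explicit instead of silently choosing an under-performing model.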

Optimization Features

Infra Optimization

Enables deployment on memory-limited hardware by using 1–2 layer variants

Model Optimization

Layer-wise structured pruning (top-layer dropping)
Remove entire decoder layers to reduce parameters

System Optimization

Smaller checkpoints lower storage and transfer costs

Training Optimization

Prompt-based fine-tuning (cloze-style prompts) for few-shot adaptation
Use same hyperparameters across layer sizes to isolate layer effects
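
To make the cloze-style setup concrete, here is a minimal sketch of how a prompt and label words might be wired together for a decoder-only model. The template, the label words, and the score dictionary are assumptions for illustration; this summary does not state the paper's actual templates or verbalizers.

```python
from typing import Mapping

# Hypothetical template: the label word is predicted as the next token.
TEMPLATE = "Review: {text}\nSentiment:"

# Hypothetical verbalizer mapping each class to a single label word.
VERBALIZER = {"positive": " great", "negative": " terrible"}


def build_prompt(text: str) -> str:
    """Wrap the raw input so the LM head is asked for a label word."""
    return TEMPLATE.format(text=text.strip())


def predict_label(
    next_token_scores: Mapping[str, float],
    verbalizer: Mapping[str, str] = VERBALIZER,
) -> str:
    """Pick the class whose label word scores highest at the next-token position."""
    return max(verbalizer, key=lambda label: next_token_scores[verbalizer[label]])


prompt = build_prompt("  A warm, funny film.  ")
print(prompt)

# Stand-in for LM logits restricted to the label words.
toy_scores = {" great": 2.3, " terrible": -0.7}
print(predict_label(toy_scores))  # positive
```

During fine-tuning, the loss would typically be computed only over the label words, which lets the pretrained LM head be reused rather than training a new classifier from scratch.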

Inference Optimization

Lower parameter count reduces memory footprint and compute per token
Simpler models can reduce inference latency and hosting cost
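
As a rough guide to the memory savings, weight storage scales linearly with parameter count. The sketch below uses the parameter counts reported above and counts weights only, so activations, the KV cache, and optimizer state are out of scope. The helper name and the dtype table are illustrative assumptions.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gb(num_params: int, dtype: str = "fp16") -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts as reported in the summary.
MODELS = {
    "GPT-2 XL (48 layers)": 1_600_000_000,
    "GPT-2 XL (1 layer)": 112_000_000,
    "OPT-1.3B (24 layers)": 1_300_000_000,
    "OPT-1.3B (1 layer)": 157_000_000,
}

for name, params in MODELS.items():
    print(f"{name}: ~{weight_memory_gb(params):.2f} GB in fp16")
```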

Reproducibility

Data Urls

AGNews (public)
EmoC (public)
SST-2 (public)
TREC (public)

Data Available

Open Source Status

partial

Risks & Boundaries

Limitations

Experiments limited to few-shot text classification, not generation or complex reasoning.
Only two decoder-only model families (GPT-2 XL, OPT-1.3B) were tested.
Training used a small compute budget with batch size 1; results may vary under different training regimes.

When Not To Use

For generative tasks or reasoning-heavy tasks not covered by experiments.
When the task requires deep contextual or multi-hop reasoning; even sentiment-like tasks sometimes benefit from more layers.
If you need an off-the-shelf model that preserves full pretrained behavior across many downstream tasks.

Failure Modes

Over-pruning can degrade accuracy on nuanced tasks (e.g., some sentiment cases).
Unexpected distribution shift may require deeper layers; shallow models may fail to generalize.
Benchmarks used are relatively simple; real-world inputs may expose missing deep representations.
