Cut off top layers: keep or improve classification accuracy while cutting model size by >80%

February 18, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Shuzhou Yuan, Ercong Nie, Bolei Ma, Michael Färber

Why It Matters For Business

You can cut LLM layers to dramatically shrink model size and lower hosting and fine-tuning costs while keeping or improving classification accuracy on many few-shot tasks.

Summary TLDR

Cutting off top decoder layers and then fine-tuning with prompts lets popular decoder-only LLMs (GPT-2 XL, OPT-1.3B) lose most parameters while keeping or slightly improving accuracy on few-shot text classification. The paper shows 48→1-layer GPT-2 XL (1.6B→112M, −93% params) and 24→1-layer OPT (1.3B→157M, −88% params) achieve comparable or higher average accuracy on AGNews, EmoC, SST-2, and TREC. Topic tasks tolerate aggressive pruning; sentiment needs slightly deeper models.

Problem Statement

Large decoder-only LLMs are expensive to store and fine-tune because of their many stacked layers. The paper asks whether you can drop the top decoder layers and still adapt the model for few-shot classification with prompt-based fine-tuning, reducing memory and compute without a large loss in accuracy.

Main Contribution

Proposes top-layer dropping: remove the top k decoder layers and fine-tune the remaining layers with prompt-style training.

Runs systematic experiments on few-shot text classification (AGNews, EmoC, SST-2, TREC) using GPT-2 XL and OPT-1.3B with multiple retained-layer counts.

Shows that extreme layer reduction (down to 1–2 layers) often keeps or improves accuracy, giving large parameter, memory, and compute savings.
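The layer-dropping step can be sketched as follows. This is a minimal illustration, not the authors' code: `drop_top_layers` is a hypothetical helper, and the attribute paths shown in the trailing comments assume the Hugging Face Transformers layout for GPT-2 and OPT. The demo uses plain-Python stand-ins so it runs without downloading a model.

```python
from types import SimpleNamespace


def drop_top_layers(blocks, config, n_keep, depth_attr):
    """Keep the bottom `n_keep` decoder blocks in place and drop the rest.

    `blocks` is any mutable sequence that supports slice deletion
    (a Python list here; a torch.nn.ModuleList is assumed to behave the same).
    """
    if not 1 <= n_keep <= len(blocks):
        raise ValueError(f"n_keep must be in 1..{len(blocks)}, got {n_keep}")
    del blocks[n_keep:]                  # remove the top (last) layers
    setattr(config, depth_attr, n_keep)  # keep the config consistent
    return blocks


# Toy stand-in for a 48-layer decoder (hypothetical, no real weights).
toy_blocks = [f"block_{i}" for i in range(48)]
toy_config = SimpleNamespace(n_layer=48)

drop_top_layers(toy_blocks, toy_config, n_keep=2, depth_attr="n_layer")
print(toy_blocks, toy_config.n_layer)

# With Hugging Face Transformers (assumed layout, not verified here):
#   GPT-2: drop_top_layers(model.transformer.h, model.config, 2, "n_layer")
#   OPT:   drop_top_layers(model.model.decoder.layers, model.config, 1,
#                          "num_hidden_layers")
```

The embeddings, final layer norm, and LM head stay untouched; only the stack of decoder blocks gets shorter, after which the remaining weights are fine-tuned.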

Key Findings

GPT-2 XL (48→2 layers) improves average accuracy compared to full model under prompt-based fine-tuning

Numbers: 48-layer avg 77.04% → 2-layer avg 80.23% (Table II)

OPT-1.3B (24→1 layer) increases average accuracy in prompt-based fine-tuning

Numbers: 24-layer avg 73.00% → 1-layer avg 77.51% (Table II)

Huge parameter reductions are possible with small accuracy changes

Numbers: GPT-2 XL: 1.6B→112M (≈93% fewer params); OPT-1.3B: 1.3B→157M (≈88% fewer params) (Discussion)

Layer-dropping effects hold across fine-tuning styles and head types

Numbers: Vanilla and prompt-based fine-tuning both show small accuracy changes across layer counts (Tables II–IV)

Task sensitivity: topic classification tolerates aggressive pruning; sentiment analysis often needs deeper layers

Numbers: AGNews/TREC best with 1–2 layers; SST-2 peaks at 2–6 layers depending on model (Results §V-A)

Results

Accuracy (GPT-2 XL, prompt-based fine-tuning, average)

Value: 48-layer 77.04% → 2-layer 80.23% → 1-layer 79.53%

Baseline: 48-layer 77.04%

Accuracy (OPT-1.3B, prompt-based fine-tuning, average)

Value: 24-layer 73.00% → 1-layer 77.51%

Baseline: 24-layer 73.00%

Parameter count reduction

Value: GPT-2 XL 1.6B→112M (≈93%); OPT-1.3B 1.3B→157M (≈88%)

Baseline: Full models (1.6B, 1.3B)

Behavior across fine-tuning heads

Value: Performance trends persist with classification heads and vanilla fine-tuning

Baseline: Prompt-based LM head results
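The reported parameter counts can be sanity-checked with a back-of-the-envelope estimate. The sketch below is an illustration under stated assumptions, not a figure from the paper: `decoder_params` is a hypothetical helper, the formula assumes tied input/output embeddings and biases on every linear layer, and the architecture constants are taken from the public GPT-2 XL and OPT-1.3B configurations rather than from the source.

```python
def decoder_params(d_model, n_layers, vocab_size, n_positions, d_ff):
    """Rough parameter count for a GPT-2/OPT-style decoder.

    Assumes tied input/output embeddings, learned position embeddings,
    biases on every linear layer, and LayerNorm with weight and bias.
    """
    attention = 4 * d_model * d_model + 4 * d_model  # Q, K, V, output projections
    mlp = 2 * d_model * d_ff + d_ff + d_model        # up and down projections
    layer_norms = 2 * 2 * d_model                    # two LayerNorms per block
    per_layer = attention + mlp + layer_norms
    embeddings = (vocab_size + n_positions) * d_model
    final_norm = 2 * d_model
    return embeddings + n_layers * per_layer + final_norm


def reduction(full, pruned):
    """Fraction of parameters removed."""
    return 1 - pruned / full


# Assumed architecture constants (from the public model configs).
GPT2_XL = dict(d_model=1600, vocab_size=50257, n_positions=1024, d_ff=6400)
OPT_1_3B = dict(d_model=2048, vocab_size=50272, n_positions=2050, d_ff=8192)

for name, cfg, depth in [("GPT-2 XL", GPT2_XL, 48), ("OPT-1.3B", OPT_1_3B, 24)]:
    full = decoder_params(n_layers=depth, **cfg)
    one = decoder_params(n_layers=1, **cfg)
    print(f"{name}: {full / 1e6:.0f}M -> {one / 1e6:.0f}M "
          f"({reduction(full, one):.0%} fewer)")
```

Under these assumptions the 1-layer estimates land within about 1M parameters of the reported 112M and 157M. They also show why the count does not fall further: in a 1-layer model the token and position embeddings account for most of the remaining parameters.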

What To Try In 7 Days

Take a decoder-only model (e.g., GPT-2 XL), drop top layers (try 1, 2, 6), and run prompt-based few-shot fine-tuning on your classification task.

Measure memory, disk size, latency, and validation accuracy; pick the shallowest model that meets your accuracy and latency targets.

If accuracy drops, try keeping more layers for sentiment-like tasks; compare LM head vs classification head performance.
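The selection step above can be sketched as a small helper. `Candidate` and `pick_shallowest` are hypothetical names, and the numbers in `sweep` are placeholders for illustration only, not measurements from the paper; substitute your own validation accuracy and latency.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    layers: int
    accuracy: float    # validation accuracy in percent
    latency_ms: float  # measured per-request latency


def pick_shallowest(candidates, min_accuracy, max_latency_ms):
    """Return the candidate with the fewest layers that meets both targets, or None."""
    ok = [c for c in candidates
          if c.accuracy >= min_accuracy and c.latency_ms <= max_latency_ms]
    return min(ok, key=lambda c: c.layers, default=None)


# Placeholder numbers for illustration only; replace with your own measurements.
sweep = [
    Candidate(layers=1, accuracy=78.0, latency_ms=4.0),
    Candidate(layers=2, accuracy=81.0, latency_ms=6.0),
    Candidate(layers=6, accuracy=82.0, latency_ms=15.0),
]
choice = pick_shallowest(sweep, min_accuracy=80.0, max_latency_ms=20.0)
print(choice)
```

If no candidate meets the targets the helper returns `None`, which is the signal to keep more layers or relax a target.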

Optimization Features

Infra Optimization

  • Enables deployment on memory-limited hardware by using 1–2 layer variants

Model Optimization

  • Layer-wise structured pruning (top-layer dropping)
  • Remove entire decoder layers to reduce parameters

System Optimization

  • Smaller checkpoints lower storage and transfer costs
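As a rough illustration of the storage effect, checkpoint size scales with parameter count times bytes per parameter. The sketch below uses the parameter counts reported for GPT-2 XL; `checkpoint_gb` is a hypothetical helper and ignores file-format overhead and optimizer state.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def checkpoint_gb(n_params, dtype="fp32"):
    """Approximate size of the raw weights in gigabytes (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for label, n_params in [("GPT-2 XL, 48 layers", 1.6e9),
                        ("GPT-2 XL, 1 layer", 112e6)]:
    print(f"{label}: {checkpoint_gb(n_params, 'fp32'):.2f} GB fp32, "
          f"{checkpoint_gb(n_params, 'fp16'):.2f} GB fp16")
```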

Training Optimization

  • Prompt-based fine-tuning (cloze-style prompts) for few-shot adaptation
  • Use same hyperparameters across layer sizes to isolate layer effects
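A cloze-style prompt turns classification into predicting a label word after the input. The sketch below shows only the mechanics: the template, the verbalizer words, and the scores are hypothetical and are not the prompts or outputs used in the paper.

```python
# Hypothetical template and verbalizer for a sentiment task.
TEMPLATE = "Review: {text} Sentiment:"
VERBALIZER = {"negative": " bad", "positive": " good"}


def build_prompt(text):
    """Wrap the raw input in the cloze-style template."""
    return TEMPLATE.format(text=text.strip())


def predict_label(next_token_scores):
    """Pick the label whose verbalizer word scores highest.

    `next_token_scores` maps candidate next words to scores, e.g. logits
    read from the LM head at the position right after the prompt.
    Words outside the verbalizer are ignored.
    """
    return max(
        VERBALIZER,
        key=lambda label: next_token_scores.get(VERBALIZER[label], float("-inf")),
    )


prompt = build_prompt("  A warm, funny film. ")
# Made-up scores to show the mechanics, not model output.
label = predict_label({" good": 2.1, " bad": -0.3, " the": 5.0})
print(prompt, "->", label)
```

During fine-tuning the model is trained so that the correct label word receives the highest score at that position, which lets the pretrained LM head be reused instead of adding a new classification head.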

Inference Optimization

  • Lower parameter count reduces memory footprint and compute per token
  • Simpler models can reduce inference latency and hosting cost

Reproducibility

Data Urls

  • AGNews (public)
  • EmoC (public)
  • SST-2 (public)
  • TREC (public)

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments limited to few-shot text classification, not generation or complex reasoning.
  • Only two decoder-only model families (GPT-2 XL, OPT-1.3B) were tested.
  • Batch size and compute were small (batch size 1); results may vary with different training regimes.

When Not To Use

  • For generative or reasoning-heavy tasks, which the experiments do not cover.
  • When the task requires deep contextual or multi-hop reasoning; even sentiment-like tasks sometimes benefit from more layers.
  • If you need an off-the-shelf model that preserves full pretrained behavior across many downstream tasks.

Failure Modes

  • Over-pruning can degrade accuracy on nuanced tasks (e.g., some sentiment cases).
  • Unexpected distribution shift may require deeper layers; shallow models may fail to generalize.
  • Benchmarks used are relatively simple; real-world inputs may expose missing deep representations.

Core Entities

Models

  • GPT-2 XL (48-layer, 1.6B)
  • OPT-1.3B (24-layer, 1.3B)

Metrics

  • Accuracy

Datasets

  • AGNews
  • EmoC (EmoContext)
  • SST-2
  • TREC

Context Entities

Metrics

  • parameter count
  • Accuracy