Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
Dropping the top decoder layers of an LLM can remove roughly 90% of its parameters, lowering hosting and fine-tuning costs while keeping or improving accuracy on several few-shot classification tasks.
Summary TLDR
Dropping the top decoder layers and then fine-tuning with prompts lets popular decoder-only LLMs (GPT-2 XL, OPT-1.3B) shed most of their parameters while keeping or slightly improving accuracy on few-shot text classification. The paper shows that a 1-layer GPT-2 XL (48→1 layers, 1.6B→112M, −93% params) and a 1-layer OPT-1.3B (24→1 layers, 1.3B→157M, −88% params) achieve comparable or higher average accuracy on AGNews, EmoC, SST-2, and TREC. Topic tasks tolerate aggressive pruning; sentiment tasks need slightly deeper models.
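The parameter figures above can be sanity-checked with a back-of-the-envelope count. The sketch below assumes the standard GPT-2/OPT block layout (attention plus a 4x MLP, two LayerNorms, LM head tied to the token embedding) and uses hidden sizes, vocabulary sizes, and position-table sizes taken from the public model configs; none of these values are stated in this summary, and the paper may count parameters slightly differently.

```python
def decoder_param_count(d_model, vocab_size, n_positions, n_layers):
    """Approximate parameter count of a GPT-2/OPT-style decoder with a tied LM head."""
    # Token embedding plus learned position embedding.
    embeddings = vocab_size * d_model + n_positions * d_model
    # Per block: attention (4*d^2 + 4*d), MLP with 4x expansion (8*d^2 + 5*d),
    # and two LayerNorms (4*d).
    per_layer = 12 * d_model**2 + 13 * d_model
    # Final LayerNorm after the last block.
    final_norm = 2 * d_model
    return embeddings + n_layers * per_layer + final_norm


def reduction_percent(full, pruned):
    """Percentage of parameters removed, rounded to the nearest integer."""
    return round(100 * (1 - pruned / full))


# Assumed public config values (not given in the summary above).
GPT2_XL = dict(d_model=1600, vocab_size=50257, n_positions=1024)
OPT_1_3B = dict(d_model=2048, vocab_size=50272, n_positions=2050)

gpt2_full = decoder_param_count(n_layers=48, **GPT2_XL)
gpt2_one = decoder_param_count(n_layers=1, **GPT2_XL)
opt_full = decoder_param_count(n_layers=24, **OPT_1_3B)
opt_one = decoder_param_count(n_layers=1, **OPT_1_3B)

print(gpt2_one, reduction_percent(gpt2_full, gpt2_one))
print(opt_one, reduction_percent(opt_full, opt_one))
```

Under these assumptions the 1-layer variants come out in the 112M and 157M ranges, with reductions that round to 93% and 88%. The count also shows why savings flatten at very small depths: once only one block remains, the embedding tables dominate.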
Problem Statement
Large decoder-only LLMs are expensive to store and fine-tune because of their many stacked layers. The paper asks whether you can drop the top decoder layers and still adapt the model for few-shot classification using prompt-based fine-tuning, thereby reducing memory and compute without a large accuracy loss.
Main Contribution
Propose top-layer dropping: remove the highest k decoder layers and fine-tune remaining layers with prompt-style training.
Systematic experiments on few-shot text classification (AGNews, EmoC, SST-2, TREC) using GPT-2 XL and OPT-1.3B with multiple retained-layer counts.
Show that extreme layer reduction (down to 1–2 layers) often keeps or improves accuracy, giving large parameter, memory, and compute savings.
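As a framework-free illustration of what "top-layer dropping" means structurally, the toy sketch below keeps the embedding and the output head and truncates the layer stack from the output side. `ToyDecoder` and `drop_top_layers` are hypothetical names made up for this example; this is not the paper's code, and in a real framework the analogous step would be truncating the model's list of decoder blocks before fine-tuning.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple

Layer = Callable[[float], float]


@dataclass(frozen=True)
class ToyDecoder:
    """Toy stand-in for a decoder: embed -> stacked layers -> head."""
    embed: Layer
    layers: Tuple[Layer, ...]  # index 0 is closest to the input, the last is the top
    head: Layer

    def __call__(self, x: float) -> float:
        h = self.embed(x)
        for layer in self.layers:
            h = layer(h)
        return self.head(h)


def drop_top_layers(model: ToyDecoder, k: int) -> ToyDecoder:
    """Return a copy of `model` without its k highest layers (at least one is kept)."""
    if not 0 <= k < len(model.layers):
        raise ValueError("k must leave at least one layer in place")
    keep = len(model.layers) - k
    # Embedding and head are reused unchanged; only the stack is shortened.
    return replace(model, layers=model.layers[:keep])


# Four toy layers that add 1, 2, 3, 4; the head doubles the result.
full = ToyDecoder(
    embed=lambda x: x,
    layers=tuple((lambda h, i=i: h + i) for i in range(1, 5)),
    head=lambda h: h * 2,
)
pruned = drop_top_layers(full, k=3)  # keep only the bottom layer
print(len(full.layers), len(pruned.layers))
```

The remaining layers would then be fine-tuned with prompts, as described in the contribution list above; the toy omits training entirely.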
Key Findings
GPT-2 XL (48→2 layers) improves average accuracy compared to the full model under prompt-based fine-tuning
OPT-1.3B (24→1 layer) increases average accuracy in prompt-based fine-tuning
Huge parameter reductions are possible with small accuracy changes
Layer-dropping effects hold across fine-tuning styles and head types
Task sensitivity: topic classification tolerates aggressive pruning; sentiment analysis often needs deeper layers
Results
Accuracy
Parameter count reduction
Behavior across fine-tuning heads
Who Should Care
What To Try In 7 Days
Take a decoder-only model (e.g., GPT-2 XL), drop its top layers (try keeping 1, 2, or 6 layers), and run prompt-based few-shot fine-tuning on your classification task.
Measure memory, disk size, latency, and validation accuracy; pick the shallowest model that meets your accuracy and latency targets.
If accuracy drops, try keeping more layers for sentiment-like tasks; compare LM head vs classification head performance.
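The selection step above ("pick the shallowest model that meets your targets") can be sketched as a small filter. The `Candidate` record, the `pick_shallowest` helper, and every number in `measurements` are hypothetical placeholders for your own measurements, not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Candidate:
    n_layers: int       # number of decoder layers kept
    accuracy: float     # validation accuracy in [0, 1]
    latency_ms: float   # measured inference latency per example


def pick_shallowest(
    candidates: Iterable[Candidate],
    min_accuracy: float,
    max_latency_ms: float,
) -> Optional[Candidate]:
    """Return the candidate with the fewest layers that meets both targets, or None."""
    ok = [
        c for c in candidates
        if c.accuracy >= min_accuracy and c.latency_ms <= max_latency_ms
    ]
    return min(ok, key=lambda c: c.n_layers, default=None)


# Placeholder measurements; replace with numbers from your own runs.
measurements = [
    Candidate(n_layers=1, accuracy=0.78, latency_ms=4.0),
    Candidate(n_layers=2, accuracy=0.84, latency_ms=6.0),
    Candidate(n_layers=6, accuracy=0.86, latency_ms=15.0),
]

choice = pick_shallowest(measurements, min_accuracy=0.80, max_latency_ms=10.0)
print(choice)
```

Returning `None` when nothing qualifies makes the "keep more layers" fallback explicit rather than silently picking a model that misses a target.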
Optimization Features
Infra Optimization
- Enables deployment on memory-limited hardware by using 1–2 layer variants
Model Optimization
- Layer-wise structured pruning (top-layer dropping)
- Remove entire decoder layers to reduce parameters
System Optimization
- Smaller checkpoints lower storage and transfer costs
Training Optimization
- Prompt-based fine-tuning (cloze-style prompts) for few-shot adaptation
- Use same hyperparameters across layer sizes to isolate layer effects
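For readers unfamiliar with cloze-style prompting, here is a minimal sketch of the idea: the input is wrapped in a template that ends where a label word should appear, and the prediction is the label whose word the LM head scores highest. The template, the label words, and the example scores are hypothetical; the paper's actual prompts are not given in this summary, and in practice the scores would be the model's next-token logits.

```python
from typing import Dict

# Hypothetical SST-2-style template and label words (assumptions for illustration).
TEMPLATE = "Review: {text}\nSentiment:"
VERBALIZER: Dict[str, str] = {"negative": " terrible", "positive": " great"}


def build_prompt(text: str) -> str:
    """Wrap the raw input in the cloze template."""
    return TEMPLATE.format(text=text.strip())


def classify(next_token_scores: Dict[str, float], verbalizer: Dict[str, str]) -> str:
    """Pick the label whose label word has the highest next-token score.

    Words that are not label words are ignored, even if they score higher.
    """
    return max(
        verbalizer,
        key=lambda label: next_token_scores.get(verbalizer[label], float("-inf")),
    )


prompt = build_prompt("  A joy to watch. ")
# Made-up scores standing in for LM-head logits at the position after the prompt.
fake_scores = {" great": 2.1, " terrible": -0.3, " the": 5.0}
print(repr(prompt), classify(fake_scores, VERBALIZER))
```

Restricting the comparison to label words is what lets a language-model head act as a classifier without adding a separate classification layer.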
Inference Optimization
- Lower parameter count reduces memory footprint and compute per token
- Simpler models can reduce inference latency and hosting cost
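To put the memory claim in concrete terms, the sketch below converts the parameter counts quoted in this summary (1.6B and 112M for GPT-2 XL) into weight storage at common precisions. It covers weights only and assumes 1 GB = 10^9 bytes; activations, KV cache, and optimizer state during fine-tuning are not included.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gb(n_params: float, dtype: str = "fp16") -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts as quoted in the summary for GPT-2 XL.
full_params = 1.6e9
pruned_params = 112e6

for dtype in ("fp32", "fp16"):
    full_gb = weight_memory_gb(full_params, dtype)
    pruned_gb = weight_memory_gb(pruned_params, dtype)
    print(f"{dtype}: full {full_gb:.2f} GB, 1-layer {pruned_gb:.3f} GB")
```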
Reproducibility
Data Urls
- AGNews (public)
- EmoC (public)
- SST-2 (public)
- TREC (public)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments limited to few-shot text classification, not generation or complex reasoning.
- Only two decoder-only model families (GPT-2 XL, OPT-1.3B) were tested.
- Batch size and compute were small (batch size 1); results may vary with different training regimes.
When Not To Use
- For generative or reasoning-heavy tasks, which the experiments do not cover.
- When the task requires deep contextual or multi-hop reasoning; even sentiment-like tasks sometimes benefit from more layers.
- If you need off-the-shelf models preserving full pretrained behavior for many downstream tasks.
Failure Modes
- Over-pruning can degrade accuracy on nuanced tasks (e.g., some sentiment cases).
- Unexpected distribution shift may require deeper layers; shallow models may fail to generalize.
- Benchmarks used are relatively simple; real-world inputs may expose missing deep representations.
Core Entities
Models
- GPT-2 XL (48-layer, 1.6B)
- OPT-1.3B (24-layer, 1.3B)
Metrics
- Accuracy
Datasets
- AGNews
- EmoC (EmoContext)
- SST-2
- TREC
Context Entities
Metrics
- parameter count
- Accuracy

