Distill long-context transformers: cut inference cost ~45–58% while keeping ~90–99% of task accuracy

November 22, 2023 | 8 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.8

Citation Count

1

Authors

Nathan Brown, Ashton Williamson, Tahj Anderson, Logan Lawrence

Links

Abstract / PDF

Why It Matters For Business

Converting and then distilling long-context transformers cuts inference cost and latency substantially while keeping most accuracy, letting teams serve longer documents cheaper and on smaller hardware.

Summary TLDR

This paper tests a practical pipeline, Convert-Then-Distill, in which a pretrained transformer is converted to an efficient-attention variant that supports long inputs, pretrained on long-context corpora, then distilled into a smaller student. Across many models and tasks, the distilled efficient students keep most of the teacher's accuracy (short-context: up to ~98.6%; long-context QA: ~94–95%; long-context NER: ~97.4%) while cutting inference time (average ~45%, up to ~58%) and modestly reducing peak GPU memory. The authors release a new long-context NER dataset (GONERD) and make models and data available on Hugging Face.

Problem Statement

Long-context transformer models reduce memory use and can process many thousands of tokens, but they remain costly to host and slow to run. Knowledge distillation (KD) can compress models, but there is no clear evaluation of how KD interacts with efficient-attention architectures, or of which data and pipelines preserve long-context performance.

Main Contribution

Comprehensive evaluation of Convert-Then-Distill (convert teacher → pretrain → distill → finetune) for efficient-attention transformers on short and long-context tasks.

Release of GONERD, a new long-context Named Entity Recognition dataset, plus distilled and base models on Hugging Face.

Key Findings

Distilled efficient-attention students retain nearly all accuracy on short-context tasks.

Numbers: Up to 98.6% of teacher performance preserved (short-context GLUE/SQuAD/CoNLL-2003).

Long-context QA keeps most teacher F1 but drops more than short tasks.

Numbers: HotpotQA ~94.1% retained, TriviaQA ~95.0% retained (F1 of distilled vs teacher).

Distillation substantially reduces inference time for long sequences.

Numbers: Average 45.2% decrease; up to 57.8% (reported ≈58%) reduction in inference time on 4096-token inputs.

The student architecture change is simple: students use half as many layers as the teacher.

Numbers: Student hidden layers reduced by a factor of two, initialized from every other teacher layer.

Choice of distillation data matters; mixed long+short corpora work best.

Numbers: OSCAR+BookCorpus scores 78.9 on GLUE (best) vs 60.5 for OSCAR alone, and 67.7 on GONERD vs 38.3 for OSCAR.

The effects of conversion and KD compound, and together they can hurt long-context QA more than NER.

Numbers: Convert+KD: -49.3% inference time and -20.4% GPU memory, but -11.6% HotpotQA F1 and -15.0% TriviaQA F1 vs teacher.

Results

Inference time (long sequences, 4096 tokens)

Value: Average -45.2% across distilled efficient models; up to -57.8% (≈58%)

Baseline: Teacher efficient-attention models

Peak GPU memory (4096 tokens)

Value: Average about -2.6% across distilled efficient models; example: Longformer -220 MB

Baseline: Teacher efficient-attention models

Short-context task retention

Value: Up to 98.6% of teacher performance retained

Baseline: Teacher models on GLUE, SQuAD, CoNLL-2003

Long-context QA retention

Value: HotpotQA ~94.1% retained; TriviaQA ~95.0% retained (F1)

Baseline: Teacher long-context models

Long-context NER retention (GONERD)

Value: Up to 97.4% of teacher performance retained

Baseline: Teacher efficient-attention models

Effect of distillation data

Value: OSCAR+BookCorpus yields the best balanced performance (GLUE 78.9, GONERD 67.7)

Baseline: Other distillation datasets (mixes of OSCAR, BookCorpus (BC), and English Wikipedia (ENW))

Who Should Care

What To Try In 7 Days

Distill a converted Longformer-RoBERTa teacher into a half-depth student using every-other-layer initialization.

Use OSCAR+BookCorpus as distillation data (mix long + short contexts) and compare GLUE and a representative long-doc task.

Measure inference time and peak GPU memory on a target GPU (A100 or equivalent) at your production sequence length (e.g., 4096).
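A small harness is one way to take the measurement in the last step. The sketch below is a hypothetical helper, not the authors' benchmark script: `time_call` and `percent_change` are illustrative names, and the GPU-specific step appears only as a comment so the block runs on any machine.

```python
import statistics
import time
from typing import Callable


def time_call(fn: Callable[[], object], warmup: int = 2, repeats: int = 5) -> float:
    """Return the median wall-clock seconds of fn() over `repeats` runs."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (caches, lazy initialization)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        # On a GPU, wait for queued kernels to finish before reading the
        # clock (for example with torch.cuda.synchronize() in PyTorch).
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def percent_change(baseline: float, candidate: float) -> float:
    """Signed change of candidate relative to baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0
```

With teacher and student forward passes wrapped as zero-argument callables, `percent_change(time_call(teacher_fn), time_call(student_fn))` gives a figure comparable to the inference-time reductions reported above. In a PyTorch setup, peak memory can be read by calling `torch.cuda.reset_peak_memory_stats()` before the run and `torch.cuda.max_memory_allocated()` after it.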

Optimization Features

Token Efficiency

  • Supports extended inputs up to 4096 tokens (efficient attention)
  • Truncation applied for sequences >4096 during training/inference
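The truncation rule can be sketched on a plain list of token ids. This is a minimal illustration: `MAX_TOKENS` and `truncate_ids` are hypothetical names, and keeping the first 4096 tokens is an assumption, since this summary does not say which end of the sequence is dropped.

```python
from typing import List

MAX_TOKENS = 4096  # input limit stated above


def truncate_ids(token_ids: List[int], max_tokens: int = MAX_TOKENS) -> List[int]:
    """Keep at most `max_tokens` ids from the start of the sequence (assumed)."""
    return token_ids[:max_tokens]
```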

Model Optimization

  • Layer reduction: student uses every-other teacher layer (½ depth)
  • Architecture conversion to efficient attention (Longformer, Big Bird, LSG, Nyström)
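Every-other-layer initialization can be illustrated on a parameter dictionary. The sketch below is not the authors' code. It assumes parameter names follow an `encoder.layer.<index>.` pattern (as in Hugging Face BERT and RoBERTa checkpoints) and that the student keeps the even-indexed teacher layers; `select_every_other_layer` is a hypothetical helper.

```python
import re
from typing import Any, Dict

# Matches names such as "roberta.encoder.layer.3.output.dense.weight".
LAYER_KEY = re.compile(r"^(.*encoder\.layer\.)(\d+)(\..*)$")


def select_every_other_layer(teacher_state: Dict[str, Any], step: int = 2) -> Dict[str, Any]:
    """Build a student state dict with 1/step as many layers as the teacher.

    Teacher layer i is kept when i % step == 0 and renumbered to i // step.
    Parameters outside the layer stack (embeddings, heads) are copied as-is.
    """
    student_state: Dict[str, Any] = {}
    for name, value in teacher_state.items():
        match = LAYER_KEY.match(name)
        if match is None:
            student_state[name] = value  # not a per-layer parameter
            continue
        prefix, index, suffix = match.group(1), int(match.group(2)), match.group(3)
        if index % step == 0:
            student_state[f"{prefix}{index // step}{suffix}"] = value
    return student_state


# Toy teacher with 4 layers; strings stand in for weight tensors.
teacher = {"embeddings.word.weight": "E"}
teacher.update({f"encoder.layer.{i}.output.dense.weight": f"W{i}" for i in range(4)})
student = select_every_other_layer(teacher)
```

In the toy example the student ends up with two layers, holding the weights of teacher layers 0 and 2. Which layers the paper actually selects (even or odd indices) is not stated in this summary.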

System Optimization

  • Smaller students reduce serving latency and lower hosting costs

Training Optimization

  • Distillation with soft targets + hidden-state cosine loss (α=2, β=5, γ=1, T=2)
  • Pretrain converted teacher on long-context corpora before distillation
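The two loss terms named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' training code: the soft-target term is written as a KL divergence on temperature-softened distributions, multiplied by T squared (a common convention in knowledge distillation), and the hidden-state term as one minus cosine similarity. All function names are hypothetical. This summary does not say which of α, β, γ weights which term, so the sketch leaves the weighting out.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,  # T=2 as listed above
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


def cosine_loss(student_hidden: Sequence[float], teacher_hidden: Sequence[float]) -> float:
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)
```

Both terms are zero when the student matches the teacher exactly and grow as the outputs diverge. A training loop would combine them, together with any task loss, as a weighted sum.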

Inference Optimization

  • Reduced inference time on long inputs (avg -45.2%, up to -57.8%)
  • Slight reduction in peak GPU memory across distilled efficient students (~-2.6% avg)

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Study restricted to Convert-Then-Distill; Distill-Then-Convert not explored.
  • Distillation method is based on DistilBERT and may not be optimal for each efficient attention variant.
  • GONERD is biased toward news/legal web text (justice.gov heavy), so NER results may not generalize to other domains.
  • Some efficient models (LSG) suffered large performance drops under this distillation setup.

When Not To Use

  • When absolute top-tier QA performance on long-context benchmarks is required (note up to ~15% drop on some QA tasks).
  • If your target domain differs strongly from news/legal text without additional pretraining.
  • When you cannot afford the compute to pretrain converted teachers before distillation.

Failure Modes

  • Significant QA degradation after Convert+KD compared to teacher, especially on multi-hop or long QA.
  • Certain architectures (e.g., LSG) lose larger fractions of performance under the same distillation pipeline.
  • Overfitting to distillation corpora composition if only long or only short sequences are used.

Core Entities

Models

  • Longformer RoBERTa
  • Big Bird RoBERTa
  • LSG RoBERTa
  • Nyströmformer
  • RoBERTa
  • BERT
  • DistilBERT
  • DistilRoBERTa
  • TinyBERT
  • MobileBERT
  • ALBERT
  • XLM-R

Metrics

  • F1
  • Exact Match (EM)
  • Accuracy
  • Inference time (sec)
  • Peak GPU memory (MB)
  • Percent performance retained
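For reference, percent performance retained is the student's score expressed as a percentage of the teacher's score. The helper below is a hypothetical one-function sketch, and the scores in the usage line are made-up inputs chosen for easy arithmetic, not results from the paper.

```python
def percent_retained(student_score: float, teacher_score: float) -> float:
    """Student score as a percentage of the teacher score."""
    if teacher_score == 0:
        raise ValueError("teacher_score must be non-zero")
    return student_score * 100.0 / teacher_score


# Illustrative inputs only: a student F1 of 76.0 against a teacher F1 of 80.0.
example = percent_retained(76.0, 80.0)
```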

Datasets

  • GLUE
  • SQuAD1.1
  • HotpotQA
  • TriviaQA
  • CoNLL-2003
  • GONERD
  • OSCAR
  • BookCorpus
  • English Wikipedia

Benchmarks

  • GLUE
  • SQuAD
  • HotpotQA
  • TriviaQA
  • CoNLL-2003
  • GONERD