Overview
Production Readiness: 0.7
Novelty Score: 0.4
Cost Impact Score: 0.8
Citation Count: 1
Why It Matters For Business
Converting a pretrained transformer to efficient attention and then distilling it substantially cuts inference cost and latency while retaining most of the teacher's accuracy, so teams can serve longer documents more cheaply and on smaller hardware.
Summary TLDR
This paper tests a practical pipeline, Convert-Then-Distill, in which a pretrained transformer is converted to an efficient-attention variant that supports long inputs, pretrained on long-context corpora, and then distilled into a smaller student. Across many models and tasks, the distilled efficient students keep most of the teacher's accuracy (short-context: up to ~98.6%; long-context QA: ~94–95%; long-context NER: ~97.4%) while cutting inference time (~45% on average, up to ~58%) and modestly reducing peak GPU memory. The authors release a new long-context NER dataset (GONERD) and make models and data available on Hugging Face.
Problem Statement
Long-context transformers with efficient attention reduce memory use and can process many thousands of tokens, but they remain costly to host and slow to run. Knowledge distillation (KD) can compress models, but there has been no clear evaluation of how KD interacts with efficient-attention architectures, or of which data and pipelines preserve long-context performance.
Main Contribution
Comprehensive evaluation of Convert-Then-Distill (convert teacher → pretrain → distill → finetune) for efficient-attention transformers on short and long-context tasks.
Release of GONERD, a new long-context Named Entity Recognition dataset, plus distilled and base models on Hugging Face.
Key Findings
Distilled efficient-attention students retain nearly all accuracy on short-context tasks.
On long-context QA, students keep most of the teacher's F1 but lose more than they do on short-context tasks.
Distillation substantially reduces inference time for long sequences.
The student architecture change is simple: students use half of the teacher's layers.
Choice of distillation data matters; mixed long+short corpora work best.
The accuracy losses from conversion and from KD add up, and the combined loss hurts long-context QA more than NER.
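The finding about mixed long and short distillation data can be illustrated with a minimal sketch that interleaves two corpora. The function name `mix_corpora`, the `long_fraction` parameter, and its default of 0.5 are hypothetical; this summary does not state the mixing ratio the authors used.

```python
import random


def mix_corpora(long_docs, short_docs, long_fraction=0.5, seed=0):
    """Yield (source, document) pairs drawn from two corpora.

    While both corpora still have documents, the next one comes from the
    long corpus with probability `long_fraction` (a hypothetical setting).
    Once one corpus is exhausted, the rest of the other is yielded in order.
    """
    rng = random.Random(seed)
    sources = {"long": iter(long_docs), "short": iter(short_docs)}
    while sources:
        if len(sources) == 2:
            name = "long" if rng.random() < long_fraction else "short"
        else:
            name = next(iter(sources))
        try:
            yield name, next(sources[name])
        except StopIteration:
            # This corpus is used up; keep drawing from the other one.
            del sources[name]


# Toy stand-ins for long (book-like) and short (web-like) documents.
mixed = list(mix_corpora(["long doc 1", "long doc 2"], ["short 1", "short 2", "short 3"]))
```

Every document from both corpora appears exactly once, and the order within each corpus is preserved, so the sketch only controls how the two sources are interleaved.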
Results
Inference time (long sequences, 4096 tokens)
Peak GPU memory (4096 tokens)
Short-context task retention
Long-context QA retention
Long-context NER retention (GONERD)
Effect of distillation data
Who Should Care
What To Try In 7 Days
Distill a converted Longformer-RoBERTa teacher into a half-depth student using every-other-layer initialization.
Use OSCAR+BookCorpus as distillation data (mix long + short contexts) and compare GLUE and a representative long-doc task.
Measure inference time and peak GPU memory on a target GPU (A100 or equivalent) at your production sequence length (e.g., 4096).
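For the measurement step, a small timing harness along these lines may be enough to compare a teacher and a student. It is a sketch under assumptions: `time_inference` and `run_once` are hypothetical names, and `run_once` stands in for one forward pass of your model at the production sequence length.

```python
import statistics
import time


def time_inference(run_once, warmup=2, repeats=5):
    """Return the median wall-clock seconds of `run_once()`.

    A few warm-up calls run first so one-time costs (caching, lazy
    initialization) do not distort the timed samples.
    """
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Placeholder workload; replace with a real forward pass on a 4096-token input.
def fake_forward_pass():
    return sum(i * i for i in range(10_000))


median_seconds = time_inference(fake_forward_pass)
```

When timing on a GPU with PyTorch, `run_once` would also need to call `torch.cuda.synchronize()` before returning, and peak memory can be read with `torch.cuda.max_memory_allocated()` after a call to `torch.cuda.reset_peak_memory_stats()`.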
Optimization Features
Token Efficiency
- Supports extended inputs up to 4096 tokens (efficient attention)
- Sequences longer than 4096 tokens are truncated during training and inference
Model Optimization
- Layer reduction: student uses every-other teacher layer (½ depth)
- Architecture conversion to efficient attention (Longformer, Big Bird, LSG, Nyström)
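The layer-reduction step can be sketched as a mapping from teacher layers to student layers. The `encoder.layer.<i>.` parameter naming and the choice to keep the even-numbered teacher layers (0, 2, 4, ...) are assumptions made for illustration; the summary says only that every other layer is used.

```python
from typing import Dict


def every_other_layer_map(num_teacher_layers: int) -> Dict[int, int]:
    """Map student layer index -> teacher layer index (even teacher layers, an assumption)."""
    return {s: t for s, t in enumerate(range(0, num_teacher_layers, 2))}


def init_student_state(teacher_state: dict, num_teacher_layers: int,
                       layer_prefix: str = "encoder.layer.") -> dict:
    """Build a half-depth student state dict from a teacher state dict.

    Weights outside the layer stack (e.g. embeddings) are copied unchanged.
    Layer weights are copied only for the selected teacher layers and are
    renumbered so the student's layers run 0, 1, 2, ...
    """
    teacher_to_student = {
        t: s for s, t in every_other_layer_map(num_teacher_layers).items()
    }
    student_state = {}
    for name, weight in teacher_state.items():
        if not name.startswith(layer_prefix):
            student_state[name] = weight
            continue
        idx_str, _, suffix = name[len(layer_prefix):].partition(".")
        teacher_idx = int(idx_str)
        if teacher_idx in teacher_to_student:
            new_name = f"{layer_prefix}{teacher_to_student[teacher_idx]}.{suffix}"
            student_state[new_name] = weight
    return student_state


# Toy state dict with strings standing in for weight tensors.
toy_teacher = {
    "embeddings.word": "E",
    "encoder.layer.0.w": "a",
    "encoder.layer.1.w": "b",
    "encoder.layer.2.w": "c",
    "encoder.layer.3.w": "d",
}
toy_student = init_student_state(toy_teacher, num_teacher_layers=4)
```

For a 12-layer teacher this mapping yields a 6-layer student. A real checkpoint may use different parameter names, in which case `layer_prefix` would need to change.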
System Optimization
- Smaller students reduce serving latency and lower hosting costs
Training Optimization
- Distillation with soft targets + hidden-state cosine loss (α=2, β=5, γ=1, T=2)
- Pretrain converted teacher on long-context corpora before distillation
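The distillation objective listed above can be sketched in plain Python as a weighted sum of loss terms. Several points are assumptions: the masked-language-modeling term is included because the DistilBERT recipe uses one, the soft-target term is scaled by T² as is conventional, and the weights are passed in by name because this summary does not say which of α, β, and γ applies to which term.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def total_loss(soft, mlm, cos, w_soft, w_mlm, w_cos):
    """Weighted sum of the three terms; the weight-to-term mapping is an assumption."""
    return w_soft * soft + w_mlm * mlm + w_cos * cos


# Toy values for one token position; the MLM loss is supplied as a plain number.
soft = soft_target_loss([2.0, 0.5, -1.0], [2.5, 0.0, -1.5], temperature=2.0)
cos = cosine_loss([0.1, 0.9, -0.3], [0.2, 0.8, -0.4])
loss = total_loss(soft, mlm=1.7, cos=cos, w_soft=1.0, w_mlm=1.0, w_cos=1.0)
```

In practice these terms would be computed on batched tensors in a deep learning framework; the scalar version here only makes the arithmetic explicit.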
Inference Optimization
- Inference time on long inputs falls by 45.2% on average (up to 57.8%)
- Peak GPU memory falls slightly across distilled efficient students (about 2.6% on average)
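Percentages like these are presumably relative changes against the teacher. A minimal sketch of that arithmetic follows, using hypothetical function names and made-up numbers rather than figures from the paper.

```python
def percent_change(baseline, candidate):
    """Relative change of `candidate` versus `baseline`, in percent (negative = reduction)."""
    return (candidate - baseline) / baseline * 100.0


def percent_retained(teacher_score, student_score):
    """Share of the teacher's score that the student keeps, in percent."""
    return student_score / teacher_score * 100.0


# Made-up measurements for illustration only.
time_delta = percent_change(baseline=2.0, candidate=1.0)            # seconds per batch
f1_retained = percent_retained(teacher_score=80.0, student_score=76.0)
```

With these made-up inputs the student is 50% faster and keeps about 95% of the teacher's F1.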
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Study restricted to Convert-Then-Distill; Distill-Then-Convert not explored.
- Distillation method is based on DistilBERT and may not be optimal for each efficient attention variant.
- GONERD is biased toward news and legal web text (much of it from justice.gov), so NER results may not generalize to other domains.
- Some efficient models (LSG) suffered large performance drops under this distillation setup.
When Not To Use
- When absolute top-tier QA performance on long-context benchmarks is required (note up to ~15% drop on some QA tasks).
- If your target domain differs strongly from news/legal text without additional pretraining.
- When you cannot afford the compute to pretrain converted teachers before distillation.
Failure Modes
- Significant QA degradation after Convert+KD compared to teacher, especially on multi-hop or long QA.
- Certain architectures (e.g., LSG) lose larger fractions of performance under the same distillation pipeline.
- Overfitting to distillation corpora composition if only long or only short sequences are used.
Core Entities
Models
- Longformer RoBERTa
- Big Bird RoBERTa
- LSG RoBERTa
- Nyströmformer
- RoBERTa
- BERT
- DistilBERT
- DistilRoBERTa
- TinyBERT
- MobileBERT
- ALBERT
- XLM-R
Metrics
- F1
- Exact Match (EM)
- Accuracy
- Inference time (sec)
- Peak GPU memory (MB)
- Percent performance retained
Datasets
- GLUE
- SQuAD1.1
- HotpotQA
- TriviaQA
- CoNLL-2003
- GONERD
- OSCAR
- BookCorpus
- English Wikipedia
Benchmarks
- GLUE
- SQuAD
- HotpotQA
- TriviaQA
- CoNLL-2003
- GONERD

