Overview
The method is a modest algorithmic change (token scaling) with clear empirical gains and low overhead; results are consistent across multiple generative language models (GLMs) and public tasks.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
TSLD lets you quantize decoder LMs to 2-bit ternary weights with near full-precision quality and little extra training cost, reducing model size and inference memory while preserving reasoning accuracy.
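As a rough illustration of the size reduction, the sketch below compares weight storage at 16 bits and 2 bits per parameter. The 6.7B parameter count is borrowed from the OPT-6.7B row in Results. The helper name `weight_storage_gb` is hypothetical, and the estimate assumes every weight is quantized. It ignores embeddings or layers kept at higher precision, per-channel scale factors, activations, and the KV cache, so real savings will be smaller.

```python
def weight_storage_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: 6.7e9 parameters, all stored at the stated bit width.
params = 6.7e9
fp16_gb = weight_storage_gb(params, 16)    # about 13.4 GB
ternary_gb = weight_storage_gb(params, 2)  # about 1.7 GB
reduction = fp16_gb / ternary_gb           # 8x fewer bits per weight

print(f"FP16: {fp16_gb:.2f} GB, 2-bit: {ternary_gb:.2f} GB, ratio: {reduction:.0f}x")
```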
Summary TLDR
The paper introduces Token-Scaled Logit Distillation (TSLD): a simple, memory-light distillation method that weights logit distillation per token by the teacher's token cross-entropy. TSLD enables quantization-aware training (QAT) down to ternary (2-bit) weights for decoder language models (GPT-2, OPT, LLaMA, GPT-Neo). On evaluated tasks, ternary QAT with TSLD keeps perplexity within ~1.0 of full-precision and improves or matches reasoning and NLU accuracy versus other QAT/PTQ baselines, while adding almost no training overhead compared to plain logit distillation.
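To make "ternary (2-bit) weights" concrete, here is a minimal sketch of one common ternarization rule. It is an assumption for illustration, not necessarily the quantizer used in the paper: the threshold of 0.7 times the mean absolute weight follows the Ternary Weight Networks heuristic, and the function name `ternarize` is hypothetical. Each weight maps to a code in {-1, 0, +1} plus one shared scale `alpha`.

```python
def ternarize(weights, threshold_ratio=0.7):
    """Map weights to codes in {-1, 0, +1} with a shared scale alpha.

    Assumption: TWN-style threshold = threshold_ratio * mean(|w|).
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs
    # alpha is the mean magnitude of the weights that survive the threshold.
    kept = [abs(w) for w in weights if abs(w) > delta]
    alpha = sum(kept) / len(kept) if kept else 0.0
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    return codes, alpha


codes, alpha = ternarize([0.9, -0.05, 0.4, -0.7, 0.02])
# The dequantized weight for each position is alpha * code.
dequantized = [alpha * c for c in codes]
```

Three code values fit in 2 bits, which is where the 2-bit storage figure comes from. In quantization-aware training the rounding step is typically applied in the forward pass while gradients update a full-precision copy of the weights.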
Problem Statement
Generative decoder models accumulate quantization error unevenly across tokens because of masked causal attention, and they overfit when logit distillation is combined with a ground-truth loss. Existing QAT and PTQ methods either degrade perplexity or need high memory (layer-to-layer KD). The paper asks: can a lightweight change to distillation avoid overfitting and recover token predictions for ternary-weight GLMs?
Main Contribution
Token-Scaled Logit Distillation (TSLD): scale logit KD per token by teacher token cross-entropy to reduce overfitting and emphasize uncertain tokens.
First large-scale evaluation of ternary-weight (2-bit) QAT on decoder GLMs up to ~7B parameters with <1.0 PPL degradation on evaluated benchmarks.
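The token-scaling idea in the first contribution can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary only says that logit KD is scaled per token by the teacher's token cross-entropy, so the softmax normalization of the scales across tokens, the absence of a temperature, and the name `tsld_loss` are all assumptions made here.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]


def tsld_loss(student_logits, teacher_logits, targets):
    """Token-scaled logit distillation over one sequence (illustrative).

    student_logits, teacher_logits: one list of vocabulary logits per token.
    targets: ground-truth token id per position.
    """
    teacher_ce, kd = [], []
    for s, t, y in zip(student_logits, teacher_logits, targets):
        t_logp, s_logp = log_softmax(t), log_softmax(s)
        # Teacher cross-entropy against the ground-truth token.
        teacher_ce.append(-t_logp[y])
        # Per-token logit KD: cross-entropy between teacher and student.
        kd.append(-sum(math.exp(tp) * sp for tp, sp in zip(t_logp, s_logp)))
    # Assumption: scales are a softmax over the per-token teacher CE values,
    # so tokens where the teacher is less certain receive more weight.
    scales = [math.exp(v) for v in log_softmax(teacher_ce)]
    loss = sum(w * l for w, l in zip(scales, kd))
    return loss, scales


teacher = [[4.0, 0.0, 0.0], [0.5, 0.4, 0.3]]  # confident token, uncertain token
student = [[3.0, 0.5, 0.0], [0.2, 0.6, 0.1]]
loss, scales = tsld_loss(student, teacher, targets=[0, 0])
```

In this toy sequence the second token, where the teacher is close to uniform, gets the larger scale. Note that only logits are needed, with no intermediate activations, which is consistent with the low memory overhead the summary describes.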
Key Findings
TSLD keeps PPL degradation under 1.0 vs full-precision on evaluated models with ternary weights.
TSLD improves downstream reasoning and QA accuracy compared to plain logit distillation.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity (PTB) | OPT-6.7B TSLD 11.00 | OPT-6.7B FP16 10.21 | +0.79 | PTB | Table 1 (PPL comparison) | Table 1 |
| Perplexity (PTB) | GPT-2 0.1B TSLD 19.95 | GPT-2 0.1B FP16 20.91 | -0.96 | PTB | Table 1 shows TSLD can even improve PPL for some small GPT-2 sizes | Table 1 |
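The Delta column can be rechecked with a few lines of arithmetic. The sketch below uses only the two rows shown above; the 1.0 PPL budget comes from the Key Findings, and the variable names are illustrative.

```python
# (model, TSLD perplexity, FP16 perplexity) taken from the table rows above.
rows = [
    ("OPT-6.7B", 11.00, 10.21),
    ("GPT-2 0.1B", 19.95, 20.91),
]

# Delta = quantized PPL minus full-precision PPL (negative means improvement).
deltas = {name: round(tsld - fp16, 2) for name, tsld, fp16 in rows}
within_budget = {name: d < 1.0 for name, d in deltas.items()}

for name, d in deltas.items():
    print(f"{name}: delta {d:+.2f}, under 1.0 PPL: {within_budget[name]}")
```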
What To Try In 7 Days
Run TSLD QAT on a task-fine-tuned 1–7B decoder model to test ternary-weight inference.
Compare PPL and task accuracy vs your current FP and PTQ baselines on a small validation set.
If using logit KD+GT and seeing overfitting, replace logit KD with TSLD (token-scaled weights).
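For the PPL comparison step, a minimal sketch of the calculation is shown below. It assumes you have already collected per-token negative log-likelihoods (natural log) from each model on the same validation tokens. The function names and the sample numbers are hypothetical placeholders, not results from the paper.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def ppl_gap(quantized_nlls, fp_nlls):
    """Quantized PPL minus full-precision PPL on the same tokens."""
    return perplexity(quantized_nlls) - perplexity(fp_nlls)


# Placeholder NLLs for illustration only.
fp_nlls = [2.1, 2.4, 2.3, 2.2]
ternary_nlls = [2.2, 2.5, 2.3, 2.3]
gap = ppl_gap(ternary_nlls, fp_nlls)
print(f"PPL gap vs full precision: {gap:.2f}")
```

Averaging NLL over tokens rather than over sequences keeps the number comparable across validation sets with different sequence lengths.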
Risks & Boundaries
Limitations
The experiments use A100 GPUs with pipeline parallelism; smaller setups may need extra engineering to reproduce them.
Layer-to-layer (L2L) KD still performs better on encoder models; TSLD specifically targets decoder GLMs.
When Not To Use
If you need exact full-precision behavior on every example (for instance, safety-critical outputs).
If you cannot run QAT or lack the GPUs for teacher-student training.
Failure Modes
Using plain logit KD + GT naively can cause overfitting and worse eval loss.
L2L KD can run out of GPU memory for models >1.3B on 40GB GPUs.

