Predict multiple future words and train on word-difference targets to reduce local overfitting in causal language modeling

September 5, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

1

Authors

DongNyeong Heo, Daniela Noemi Rim, Heeyoul Choi


Why It Matters For Business

Small add-on MLP heads and WDR targets can lower perplexity and raise BLEU at little parameter cost, so model quality improves without reworking the vocabulary or the core architecture.

Summary TLDR

The paper extends causal language modeling (next-word training) by (1) predicting multiple future words with small extra MLP heads (N-gram CLM), (2) using word-difference representations (WDR), the differences of adjacent word embeddings, as contextual targets, and (3) ensembling predicted embeddings at test time. Across language-modeling (PPL) and machine-translation (BLEU) benchmarks, the methods reduce perplexity and give small BLEU gains. WDR also increases gradient diversity, which the authors argue improves generalization. The changes add little parameter cost and slot into existing models without changing the vocabulary or the main loss.

Problem Statement

Next-word training (causal LM) can push models to overfit short, local word dependencies, which can hurt modeling of broader context. Prior multi-word prediction methods often require substantial changes to the architecture or the loss. The paper asks: can we predict multiple future tokens, use contextual 'difference' targets, and ensemble predictions while keeping standard CLM architectures and losses?

Main Contribution

Simple N-gram CLM: add small MLP heads to predict future words from the same encoded state, re-using the original logit/vocabulary.
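The idea of extra heads that share one logit layer can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the toy sizes, the two-layer ReLU head, the random weights, and the equal weighting of the two loss terms are all assumptions made for this example.

```python
import math
import random

random.seed(0)
HIDDEN, VOCAB = 4, 6  # toy sizes, chosen only for illustration

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def mlp_head(w1, w2, state):
    """Two-layer MLP with ReLU (assumed shape) mapping the encoded state to a new state."""
    hidden = [max(0.0, v) for v in matvec(w1, state)]
    return matvec(w2, hidden)

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

# One shared logit layer (VOCAB x HIDDEN), reused by every head.
logit_weights = rand_matrix(VOCAB, HIDDEN)
# One extra head for the token two steps ahead (N = 2).
head_w1 = rand_matrix(HIDDEN, HIDDEN)
head_w2 = rand_matrix(HIDDEN, HIDDEN)

def ngram_loss(state, next_token, second_token):
    """Cross-entropy for the next token plus cross-entropy for the second-next token."""
    main_logp = log_softmax(matvec(logit_weights, state))
    aux_state = mlp_head(head_w1, head_w2, state)
    aux_logp = log_softmax(matvec(logit_weights, aux_state))  # same logit layer
    return -main_logp[next_token] - aux_logp[second_token]

encoded_state = [0.3, -0.1, 0.8, 0.2]  # stand-in for the decoder output at step t
loss = ngram_loss(encoded_state, next_token=2, second_token=5)
print(round(loss, 4))
```

Because both terms go through `logit_weights`, the only new parameters in this sketch are the two small head matrices.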

Word Difference Representation (WDR): use differences of contiguous embedding vectors as contextual, reversible target representations.
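To make "reversible" concrete, here is a small sketch that assumes the simplest form of the target, d_t = e_{t+1} - e_t, over adjacent embedding vectors. The function names and toy vectors are illustrative only; the paper's exact formulation may differ.

```python
def word_differences(embeddings):
    """Return d_t = e_{t+1} - e_t for each adjacent pair of embedding vectors."""
    return [
        [b - a for a, b in zip(cur, nxt)]
        for cur, nxt in zip(embeddings, embeddings[1:])
    ]

def reconstruct(first_embedding, differences):
    """Invert word_differences: rebuild every embedding by cumulative addition."""
    rebuilt = [list(first_embedding)]
    for diff in differences:
        rebuilt.append([a + d for a, d in zip(rebuilt[-1], diff)])
    return rebuilt

# Toy embeddings for a 4-token sequence (values chosen for illustration).
sequence_embeddings = [
    [1.0, 0.0],
    [0.5, 2.0],
    [0.5, -1.0],
    [3.0, 1.0],
]
diffs = word_differences(sequence_embeddings)
restored = reconstruct(sequence_embeddings[0], diffs)
print(diffs)
print(restored == sequence_embeddings)
```

The same word yields a different target depending on its neighbor, which is one way to read the "contextual" part of the claim.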

Ensemble method: combine multiple predicted embeddings (from different past time-steps) before the logit to refine next-word prediction at test time.
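The ensemble step might look like the following sketch. It assumes a convex blend controlled by λ between the current step's predicted embedding and the average of predictions made for the same position at earlier steps; the blend rule, function names, and toy values are assumptions, not details confirmed by this summary.

```python
def ensemble_embedding(main_pred, past_preds, lam):
    """Blend the current prediction with the average of earlier predictions."""
    if not past_preds:
        return list(main_pred)
    dim = len(main_pred)
    avg_past = [sum(p[i] for p in past_preds) / len(past_preds) for i in range(dim)]
    return [(1.0 - lam) * m + lam * a for m, a in zip(main_pred, avg_past)]

def logits(embedding_table, vec):
    """Score every vocabulary row against the (blended) predicted embedding."""
    return [sum(w * x for w, x in zip(row, vec)) for row in embedding_table]

# Toy 3-word vocabulary with 2-dimensional embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
main_pred = [1.0, 0.0]     # prediction for the next word from the current step
past_preds = [[0.0, 1.0]]  # prediction for the same word made one step earlier
blended = ensemble_embedding(main_pred, past_preds, lam=0.5)
scores = logits(embedding_table, blended)
print(blended, scores)
```

With `lam=0.0` the function returns the main prediction unchanged, so the baseline is recovered as a special case.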

Key Findings

N-gram methods reduce perplexity on standard CLM benchmarks.

Numbers: TT baseline PTB PPL 55.0 → TT+WDR ensemble 44.4 (−10.6)

WDR gives consistent gains over simple N-gram targets.

Numbers: RF baseline PTB PPL 28.0 → RF+WDR ensemble 25.9 (−2.1)

Small but consistent BLEU improvements in NMT when applying N-gram+WDR and ensemble.

Numbers: IWSLT En→De BLEU 27.6 → 28.3 (+0.7); De→En BLEU 32.5 → 34.0 (+1.5)

WDR training increases gradient diversity during training.
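This summary does not say how gradient diversity is computed. A minimal sketch, assuming one common ratio definition (the sum of squared per-example gradient norms divided by the squared norm of the summed gradient); the function name and toy gradients are illustrative.

```python
def gradient_diversity(per_example_grads):
    """Sum of squared gradient norms divided by the squared norm of their sum."""
    sum_sq_norms = sum(sum(g * g for g in grad) for grad in per_example_grads)
    dim = len(per_example_grads[0])
    total = [sum(grad[i] for grad in per_example_grads) for i in range(dim)]
    sq_norm_of_sum = sum(t * t for t in total)
    if sq_norm_of_sum == 0.0:
        return float("inf")  # gradients cancel exactly
    return sum_sq_norms / sq_norm_of_sum

aligned = [[1.0, 0.0], [1.0, 0.0]]     # identical gradients
orthogonal = [[1.0, 0.0], [0.0, 1.0]]  # gradients pointing in different directions
print(gradient_diversity(aligned), gradient_diversity(orthogonal))
```

Under this definition, identical gradients give the lowest value and more varied gradients give a higher one.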

Results

Perplexity (PTB, TT baseline → TT+WDR ensemble)

Value: 55.0 → 44.4

Baseline: TT baseline 55.0

Perplexity (PTB, RF baseline → RF+WDR ensemble)

Value: 28.0 → 25.9

Baseline: RF baseline 28.0

Perplexity (W103, TT baseline → TT+WDR ensemble)

Value: 20.1 → 16.9

Baseline: TT baseline 20.1

BLEU (IWSLT14 En→De, TF baseline → TF+WDR ensemble)

Value: 27.6 → 28.3

Baseline: TF baseline 27.6

BLEU (IWSLT14 De→En, TF baseline → TF+WDR ensemble)

Value: 32.5 → 34.0

Baseline: TF baseline 32.5
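Absolute deltas are hard to compare across metrics with different scales, so the reported pairs can also be read as relative changes. The snippet below only does arithmetic on the numbers listed above; the dictionary keys are labels made up for this example.

```python
# (baseline, with WDR ensemble) pairs copied from the results above.
results = {
    "PTB PPL (TT)": (55.0, 44.4),
    "PTB PPL (RF)": (28.0, 25.9),
    "W103 PPL (TT)": (20.1, 16.9),
    "IWSLT14 En-De BLEU (TF)": (27.6, 28.3),
    "IWSLT14 De-En BLEU (TF)": (32.5, 34.0),
}

def relative_change(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0

changes = {name: relative_change(b, n) for name, (b, n) in results.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Lower is better for perplexity and higher is better for BLEU, so the negative PPL changes and positive BLEU changes are all improvements. The largest relative gain is on PTB with the TT baseline, roughly a 19% perplexity reduction.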

Who Should Care

What To Try In 7 Days

Add 1–2 small MLP heads to your decoder/CLM to predict 1–2 future tokens (N=2 or 3).

Implement WDR targets (embedding differences) for those heads, and detach the conjugate term so that no gradient flows through it.

At test time, ensemble the predicted embeddings with λ≈0.3–0.5 and compare PPL/BLEU against the baseline.
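The λ comparison can be organized as a small sweep on a validation set. In the sketch below, `perplexity` uses the standard definition (exponential of the mean negative log-likelihood), while `best_lambda` and `toy_validation_ppl` are hypothetical helpers; the toy objective is a placeholder, not a result from the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def best_lambda(evaluate, candidates):
    """Return (lambda, score) with the lowest score among the candidates."""
    scored = [(evaluate(lam), lam) for lam in candidates]
    score, lam = min(scored)
    return lam, score

# Sanity check: a model that gives every token probability 0.25 has PPL of about 4.
uniform_ppl = perplexity([math.log(0.25)] * 8)

# Placeholder objective, NOT real results: replace it with a function that runs
# the ensembled model on held-out data and returns its validation perplexity.
def toy_validation_ppl(lam):
    return 50.0 + 100.0 * (lam - 0.4) ** 2

chosen, score = best_lambda(toy_validation_ppl, [0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
print(round(uniform_ppl, 6), chosen, score)
```

Including `0.0` among the candidates keeps the un-ensembled baseline in the same comparison.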

Optimization Features

Model Optimization

  • Auxiliary MLP heads act as regularizers
  • WDR creates contextual, diverse targets

Training Optimization

  • Detaching conjugate term avoids recursive logit updates
  • WDR increases gradient diversity (better stochastic generalization)
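One way to read the detachment bullet, written as a worked equation. The notation is not from this summary and rests on assumptions: a head predicts a difference \hat{d}_{t+1} whose target is e(w_{t+2}) - e(w_{t+1}), the known embedding e(w_{t+1}) (the conjugate term) is added back before the shared logit layer, and sg[·] denotes stop-gradient.

```latex
\[
\hat{e}_{t+2} \;=\; \hat{d}_{t+1} \;+\; \operatorname{sg}\!\bigl[e(w_{t+1})\bigr],
\qquad
\frac{\partial\, \operatorname{sg}\!\bigl[e(w_{t+1})\bigr]}{\partial\, e(w_{t+1})} \;=\; 0
\]
```

Under these assumptions, the loss on w_{t+2} still trains the head through \hat{d}_{t+1}, but it cannot move e(w_{t+1}) through the added term, which matches the stated goal of avoiding unintended updates to the logit embeddings.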

Inference Optimization

  • Ensemble predicted embeddings before logit to refine scores (controlled by λ)

Reproducibility

Data Urls

  • PTB, WikiText-2, Text8, WikiText-103 (public datasets referenced)
  • IWSLT14, WMT14, WMT18 translation datasets (public)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • WDR gains are less consistent outside CLM (mixed results on MLM/GLUE).
  • Ensemble effect diminishes on very large datasets (Text8, WikiText-103).
  • Method requires detaching conjugate term; improper detachment can harm learning.

When Not To Use

  • If you already train at massive scale where ensemble gains vanish.
  • On masked-language tasks without careful WDR masking (paper reports mixed MLM results).

Failure Modes

  • Not detaching the conjugate term can let the logit embeddings be driven incorrectly.
  • Poor choice of λ can degrade next-word prediction versus baseline.
  • Additional MLP heads add compute and modest parameter cost; may hurt very tight budgets.

Core Entities

Models

  • Transformer (TF)
  • Tensorized Transformer (TT)
  • Reformer (RF)
  • CrammedBERT
  • Bag-of-words NMT (BOW NMT)

Metrics

  • Perplexity (PPL)
  • BLEU
  • Gradient Diversity (GD)
  • GLUE average

Datasets

  • Penn TreeBank (PTB)
  • WikiText-2
  • Text8
  • WikiText-103
  • IWSLT14 En-De
  • WMT14 En-De
  • WMT18 En-Tr

Benchmarks

  • Perplexity (word-level PPL)
  • BLEU (SacreBLEU)
  • GLUE (for masked LM ablation)