MorphPiece: a morpheme-aware tokenizer that improves LM and embedding quality

July 14, 2023 (6 min read)

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

2

Authors

Haris Jabbar

Why It Matters For Business

MorphPiece yields better language modeling and embedding quality without changing model architecture, which can improve search, classification, and prediction pipelines but increases token counts and compute.

Summary TLDR

MorphPiece is a hybrid tokenizer that uses a curated morpheme lookup (MorphTable) plus BPE for unseen words. A GPT-2–style model trained with MorphPiece (MorphGPT) shows lower perplexity, higher LAMBADA accuracy, and markedly better embedding performance on MTEB versus GPT-2, despite fewer training steps. Trade-offs: ~17% more tokens (higher compute) and added detokenization complexity.

Problem Statement

Mainstream subword tokenizers such as BPE segment words using corpus statistics alone and ignore linguistic morphology. The resulting splits can cut across morpheme boundaries, which makes language modeling and downstream tasks harder. The paper asks whether building explicit morpheme segmentation into the tokenizer improves pretraining and downstream performance.

Main Contribution

MorphPiece: a tokenizer combining a morpheme lookup table (MorphTable) and BPE for unseen words.

MorphGPT: a GPT-2–base model trained from scratch with MorphPiece and evaluated across language modeling, GLUE zero-shot prompts, and MTEB embeddings.

A detokenization algorithm to recompose morpheme tokens into words and an empirical comparison to BPE and FLOTA.
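
The lookup-then-fallback flow behind MorphPiece can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `MORPH_TABLE` entries, the `#` boundary markers, and the character-chunk `fallback_split` (standing in for a trained BPE model) are all assumptions made for the example.

```python
# Toy morpheme table: word -> morpheme segmentation (hypothetical entries).
MORPH_TABLE = {
    "batting": ["bat", "#ing"],
    "unhappiness": ["un#", "happy", "#ness"],
    "paratrooper": ["para#", "troop", "#er"],
}


def fallback_split(word, chunk=3):
    """Stand-in for BPE: split an unseen word into fixed-size character chunks."""
    return [word[i:i + chunk] for i in range(0, len(word), chunk)]


def tokenize(text):
    """Use the morpheme table when a word is listed, otherwise fall back."""
    tokens = []
    for word in text.lower().split():
        if word in MORPH_TABLE:
            tokens.extend(MORPH_TABLE[word])
        else:
            tokens.extend(fallback_split(word))
    return tokens


print(tokenize("Batting unhappiness zzyzx"))
# ['bat', '#ing', 'un#', 'happy', '#ness', 'zzy', 'zx']
```

Words found in the table get linguistically motivated pieces, while anything else takes the statistical path, so coverage of the table decides how often the fallback runs.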

Key Findings

MorphGPT lowers token-level perplexity vs GPT-2 on standard LM benchmarks.

Numbers: PennTreeBank perplexity 61.86 -> 38.25 (Morph200, 200k steps)

MorphGPT improves LAMBADA last-word accuracy.

Numbers: LAMBADA accuracy 46.88% -> 58.50% (+11.62 points absolute)

Sequence embedding quality rises across MTEB tasks.

Numbers: Classification accuracy 0.459 -> 0.537 (+17%); Clustering V-measure 0.124 -> 0.239 (+92.7%)

Retrieval metrics improve substantially on the evaluated datasets.

Numbers: Recall@100 0.051 -> 0.122 (+139%)

MorphPiece increases token count (fertility) versus BPE.

Numbers: MorphPiece produces ~17% longer token sequences than BPE

MorphGPT outperforms FLOTA-enhanced GPT-2 on ArXiv classification tasks.

Numbers: ArXiv-L test F1 0.536 -> 0.652 (+21.6% relative)
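
Token-level perplexity is normalized by the number of tokens, so the figure depends on how many tokens a tokenizer emits for the same text. The sketch below uses made-up negative log-likelihoods (not values from the paper) to show how one total loss yields different perplexities when divided by token count versus word count.

```python
import math


def perplexity(total_nll, count):
    """exp of the average negative log-likelihood (natural log) per unit."""
    return math.exp(total_nll / count)


# Hypothetical per-token negative log-likelihoods for one 4-word sentence.
token_nlls = [2.0, 1.0, 3.0, 2.5, 1.5]  # 5 tokens
num_words = 4

total = sum(token_nlls)  # 10.0
token_ppl = perplexity(total, len(token_nlls))
word_ppl = perplexity(total, num_words)

print(f"token-level perplexity: {token_ppl:.2f}")
print(f"word-level perplexity:  {word_ppl:.2f}")
```

When two models use tokenizers with different fertility, a word- or character-normalized figure is generally the more like-for-like comparison.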

Results

Perplexity (PennTreeBank)

Value: 38.25 (MorphPiece, 200k steps)

Baseline: 61.86 (GPT-2 Base)

LAMBADA Accuracy

Value: 58.50% (MorphPiece, 200k steps)

Baseline: 46.88% (GPT-2 Base)

MTEB Classification Accuracy

Value: 0.537 (MorphGPT)

Baseline: 0.459 (GPT-2)

MTEB Retrieval Recall@100

Value: 0.122 (MorphGPT)

Baseline: 0.051 (GPT-2)

MTEB Clustering V-measure

Value: 0.239 (MorphGPT)

Baseline: 0.124 (GPT-2)
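
The absolute and relative changes implied by these results can be recomputed with a few lines of arithmetic. The baseline/value pairs below are copied from the results listed above; the dictionary labels and the helper name are only illustrative.

```python
# (baseline, value) pairs copied from the results listed above.
RESULTS = {
    "PennTreeBank perplexity": (61.86, 38.25),
    "LAMBADA accuracy (%)": (46.88, 58.50),
    "MTEB classification accuracy": (0.459, 0.537),
    "MTEB retrieval Recall@100": (0.051, 0.122),
    "MTEB clustering V-measure": (0.124, 0.239),
}


def relative_change(baseline, value):
    """Relative change in percent, measured against the baseline."""
    return (value - baseline) / baseline * 100.0


for name, (baseline, value) in RESULTS.items():
    delta = value - baseline
    rel = relative_change(baseline, value)
    print(f"{name}: {delta:+.3f} absolute, {rel:+.1f}% relative")
```

For perplexity a negative change is the improvement; for the other metrics higher is better.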

Who Should Care

What To Try In 7 Days

Build a small MorphTable from MorphyNet, replace the tokenizer in a GPT-2 base training run, and compare perplexity on a data subset against the stock tokenizer.

Measure embedding changes: compute classification/clustering metrics on one MTEB-like dataset, using the averaged last hidden state as the sequence embedding.

Estimate cost: rerun a few inference workloads to quantify the ~17% token-count increase and its compute/memory impact.
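
For the embedding experiment, "averaged last hidden state" is read here as mean pooling over the non-padding tokens; that reading is an assumption. The plain-Python sketch below shows the operation on hypothetical hidden states. A real run would apply the same logic to the model's output tensors.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the hidden vectors of non-padding tokens.

    hidden_states: list of per-token vectors (each a list of floats).
    attention_mask: list of 1 (real token) or 0 (padding), same length.
    """
    dim = len(hidden_states[0])
    summed = [0.0] * dim
    kept = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            kept += 1
            for i, x in enumerate(vector):
                summed[i] += x
    if kept == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / kept for s in summed]


# Hypothetical last-layer states for 3 tokens (the last one is padding).
states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(states, mask))  # [2.0, 3.0]
```

Masking out padding matters when sequences are batched, since padded positions would otherwise drag the average toward arbitrary values.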

Optimization Features

Token Efficiency

  • fertility +17% (more subwords per word)
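
Fertility can be measured as the average number of tokens per whitespace-separated word. The sketch below compares two stand-in tokenizers on a tiny sample; both tokenizers and the sample text are hypothetical, and in practice the two callables would be the BPE and MorphPiece tokenizers under comparison.

```python
def fertility(tokenize, texts):
    """Average number of tokens produced per whitespace-separated word."""
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return tokens / words


# Two stand-in tokenizers (hypothetical): one token per word vs. fixed-size chunks.
def word_tokenizer(text):
    return text.split()


def chunk_tokenizer(text, size=4):
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


sample = ["tokenizers split words", "into smaller pieces"]
base = fertility(word_tokenizer, sample)
other = fertility(chunk_tokenizer, sample)
print(f"baseline fertility: {base:.2f}")
print(f"candidate fertility: {other:.2f}")
print(f"relative increase: {(other / base - 1) * 100:.0f}%")
```

A ratio of about 1.17 between the two fertilities corresponds to the ~17% figure reported here. Costs that scale linearly with sequence length grow by roughly that factor, while the quadratic self-attention term grows by about 1.17² ≈ 1.37.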

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • MorphTable built from MorphyNet does not cover all lexical families.
  • Detokenization is more complex and may fail on misspellings or noisy text.
  • MorphPiece increases token count by ~17%, raising compute and memory costs.
  • Each language needs its own MorphTable and detokenization automata.
  • Reported experiments are limited to a GPT-2–base scale; scaling behavior unknown.

When Not To Use

  • When strict token-budget or latency limits exist and ~17% more tokens are unacceptable.
  • On noisy or heavily misspelled text where detokenization may fail.
  • If you lack a reliable morpheme lexicon for your target language.

Failure Modes

  • Detokenizer mis-reconstructs words with unknown or misspelled morphemes.
  • Higher inference cost and memory pressure because of longer token sequences.
  • Coverage gaps in MorphTable lead to inconsistent tokenization across domains.
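
The detokenization failure mode can be illustrated with a toy recomposer. This is a sketch under stated assumptions, not the paper's algorithm: the `#` markers (trailing for prefixes, leading for suffixes), the `REVERSE_TABLE` entries, and the naive concatenation fallback are all hypothetical, and compounds with several stems are not handled.

```python
# Hypothetical reverse table: morpheme sequence -> surface word.
REVERSE_TABLE = {
    ("bat", "#ing"): "batting",
    ("un#", "happy", "#ness"): "unhappiness",
}


def group_words(tokens):
    """Group tokens into per-word runs using the '#' boundary markers."""
    words, current = [], []
    for tok in tokens:
        # A token attaches to the current run if it is a suffix
        # or if the previous token was a prefix.
        follows_prefix = bool(current) and current[-1].endswith("#")
        attaches = tok.startswith("#") or follows_prefix
        if current and not attaches:
            words.append(current)
            current = []
        current.append(tok)
    if current:
        words.append(current)
    return words


def detokenize(tokens):
    """Look up each morpheme run; fall back to naive concatenation if unknown."""
    out = []
    for run in group_words(tokens):
        word = REVERSE_TABLE.get(tuple(run))
        if word is None:
            word = "".join(t.strip("#") for t in run)
        out.append(word)
    return " ".join(out)


print(detokenize(["bat", "#ing", "un#", "happy", "#ness"]))  # batting unhappiness
print(detokenize(["run", "#ing"]))  # runing (run not in the table)
```

The second call shows the risk: morphemes do not always concatenate into the surface spelling, so any run missing from the lookup is reconstructed incorrectly.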

Core Entities

Models

  • MorphGPT
  • GPT-2 (Base)
  • GPT-2 (Large)

Metrics

  • Perplexity
  • Accuracy
  • V-measure
  • MRR
  • MAP
  • Recall@100
  • Spearman
  • Pearson

Datasets

  • OpenWebText
  • PennTreeBank
  • WikiText
  • LAMBADA
  • GLUE
  • MTEB
  • ArXiv title classification datasets

Benchmarks

  • GLUE
  • MTEB
  • LAMBADA