MorphPiece: a morpheme-aware tokenizer that improves LM and embedding quality

July 14, 2023 (6 min)

Overview

Decision Snapshot: Needs Validation

The idea is simple and practical: replace or augment BPE with a morpheme lookup. Results are consistent across LM and embedding benchmarks, but reported experiments use a single language (English), one model scale, and a curated MorphTable.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Haris Jabbar

Why It Matters For Business

MorphPiece yields better language modeling and embedding quality without changing model architecture, which can improve search, classification, and prediction pipelines but increases token counts and compute.

Summary TLDR

MorphPiece is a hybrid tokenizer that uses a curated morpheme lookup (MorphTable) plus BPE for unseen words. A GPT-2–style model trained with MorphPiece (MorphGPT) shows lower perplexity, higher LAMBADA accuracy, and markedly better embedding performance on MTEB versus GPT-2, despite fewer training steps. Trade-offs: ~17% more tokens (higher compute) and added detokenization complexity.

Problem Statement

Current tokenizers use corpus statistics and ignore linguistic morphology. That can create unnatural subword splits that make language modeling and downstream tasks harder. The paper asks whether adding explicit morpheme segmentation into the tokenizer improves pretraining and downstream performance.

Main Contribution

MorphPiece: a tokenizer combining a morpheme lookup table (MorphTable) and BPE for unseen words.

MorphGPT: a GPT-2–base model trained from scratch with MorphPiece and evaluated across language modeling, GLUE zero-shot prompts, and MTEB embeddings.
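The lookup-then-fallback idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `MORPH_TABLE` entries, the `#` affix marker, and all function names are hypothetical, and the character-level fallback only stands in for a trained BPE model.

```python
from typing import Callable, Dict, List

# Hypothetical toy MorphTable: word -> morpheme segmentation.
# The entries and the '#' affix markers are illustrative assumptions.
MORPH_TABLE: Dict[str, List[str]] = {
    "batting": ["bat", "#ing"],
    "unhappiness": ["un#", "happy", "#ness"],
}


def char_fallback(word: str) -> List[str]:
    """Stand-in for a trained BPE model: splits a word into characters."""
    return list(word)


def hybrid_tokenize(
    text: str,
    table: Dict[str, List[str]] = MORPH_TABLE,
    fallback: Callable[[str], List[str]] = char_fallback,
) -> List[str]:
    """Use the morpheme table when a word is listed, else the fallback."""
    tokens: List[str] = []
    for word in text.lower().split():
        if word in table:
            tokens.extend(table[word])      # morpheme lookup path
        else:
            tokens.extend(fallback(word))   # unseen-word path (BPE in the paper)
    return tokens


print(hybrid_tokenize("Batting xyz"))
```

In a real setup the fallback would be a trained BPE tokenizer, and detokenization would need the inverse mapping from morphemes back to surface forms, which is where the added complexity noted in the summary comes from.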

Key Findings

MorphGPT lowers token-level perplexity vs GPT-2 on standard LM benchmarks.

Numbers: PennTreeBank ppl 61.86 -> 38.25 (Morph200)

Practical Use: Expect a GPT-2 Base-sized LM trained with MorphPiece to model next-token distributions more effectively; use it when perplexity matters.

Evidence Ref: Table 6 (Perplexity)

MorphGPT improves LAMBADA last-word accuracy.

Numbers: LAMBADA acc 46.88% -> 58.50% (absolute +11.62 pts)

Practical Use: For tasks requiring broad-context word prediction, MorphPiece can give substantially better accuracy with similar model size.

Evidence Ref: Table 6 (LAMBADA acc)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity (PennTreeBank) | 38.25 (MorphPiece, 200k steps) | 61.86 (GPT-2 Base) | -23.61 ppl | PennTreeBank | Table 6 reports ppl 61.859 (GPT-2 Base) vs 38.251 (Morph200) | Table 6
Accuracy | 58.50% (MorphPiece, 200k steps) | 46.88% (GPT-2 Base) | +11.62 pct points | LAMBADA | Table 6 LAMBADA acc 46.88 -> 58.50 | Table 6
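The reported deltas can be recomputed from the Table 6 values quoted above. The helper names are illustrative, and the relative change is a derived figure that the source does not report.

```python
def delta(value: float, baseline: float) -> float:
    """Absolute change of value against baseline, rounded to 2 decimals."""
    return round(value - baseline, 2)


def relative_change(value: float, baseline: float) -> float:
    """Signed change as a fraction of the baseline."""
    return (value - baseline) / baseline


# Values as quoted from Table 6 in the summary above.
ppl_delta = delta(38.251, 61.859)          # perplexity: lower is better
acc_delta = delta(58.50, 46.88)            # accuracy: higher is better
ppl_rel = relative_change(38.251, 61.859)  # derived, not reported in the source

print(f"perplexity delta: {ppl_delta:+.2f} ({ppl_rel:+.1%} relative)")
print(f"LAMBADA accuracy delta: {acc_delta:+.2f} points")
```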

What To Try In 7 Days

Build a small MorphTable from MorphyNet, replace the tokenizer in a GPT-2 Base training run, and compare perplexity on a data subset.

Measure embedding changes: compute classification/clustering metrics on one MTEB-like dataset, using the averaged last hidden state as the embedding.

Estimate cost: rerun a few inference workloads to quantify the ~17% token-count increase and its compute/memory impact.
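For the embedding experiment, "averaged last hidden state" can be read as mean pooling over token vectors. The sketch below assumes padding positions are excluded through an attention mask, which the source does not specify; the function name and the plain-list representation are illustrative.

```python
from typing import List, Sequence


def mean_pool(hidden: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1.

    hidden: one vector per token (the model's last hidden state).
    mask:   1 for real tokens, 0 for padding (assumed convention).
    """
    if len(hidden) != len(mask):
        raise ValueError("hidden and mask must have the same length")
    kept = [vec for vec, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: two real tokens and one padding position.
sentence_vector = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
print(sentence_vector)
```

The resulting vector can be fed to any off-the-shelf classifier or clustering routine, so the same evaluation code works for both the baseline and the MorphPiece model.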

Optimization Features

Token Efficiency
fertility +17% (more subwords per word)
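Fertility here means tokens per whitespace-separated word, so it can be measured on any text sample. The sketch below uses two hypothetical toy tokenizers in place of GPT-2 BPE and MorphPiece. The cost multipliers are rough estimates: the quadratic figure assumes standard full self-attention, whose cost grows with the square of sequence length.

```python
from typing import Callable, Dict, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    texts = list(texts)
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    if n_words == 0:
        raise ValueError("no words in sample")
    return n_tokens / n_words


def cost_multipliers(token_increase: float) -> Dict[str, float]:
    """Rough scaling of per-sequence cost for a given fractional token increase."""
    ratio = 1.0 + token_increase
    return {"linear": ratio, "attention_quadratic": ratio ** 2}


# Hypothetical toy tokenizers, standing in for the two real ones.
def baseline_tokenize(text: str) -> List[str]:
    return text.split()


def splitting_tokenize(text: str) -> List[str]:
    out: List[str] = []
    for word in text.split():
        if len(word) > 4:
            out.extend([word[:4], word[4:]])  # split long words in two
        else:
            out.append(word)
    return out


sample = ["the cats were running quickly"]
base = fertility(sample, baseline_tokenize)
new = fertility(sample, splitting_tokenize)
increase = new / base - 1.0

print(f"toy fertility increase: {increase:.0%}")
print(cost_multipliers(0.17))  # multipliers for the ~17% figure in the summary
```

Linear terms (embedding lookups, feed-forward layers, KV-cache memory) scale with the first multiplier and full attention with the second, so the real overhead on a given workload should fall somewhere between the two.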

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

MorphTable built from MorphyNet does not cover all lexical families.

Detokenization is more complex and may fail on misspellings or noisy text.

When Not To Use

When strict token-budget or latency limits exist and ~17% more tokens are unacceptable.

On noisy or heavily misspelled text where detokenization may fail.

Failure Modes

Detokenizer mis-reconstructs words with unknown or misspelled morphemes.

Higher inference cost and memory pressure because of longer token sequences.

Core Entities

Models

MorphGPT, GPT-2 (Base), GPT-2 (Large)

Metrics

Perplexity, Accuracy, V-measure, MRR, MAP, Recall@100, Spearman, Pearson

Datasets

OpenWebText, PennTreeBank, WikiText, LAMBADA, GLUE, MTEB, ArXiv title classification datasets

Benchmarks

GLUE, MTEB, LAMBADA