EmojiLM: a seq2seq English↔Emoji translator trained on a 503K synthetic parallel corpus

November 3, 2023 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

Letian Peng, Zilong Wang, Hang Liu, Zihan Wang, Jingbo Shang

Why It Matters For Business

Emoji-aware models enable richer user-facing features (emoji translation, emoji-labeled classification, UI localization) and improve few-shot performance on emoji-related classification; the synthetic corpus offers a low-cost way to build such models.

Summary TLDR

The authors synthesize a 503.7K English↔emoji parallel corpus (Text2Emoji) using gpt-3.5-turbo and train EmojiLM, a sequence-to-sequence translator (BART-based) for bidirectional text↔emoji translation. EmojiLM beats strong baselines on emoji prediction and improves few-shot transfer for emoji-formalized classification tasks. The corpus covers ~2.3K emoji tokens, and the model powers a public demo and Chrome extension. Results are promising but limited by synthetic-data bias and cultural skew toward popular emojis.

Problem Statement

Emoji research mostly handles single-emoji prediction from text. There is no large parallel corpus for translating between full text and emoji sequences, which blocks building models that treat emojis as a compositional 'language'. The paper creates such a corpus and a translator to enable richer emoji modeling and downstream transfer.

Main Contribution

Text2Emoji: a 503.7K English↔emoji parallel corpus synthesized from gpt-3.5-turbo covering ~2.3K emoji tokens.

EmojiLM: a distilled bidirectional text↔emoji translator (BART-based) with tokenizer changes for composed emojis.

Demonstrated gains on emoji prediction and emotion tasks and notably better few-shot transfer versus standard baselines.

Public demo and a Chrome extension; code link provided.

Key Findings

Built Text2Emoji corpus with half a million parallel examples.

Numbers: 503.7K instances; 2.3K emoji vocabulary; average text length 15.18

EmojiLM improves supervised emoji prediction over baselines on TweetEval (20 labels).

Numbers: EmojiLM macro-F1 34.8 vs BART 30.8 (TweetEval Emoji)

EmojiLM gives much better few-shot performance on emoji prediction.

Numbers: Few-shot (n-way 10-shot) TweetEval: 23.8 vs BART 11.4

Human evaluators overwhelmingly prefer EmojiLM to a string-matching baseline.

Numbers: EmojiLM chosen for 88% of 200 samples vs string-matching
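
The few-shot finding above depends on how the "n-way 10-shot" training subset is drawn. As a minimal sketch, assuming the setup means 10 labeled examples per class (the exact sampling protocol is not given here), a subset could be built as follows. The function name `sample_k_shot` and the toy data are hypothetical and for illustration only.

```python
import random
from collections import defaultdict

def sample_k_shot(examples, k=10, seed=0):
    """Return k randomly chosen (text, label) pairs for every label."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < k:
            raise ValueError(f"label {label!r} has only {len(pool)} examples")
        subset.extend(rng.sample(pool, k))
    rng.shuffle(subset)
    return subset

# Toy data: 3 labels with 25 examples each (placeholders, not TweetEval).
toy = [(f"text {i}", f"label_{i % 3}") for i in range(75)]
few_shot = sample_k_shot(toy, k=10)
```

Repeating the draw with several seeds and averaging the scores is a common way to reduce the variance of such small training sets.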

Results

Text→emoji BLEU-1 (BART-L on Text2Emoji test)

Value: 34.8

TweetEval (Emoji, macro F1) - full supervision

Value: 34.8

Baseline: BART 30.8

Emoji-EX (32 labels, macro F1) - full supervision

Value: 23.5

Baseline: BART 12.1

Few-shot TweetEval (n-way 10-shot, macro F1)

Value: 23.8

Baseline: BART 11.4

Human preference vs string-matching translator

Value: 88%

Baseline: Emoji-Translate
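
Most of the results above are macro F1 scores. Macro F1 averages the per-label F1 with equal weight, so rare emoji labels count as much as frequent ones. The sketch below is a plain-Python illustration of the metric, not the evaluation script behind these numbers; the label names are placeholders.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the label never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# A classifier that always predicts the most common label.
y_true = ["heart", "heart", "heart", "joy", "fire"]
y_pred = ["heart", "heart", "heart", "heart", "heart"]
score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case accuracy is 0.6 but macro F1 is only 0.25, because the two labels the classifier never predicts each contribute an F1 of 0.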

What To Try In 7 Days

Download the code and run the demo to translate sample texts and inspect failure cases.

Fine-tune a BART model on Text2Emoji and evaluate on your emoji-related classification labels.

Replace rule-based emoji mapping in product flows with a lightweight seq2seq translator and A/B test user engagement.
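
To make the last suggestion concrete, here is a sketch of the kind of rule-based mapping a seq2seq translator would replace. It is not the Emoji-Translate baseline from the paper; the keyword table and the function name `string_match_translate` are hypothetical.

```python
import re

# Hypothetical keyword table, for illustration only.
KEYWORD_TO_EMOJI = {
    "pizza": "\U0001F355",   # pizza slice
    "love": "\u2764\ufe0f",  # red heart
    "dog": "\U0001F436",     # dog face
    "sun": "\u2600\ufe0f",   # sun
}

def string_match_translate(text, table=KEYWORD_TO_EMOJI):
    """Emit an emoji for every word that appears verbatim in the table."""
    words = re.findall(r"[a-z']+", text.lower())
    return "".join(table[w] for w in words if w in table)

literal = string_match_translate("I love my dog")
idiom = string_match_translate("I'm over the moon")
```

The literal sentence maps to a heart and a dog, while the idiom produces an empty string because no keyword matches. Inputs like the second one are where a learned translator has room to do better, and they make useful cases for an A/B test.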

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Corpus synthesized by gpt-3.5-turbo; the source of the LLM's emoji ability is not explained.
  • Corpus likely biased toward popular emojis and LLM training data, which can skew downstream models.
  • No strong gains on large supervised topic tasks, so benefits are task-dependent.
  • Composed emojis require special tokenization and may still be imperfectly handled.
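
One way to check the popularity bias noted above is to measure how much of all emoji usage is taken up by the few most frequent emojis. The sketch below assumes emoji sequences are already split into tokens; the helper name `top_k_share` and the toy data are hypothetical.

```python
from collections import Counter

def top_k_share(emoji_sequences, k=10):
    """Fraction of all emoji occurrences covered by the k most frequent emojis."""
    counts = Counter(e for seq in emoji_sequences for e in seq)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total

# Toy sequences using placeholder names instead of real emoji tokens.
sequences = [["heart", "joy"], ["heart"], ["heart", "fire"], ["joy", "heart"]]
share = top_k_share(sequences, k=1)  # "heart" covers 4 of 7 occurrences
```

Comparing this share between the training corpus, model outputs, and real user data gives a rough signal of whether the model over-uses common emojis.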

When Not To Use

  • When you already have large labeled datasets for the target task (no clear improvement).
  • When cultural nuance or balanced representation of rare emojis is critical.
  • When strict grounding to factual visual content is required; model focuses on emoji semantics, not vision.

Failure Modes

  • Bias toward popular emojis from LLM-synthesized corpus causes overuse of common emojis.
  • Ambiguous sentences may map to multiple valid emoji sequences; model picks one.
  • Composed emoji handling may mis-tokenize or lose meaning for rare combinations.
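
Composed emojis are several Unicode code points that render as one glyph, for example people joined by a zero-width joiner (U+200D) or a base emoji followed by a skin-tone modifier. The sketch below shows why naive per-code-point splitting breaks them. It is a simplified grouping rule for illustration, not the authors' tokenizer change and not a full Unicode grapheme segmenter (flags and keycaps are not handled).

```python
ZWJ = "\u200d"   # zero-width joiner
VS16 = "\ufe0f"  # variation selector-16 (emoji presentation)
SKIN_TONES = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}  # U+1F3FB..U+1F3FF

def split_emoji_units(s):
    """Group code points so ZWJ sequences and modifiers stay in one unit."""
    units = []
    join_next = False  # True right after a ZWJ: next code point joins the unit
    for ch in s:
        if units and (join_next or ch in (ZWJ, VS16) or ch in SKIN_TONES):
            units[-1] += ch
        else:
            units.append(ch)
        join_next = (ch == ZWJ)
    return units

family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man + ZWJ + woman + ZWJ + girl
thumbs = "\U0001F44D\U0001F3FD"                        # thumbs up + medium skin tone
heart = "\u2764\ufe0f"                                 # heart + VS16
units = split_emoji_units(family + thumbs + heart)
```

The input has 9 code points but only 3 visible emojis. A tokenizer that splits on code points would see the family emoji as three separate people plus joiners, which is one way meaning can be lost for rare combinations.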

Core Entities

Models

  • EmojiLM (BART-based seq2seq)
  • BART
  • T5
  • BERT
  • BERTweet
  • gpt-3.5-turbo

Metrics

  • BLEU (B1-B4)
  • BERTScore
  • Macro F1
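
For reference, BLEU-1 on an emoji sequence is the clipped unigram precision multiplied by a brevity penalty. The sketch below is a sentence-level, single-reference version over pre-split tokens. The paper's exact BLEU configuration is not given here, so treat this as an illustration of the metric rather than a reproduction of the reported score.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Sentence-level BLEU-1 for one candidate and one reference token list."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Clip each candidate token count by its count in the reference.
    clipped = sum(min(c, ref_counts[tok]) for tok, c in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * precision

# Placeholder names stand in for emoji tokens.
short = bleu1(["pizza", "heart"], ["pizza", "heart", "fire"])
repeated = bleu1(["heart", "heart", "heart"], ["heart", "pizza", "fire"])
```

The first candidate has perfect precision but is one token short, so only the brevity penalty applies. The second repeats one correct emoji, and clipping limits its credit to a single match out of three.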

Datasets

  • Text2Emoji (new, 503.7K)
  • TweetEval (Emoji subset)
  • Emoji-EX
  • Sentiment (TweetEval subset)
  • Emotion (TweetEval subset)
  • AG-News
  • DBPedia

Benchmarks

  • TweetEval
  • AG-News
  • DBPedia
  • Emoji-EX