EmojiLM: a seq2seq English↔Emoji translator trained on a 503K synthetic parallel corpus

November 3, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

Promising practical gains for emoji tasks and few-shot transfer; conclusions rest on synthetic data and targeted benchmarks, so expect domain and cultural limits.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Letian Peng, Zilong Wang, Hang Liu, Zihan Wang, Jingbo Shang

Why It Matters For Business

Emoji-aware models enable richer user-facing features (emoji translation, emoji-labeled classification, UI localization) and improve low-data performance; the synthetic corpus offers a low-cost way to build such models.

Summary TLDR

The authors synthesize Text2Emoji, a 503.7K-instance English↔emoji parallel corpus, using gpt-3.5-turbo, and train EmojiLM, a BART-based sequence-to-sequence model for bidirectional text↔emoji translation. EmojiLM outperforms BART on TweetEval emoji prediction (macro-F1 34.8 vs 30.8) and improves few-shot transfer on classification tasks whose labels are expressed as emojis. The corpus covers ~2.3K emoji tokens, and the model powers a public demo and a Chrome extension. The results are promising but limited by synthetic-data bias and a cultural skew toward popular emojis.

Problem Statement

Emoji research mostly handles single-emoji prediction from text. There is no large parallel corpus for translating between full text and emoji sequences, which blocks building models that treat emojis as a compositional 'language'. The paper creates such a corpus and a translator to enable richer emoji modeling and downstream transfer.

Main Contribution

Text2Emoji: a 503.7K English↔emoji parallel corpus synthesized from gpt-3.5-turbo covering ~2.3K emoji tokens.

EmojiLM: a distilled bidirectional text↔emoji translator (BART-based) with tokenizer changes for composed emojis.
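The summary does not describe the tokenizer changes themselves. To show why composed emojis need special handling, here is a minimal sketch, not the authors' tokenizer, that groups Unicode code points so that zero-width-joiner sequences, skin-tone modifiers, and variation selectors stay attached to their base emoji. The function name `split_emojis` is hypothetical, and flags and keycap sequences are deliberately out of scope.

```python
ZWJ = "\u200d"   # zero-width joiner: glues several emojis into one glyph
VS16 = "\ufe0f"  # variation selector-16: requests emoji presentation
SKIN_TONES = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}  # U+1F3FB..U+1F3FF


def split_emojis(seq: str) -> list[str]:
    """Split an emoji string into tokens, keeping composed emojis whole.

    Simplified illustration: handles ZWJ sequences, skin-tone modifiers and
    VS16 only. Regional-indicator flags and keycaps are not handled.
    """
    tokens: list[str] = []
    join_next = False  # True right after a ZWJ: next code point joins the token
    for ch in seq:
        if ch.isspace():
            join_next = False
            continue
        attaches = join_next or ch in (ZWJ, VS16) or ch in SKIN_TONES
        if tokens and attaches:
            tokens[-1] += ch
        else:
            tokens.append(ch)
        join_next = ch == ZWJ
    return tokens


# man + ZWJ + woman + ZWJ + girl renders as one "family" emoji (5 code points)
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
print(len(family))                               # 5
print(len(split_emojis(family + "\U0001F389")))  # 2
```

A tokenizer that splits on raw code points would see five symbols in the family emoji, which is the kind of mismatch that tokenizer changes for composed emojis would need to address.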

Key Findings

The Text2Emoji corpus contains roughly half a million parallel text↔emoji examples.

Numbers: 503.7K instances; 2.3K emoji vocab; avg text len 15.18

Practical Use: You can pre-train or fine-tune models on a large, diverse emoji dataset instead of relying only on small tweet emoji datasets.

Evidence Ref: Table 1; Section 3.1
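As a rough illustration of how the corpus statistics above could be recomputed on a local copy of the data, the sketch below counts instances, average text length, and emoji vocabulary size. It assumes text length is measured in whitespace-separated words and that each emoji side is already split into tokens; the summary states neither, and `corpus_stats` is a hypothetical helper.

```python
def corpus_stats(pairs):
    """Summarize (text, emoji_tokens) pairs.

    Assumption: text length is counted in whitespace-separated words; the
    unit behind the reported average is not stated in this summary.
    """
    instances = 0
    total_words = 0
    vocab = set()
    for text, emoji_tokens in pairs:
        instances += 1
        total_words += len(text.split())
        vocab.update(emoji_tokens)
    avg_len = total_words / instances if instances else 0.0
    return {"instances": instances, "avg_text_len": avg_len, "emoji_vocab": len(vocab)}


# Toy data, not taken from Text2Emoji.
toy = [
    ("happy birthday to you", ["🎂", "🎉"]),
    ("good night", ["🌙", "🎉"]),
]
print(corpus_stats(toy))
```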

EmojiLM improves supervised emoji prediction over baselines on TweetEval (20 labels).

Numbers: EmojiLM macro-F1 34.8 vs BART 30.8 (TweetEval Emoji)

Practical Use: Use EmojiLM-style pretraining to boost emoji-label tasks in production classifiers.

Evidence Ref: Table 4
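For readers less familiar with the metric, macro-F1 averages the per-label F1 scores with equal weight, so each of the 20 emoji labels counts the same regardless of how often it occurs. The following is a minimal pure-Python sketch of the standard definition, not the paper's evaluation script; `macro_f1` is an illustrative name.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, not TweetEval data.
gold = ["😂", "😂", "❤️", "❤️"]
pred = ["😂", "❤️", "❤️", "❤️"]
print(round(macro_f1(gold, pred), 4))  # 0.7333
```

On the reported 0-100 scale, the move from 30.8 to 34.8 is a gain of 4.0 points, or about 13% relative to the BART baseline.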

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Text→emoji BLEU-1 (BART-L on Text2Emoji test) | 34.8 | - | - | Text2Emoji test | Table 2 shows BLEU-1 34.8 for larger BART model | Table 2
TweetEval (Emoji, macro F1), full supervision | 34.8 | BART 30.8 | +4.0 | TweetEval Emoji (20 labels) | Table 4 full supervision row for EmojiLM vs BART | Table 4
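BLEU-1 in the table above measures unigram overlap between the predicted and reference emoji sequences. The sketch below implements the textbook sentence-level form (clipped unigram precision times a brevity penalty) over emoji tokens. The paper's exact BLEU configuration (corpus-level aggregation, smoothing, tokenization) is not given in this summary, so treat this as an assumption-laden illustration; `bleu1` is a hypothetical helper.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Sentence-level BLEU-1 for token lists, with a single reference."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Clip each candidate token count by its count in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * precision


# Toy example: "🎂" is predicted twice but appears once in the reference.
print(round(bleu1(["🎉", "🎂", "🎂"], ["🎂", "🎉", "🎁"]), 4))  # 0.6667
```

The function returns a value in [0, 1]; a reported score of 34.8 corresponds to 0.348 on this scale.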

What To Try In 7 Days

Download the code and run the demo to translate sample texts and inspect failure cases.

Fine-tune a BART model on Text2Emoji and evaluate on your emoji-related classification labels.

Replace rule-based emoji mapping in product flows with a lightweight seq2seq translator and A/B test user engagement.
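For the fine-tuning step in the list above, one simple way to train a single seq2seq model in both directions is to emit each parallel pair twice, once per direction, and mark the direction in the source string. This is an assumption for illustration, not the authors' documented recipe; the prefixes `text2emoji:` and `emoji2text:` and the function `make_bidirectional_examples` are hypothetical.

```python
import random


def make_bidirectional_examples(pairs, seed=0):
    """Turn (text, emoji) pairs into source/target examples for both directions.

    The direction prefixes are hypothetical markers; any consistent scheme
    that lets the model distinguish the two directions would serve.
    """
    examples = []
    for text, emoji in pairs:
        examples.append({"source": "text2emoji: " + text, "target": emoji})
        examples.append({"source": "emoji2text: " + emoji, "target": text})
    # Shuffle deterministically so the two directions are interleaved.
    random.Random(seed).shuffle(examples)
    return examples


# Toy pairs, not taken from Text2Emoji.
toy_pairs = [("happy birthday", "🎂🎉"), ("good night", "🌙")]
for example in make_bidirectional_examples(toy_pairs):
    print(example)
```

The resulting dictionaries can then be tokenized and passed to whichever seq2seq training loop you already use.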

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

The corpus is synthesized by gpt-3.5-turbo, and the paper does not explain where the LLM's emoji ability comes from.

Corpus likely biased toward popular emojis and LLM training data, which can skew downstream models.

When Not To Use

When you already have large labeled datasets for the target task (no clear improvement).

When cultural nuance or balanced representation of rare emojis is critical.

Failure Modes

Bias toward popular emojis from LLM-synthesized corpus causes overuse of common emojis.

Ambiguous sentences may map to multiple valid emoji sequences; model picks one.
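One quick way to check for the popularity bias described above is to measure how much of a model's output is covered by its most frequent emojis. The sketch below computes that top-k share over a batch of predicted emoji sequences. The function name `top_k_share` and the choice of k are illustrative assumptions, and what counts as too concentrated depends on your domain.

```python
from collections import Counter


def top_k_share(emoji_sequences, k=10):
    """Fraction of all emitted emoji tokens covered by the k most frequent ones."""
    counts = Counter(tok for seq in emoji_sequences for tok in seq)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total


# Toy predictions, not model output: "😂" accounts for 3 of 5 tokens.
predictions = [["😂", "😂", "❤️"], ["😂", "🎉"]]
print(top_k_share(predictions, k=1))  # 0.6
```

Comparing this share between model outputs and a sample of real user text from your product gives a first signal of whether common emojis are being overused.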

Core Entities

Models

EmojiLM (BART-based seq2seq), BART, T5, BERT, BERTweet, gpt-3.5-turbo

Metrics

BLEU (B1-B4), BERTScore, Macro F1

Datasets

Text2Emoji (new, 503.7K), TweetEval (Emoji subset), Emoji-EX, Sentiment (TweetEval subset), Emotion (TweetEval subset), AG-News, DBPedia

Benchmarks

TweetEval, AG-News, DBPedia, Emoji-EX