EmojiLM: a seq2seq English↔Emoji translator trained on a 503K synthetic parallel corpus

Overview

Decision Snapshot: Needs Validation

Promising practical gains for emoji tasks and few-shot transfer; conclusions rest on synthetic data and targeted benchmarks, so expect domain and cultural limits.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Letian Peng, Zilong Wang, Hang Liu, Zihan Wang, Jingbo Shang

Links

Abstract / PDF / Code

Why It Matters For Business

Emoji-aware models enable richer user-facing features (emoji translation, emoji-labeled classification, UI localization) and improve low-data performance; the synthetic corpus offers a low-cost way to build such models.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors synthesize a 503.7K English↔emoji parallel corpus (Text2Emoji) using gpt-3.5-turbo and train EmojiLM, a sequence-to-sequence translator (BART-based) for bidirectional text↔emoji translation. EmojiLM beats strong baselines on emoji prediction and improves few-shot transfer for emoji-formalized classification tasks. The corpus covers ~2.3K emoji tokens, and the model powers a public demo and Chrome extension. Results are promising but limited by synthetic-data bias and cultural skew toward popular emojis.

Problem Statement

Emoji research mostly handles single-emoji prediction from text. There is no large parallel corpus for translating between full text and emoji sequences, which blocks building models that treat emojis as a compositional 'language'. The paper creates such a corpus and a translator to enable richer emoji modeling and downstream transfer.

Main Contribution

Text2Emoji: a 503.7K English↔emoji parallel corpus synthesized from gpt-3.5-turbo covering ~2.3K emoji tokens.

EmojiLM: a distilled bidirectional text↔emoji translator (BART-based) with tokenizer changes for composed emojis.
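This summary does not say what the tokenizer changes are. As background on why composed emojis need special handling, here is a minimal Python sketch (not the authors' tokenizer) that groups code points joined by a zero-width joiner, a variation selector, or a skin-tone modifier into a single emoji token. The name `split_emoji` is hypothetical, and flag sequences built from regional indicators are deliberately left out.

```python
ZWJ = "\u200d"   # zero-width joiner, glues emojis into one composed glyph
VS16 = "\ufe0f"  # variation selector-16, requests emoji presentation


def is_skin_tone(ch: str) -> bool:
    """True for the five Fitzpatrick skin-tone modifiers."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


def split_emoji(seq: str) -> list[str]:
    """Split an emoji string into tokens, keeping composed emojis whole.

    Assumption: the input contains only emojis and whitespace.
    """
    tokens: list[str] = []
    join_next = False  # True right after a ZWJ: the next code point attaches
    for ch in seq:
        if ch.isspace():
            join_next = False
            continue
        if tokens and (join_next or ch in (ZWJ, VS16) or is_skin_tone(ch)):
            tokens[-1] += ch
        else:
            tokens.append(ch)
        join_next = ch == ZWJ
    return tokens


if __name__ == "__main__":
    family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man + woman + girl
    print(len(split_emoji(family + "\U0001F389")))
```

A generic subword tokenizer may break such a family emoji into several pieces; grouping first keeps it as one unit of meaning.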

Key Findings

Built Text2Emoji corpus with half a million parallel examples.

Numbers: 503.7K instances; 2.3K emoji vocabulary; average text length 15.18

Practical Use: You can pre-train or fine-tune models on a large, diverse emoji dataset instead of relying only on small tweet emoji datasets.

Evidence Ref: Table 1; Section 3.1

EmojiLM improves supervised emoji prediction over baselines on TweetEval (20 labels).

Numbers: EmojiLM macro-F1 34.8 vs BART 30.8 (TweetEval Emoji)

Practical Use: Use EmojiLM-style pretraining to boost emoji-label tasks in production classifiers.

Evidence Ref: Table 4
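Macro F1 is the unweighted mean of per-label F1 scores, so each of the 20 TweetEval emoji labels counts equally regardless of how often it occurs. A minimal sketch of the metric follows; the helper name `macro_f1` and the toy labels are illustrative and do not come from the paper.

```python
def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-label F1.

    If `labels` is omitted, every label seen in y_true or y_pred is scored.
    A label with no true or predicted instances contributes an F1 of 0.
    """
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = ["heart", "heart", "heart", "fire"]
    pred = ["heart", "heart", "fire", "fire"]
    print(round(macro_f1(gold, pred), 4))
```

For scale, the reported 34.8 vs 30.8 is a gain of 4.0 points absolute, which is roughly 13% relative to the BART baseline (4.0 / 30.8).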

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Text→emoji BLEU-1 (BART-L on Text2Emoji test)	34.8	—	—	Text2Emoji test	Table 2 shows BLEU-1 34.8 for larger BART model	Table 2
TweetEval (Emoji, macro F1) - full supervision	34.8	BART 30.8	+4.0	TweetEval Emoji (20 labels)	Table 4 full supervision row for EmojiLM vs BART	Table 4
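BLEU-1 is clipped unigram precision multiplied by a brevity penalty. The sketch below computes a sentence-level version over emoji tokens. It assumes a single reference and no smoothing; the paper's exact BLEU configuration is not stated in this summary, and the function name `bleu1` is illustrative. The sketch returns a value between 0 and 1, whereas the table appears to report scores on a 0 to 100 scale.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Sentence-level BLEU-1 for one candidate and one reference token list."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Clip each candidate token count by its count in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * precision


if __name__ == "__main__":
    print(bleu1(["🎉", "🎂"], ["🎉", "🎂", "🥳"]))
```

Clipping matters for emoji output: a model that repeats one popular emoji gets credit for it only as many times as it appears in the reference.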

What To Try In 7 Days

Download the code and run the demo to translate sample texts and inspect failure cases.

Fine-tune a BART model on Text2Emoji and evaluate on your emoji-related classification labels.

Replace rule-based emoji mapping in product flows with a lightweight seq2seq translator and A/B test user engagement.

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/KomeijiForce/EmojiLM

Risks & Boundaries

Limitations

Corpus synthesized by gpt-3.5-turbo; the source of the LLM's emoji ability is not explained.

Corpus likely biased toward popular emojis and LLM training data, which can skew downstream models.

When Not To Use

When you already have large labeled datasets for the target task (no clear improvement).

When cultural nuance or balanced representation of rare emojis is critical.

Failure Modes

Bias toward popular emojis from LLM-synthesized corpus causes overuse of common emojis.

Ambiguous sentences may map to multiple valid emoji sequences; model picks one.
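One way to check for the popular-emoji bias described above is to measure how much of a model's output is covered by its k most frequent emojis. This is a hedged sketch of such a check, not a procedure from the paper; `top_k_share` is a hypothetical helper and the inputs are assumed to be already split into emoji tokens.

```python
from collections import Counter


def top_k_share(outputs, k=10):
    """Fraction of all emitted emoji tokens covered by the k most frequent ones.

    `outputs` is a list of token lists, one per generated emoji sequence.
    Values near 1.0 for a small k suggest the model leans on a few emojis.
    """
    counts = Counter(tok for seq in outputs for tok in seq)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


if __name__ == "__main__":
    generated = [["😂", "🔥"], ["😂"], ["😂", "🎉"]]
    print(top_k_share(generated, k=1))
```

Comparing this share between model outputs and a reference set from your own domain gives a rough signal of whether common emojis are overused.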

Core Entities

Models

EmojiLM (BART-based seq2seq), BART, T5, BERT, BERTweet, gpt-3.5-turbo

Metrics

BLEU (B1-B4), BERTScore, Macro F1

Datasets

Text2Emoji (new, 503.7K), TweetEval (Emoji subset), Emoji-EX, Sentiment (TweetEval subset), Emotion (TweetEval subset), AG-News, DBPedia

Benchmarks

TweetEval, AG-News, DBPedia, Emoji-EX
