Learn 5-token language triggers to boost multilingual LLM accuracy by ~3.7–19.9% on Global MMLU

February 27, 2025 · 5 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Nathan Roll

Links

Abstract / PDF

Why It Matters For Business

PolyPrompt offers a low-cost way to raise non-English QA accuracy by training a few small embeddings per language instead of costly model fine-tuning.

Summary TLDR

PolyPrompt learns tiny, language-specific continuous trigger embeddings (k=5 tokens) and prepends them to each input after detecting the input language. Only the trigger embeddings are trained; the model stays frozen. On Llama 3.2 1B (base and instruct) across 15 languages in the Global MMLU benchmark, PolyPrompt gives absolute accuracy gains of roughly 3.7 to 19.9 percentage points over native and translation baselines. The approach is cheap to run (small embeddings, up to 2 epochs in the experiments), but it was tested only on multiple-choice MMLU and depends on language detection.
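The inference path described above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: `detect_language` is a hypothetical keyword-based stand-in for a real detector such as langid, the embedding width of 4 is arbitrary, and leaving the input unchanged when no trigger exists for the detected language is an assumption.

```python
from typing import Callable, Dict, List

Vector = List[float]

K = 5    # trigger length reported in the paper
DIM = 4  # toy embedding width (assumption; a real model is far wider)


def detect_language(text: str) -> str:
    """Hypothetical stand-in for a real language detector such as langid."""
    spanish_markers = {"el", "la", "es", "de"}
    words = set(text.lower().split())
    return "es" if words & spanish_markers else "en"


def prepend_trigger(
    text: str,
    token_embeddings: List[Vector],
    triggers: Dict[str, List[Vector]],
    detector: Callable[[str], str] = detect_language,
) -> List[Vector]:
    """Detect the language, look up its learned trigger, and prepend it."""
    lang = detector(text)
    # Assumption: with no trigger for this language, pass the input through as-is.
    trigger = triggers.get(lang, [])
    return trigger + token_embeddings


# Toy usage: one k=5 trigger per language, two "token" embeddings as input.
triggers = {
    "es": [[0.1] * DIM for _ in range(K)],
    "en": [[0.2] * DIM for _ in range(K)],
}
tokens = [[1.0] * DIM, [2.0] * DIM]
spanish_input = prepend_trigger("¿Qué es la fotosíntesis?", tokens, triggers)
english_input = prepend_trigger("What is photosynthesis?", tokens, triggers)
```

In a real setup the resulting sequence would be fed to the frozen model in place of the plain token embeddings.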

Problem Statement

Multilingual LLMs often underperform in non-English languages. Static or translate-then-answer prompts miss language-specific behaviors. We need a cheap, automated way to adapt prompts per language without full model fine-tuning.

Main Contribution

PolyPrompt: a dynamic autoprompting method that learns language-specific continuous trigger embeddings and applies them at inference after language detection.

A parameter-efficient training recipe that freezes the LLM and updates only trigger embeddings (k=5), demonstrating low-cost adaptation.

An evaluation on Global MMLU covering 15 diverse languages showing consistent accuracy gains over native and translation-based baselines.
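To make the "freeze the model, update only the triggers" idea concrete, here is a deliberately tiny sketch. Everything in it is a toy assumption: the frozen "model" is a fixed linear scorer over mean-pooled embeddings rather than an LLM, the loss is binary logistic, and training uses full-batch gradient descent with a hand-derived gradient. It shows only which parameters receive updates, not the paper's actual recipe.

```python
import math
from typing import List, Tuple

Vector = List[float]
Example = Tuple[List[Vector], int]  # (token embeddings, binary label)


def frozen_logit(weights: Vector, sequence: List[Vector]) -> float:
    """Frozen toy scorer: fixed weights dotted with the mean-pooled sequence."""
    n = len(sequence)
    pooled = [sum(vec[i] for vec in sequence) / n for i in range(len(weights))]
    return sum(w * p for w, p in zip(weights, pooled))


def mean_loss(weights: Vector, triggers: List[Vector], data: List[Example]) -> float:
    """Average logistic loss with the triggers prepended to every input."""
    total = 0.0
    for tokens, label in data:
        p = 1.0 / (1.0 + math.exp(-frozen_logit(weights, triggers + tokens)))
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(data)


def train_triggers(weights, triggers, data, lr=0.5, epochs=2):
    """Gradient descent on the trigger vectors only; weights are never modified."""
    triggers = [list(t) for t in triggers]  # work on a copy
    for _ in range(epochs):
        # d(loss)/d(trigger[i]) = mean over examples of (p - y) * w[i] / seq_len
        coef = 0.0
        for tokens, label in data:
            sequence = triggers + tokens
            p = 1.0 / (1.0 + math.exp(-frozen_logit(weights, sequence)))
            coef += (p - label) / len(sequence)
        coef /= len(data)
        for trig in triggers:
            for i, w in enumerate(weights):
                trig[i] -= lr * coef * w
    return triggers


weights = [1.0, -2.0, 0.5]                        # frozen "model"
init_triggers = [[0.0, 0.0, 0.0] for _ in range(5)]  # k=5 trigger slots
data = [
    ([[0.2, 0.1, 0.0]], 1),
    ([[0.0, 0.3, 0.1]], 1),
    ([[0.1, 0.0, 0.2]], 0),
]
loss_before = mean_loss(weights, init_triggers, data)
trained = train_triggers(weights, init_triggers, data)
loss_after = mean_loss(weights, trained, data)
```

With a real LLM the same split would typically be expressed by disabling gradients on all model parameters and registering only the trigger tensor with the optimizer.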

Key Findings

PolyPrompt improves multilingual multiple-choice accuracy across tested languages.

Numbers: Absolute gains reported: 3.7%–19.9% (across languages, Table 1 / Abstract)

PolyPrompt outperforms a translate-then-autoprompt pipeline on English+Spanish.

Numbers: PolyPrompt 39.3% vs Ext.Trans.+Autoprompt 31.3% (en+es avg, Table 2)

The method is parameter- and compute-efficient in the reported experiments.

Numbers: Uses k=5 trigger tokens; only triggers updated; trained up to 2 epochs; batch size 4 (Appendix B)
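A back-of-the-envelope count suggests how small the trainable footprint is. The figures below rest on two assumptions that this summary does not state: a hidden size of 2048 for Llama 3.2 1B and a total of about 1.23 billion model parameters, both taken from the model's published configuration.

```python
K = 5                  # trigger tokens per language (from the paper)
NUM_LANGUAGES = 15     # languages evaluated (from the paper)
HIDDEN_SIZE = 2048     # assumption: Llama 3.2 1B embedding width
MODEL_PARAMS = 1.23e9  # assumption: approximate total parameter count


def trigger_param_count(k: int, hidden_size: int, num_languages: int) -> int:
    """Trainable parameters when each language owns k trigger embeddings."""
    return k * hidden_size * num_languages


per_language = trigger_param_count(K, HIDDEN_SIZE, 1)             # 10240
all_languages = trigger_param_count(K, HIDDEN_SIZE, NUM_LANGUAGES)  # 153600
fraction_of_model = all_languages / MODEL_PARAMS  # roughly 0.0125% of the model
```

Under these assumptions, the full set of 15 triggers amounts to about one ten-thousandth of the frozen model's parameters.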

Results

Accuracy

Value: PolyPrompt@2epoch 50.8% vs Native 43.9%

Baseline: Native MLLM 43.9%

Accuracy

Value: PolyPrompt 39.3% vs Ext.Trans.+Autoprompt 31.3%

Baseline: Ext. Translation + Autoprompt 31.3%

Per-language gains (example languages)

Value: PolyPrompt improves es/fr/it/de by >10% (relative) in some cases

Baseline: Best non-PolyPrompt baseline per language
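Because the results mix absolute gains (percentage points) with relative ones, it can help to compute both from the reported accuracies. The helper names below are illustrative; the input numbers are the ones listed in this section.

```python
def absolute_gain(new_acc: float, baseline_acc: float) -> float:
    """Gain in percentage points."""
    return new_acc - baseline_acc


def relative_gain(new_acc: float, baseline_acc: float) -> float:
    """Gain as a percentage of the baseline accuracy."""
    return (new_acc - baseline_acc) / baseline_acc * 100.0


# PolyPrompt@2epoch vs native baseline
native_abs = round(absolute_gain(50.8, 43.9), 1)  # 6.9 points
native_rel = round(relative_gain(50.8, 43.9), 1)  # 15.7 percent

# PolyPrompt vs external translation + autoprompt (en+es average)
trans_abs = round(absolute_gain(39.3, 31.3), 1)   # 8.0 points
trans_rel = round(relative_gain(39.3, 31.3), 1)   # 25.6 percent
```

Both absolute gains fall inside the 3.7 to 19.9 point range quoted for the paper.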

Who Should Care

What To Try In 7 Days

Run langid on your multilingual inputs and log detection accuracy.

Implement a placeholder token and learn k=5 trigger embeddings per target language on a small labeled sample.

Compare PolyPrompt to your current translate-then-answer pipeline on a held-out set.
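For the first step, a small harness along these lines could log detection accuracy and the most common confusions. It is a sketch under assumptions: `toy_detector` is a hypothetical placeholder, and in practice you would likely pass a thin wrapper around a real detector instead (for langid, something like `lambda t: langid.classify(t)[0]`, since `classify` returns a language and score pair).

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def detection_report(
    samples: Iterable[Tuple[str, str]],
    detector: Callable[[str], str],
) -> Tuple[float, Counter]:
    """Return (accuracy, confusions) where confusions counts (true, predicted) pairs."""
    total = 0
    correct = 0
    confusions: Counter = Counter()
    for text, true_lang in samples:
        predicted = detector(text)
        total += 1
        if predicted == true_lang:
            correct += 1
        else:
            confusions[(true_lang, predicted)] += 1
    accuracy = correct / total if total else 0.0
    return accuracy, confusions


def toy_detector(text: str) -> str:
    """Hypothetical placeholder detector; swap in a real one for actual use."""
    return "es" if "ñ" in text or text.startswith("¿") else "en"


labeled = [
    ("¿Dónde está?", "es"),
    ("Where is it?", "en"),
    ("mañana", "es"),
    ("amanhã", "pt"),  # unsupported by the toy detector, so it is misrouted
]
accuracy, confusions = detection_report(labeled, toy_detector)
```

The confusion counts show which languages would receive the wrong trigger, which is the failure most likely to erase PolyPrompt's gains.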

Optimization Features

Token Efficiency

  • Small prompt token budget (k=5 trigger tokens per input)

Reproducibility

Data Urls

  • Global MMLU (Singh et al., 2024)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation limited to Global MMLU multiple-choice benchmark and 15 languages.
  • Language detection (langid) can fail on low-resource languages or code-switched text.
  • Experiments used small models (Llama 3.2 1B) and short training (up to 2 epochs); larger-scale behavior unknown.
  • Paper does not provide a public code link in-text; reproduction details are partial.

When Not To Use

  • When inputs are heavily code-switched or language detection is unreliable.
  • For generation or translation tasks, since only multiple-choice QA was evaluated.
  • If you require end-to-end model changes (PolyPrompt only modifies inputs via embeddings).

Failure Modes

  • Wrong language detection applies the incorrect trigger and can reduce accuracy.
  • Triggers may overfit to the benchmark distribution and not generalize to other tasks.
  • Defaulting to English triggers can hide gains for truly low-resource languages.
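One possible guard against these failure modes is to skip the trigger when detection looks unreliable and to record why, rather than silently defaulting to English. The sketch below is not from the paper: the 0.8 threshold is arbitrary, the function and reason names are hypothetical, and it assumes the detector reports a confidence normalized to the range 0 to 1.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def select_trigger(
    lang: str,
    confidence: float,
    triggers: Dict[str, List[Vector]],
    min_confidence: float = 0.8,  # arbitrary threshold (assumption)
) -> Tuple[Optional[List[Vector]], str]:
    """Return (trigger or None, reason) so fallbacks can be counted and audited."""
    if confidence < min_confidence:
        return None, "low_confidence"
    if lang not in triggers:
        return None, "no_trigger_for_language"
    return triggers[lang], "ok"


triggers = {"es": [[0.1, 0.1]] * 5, "en": [[0.2, 0.2]] * 5}
detections = [("es", 0.97), ("en", 0.55), ("yo", 0.91)]

reasons: Counter = Counter()
for lang, confidence in detections:
    trigger, reason = select_trigger(lang, confidence, triggers)
    reasons[reason] += 1
```

Tracking the fallback reasons per language makes it visible when a low-resource language is rarely routed to its own trigger.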

Core Entities

Models

  • Llama 3.2 1B Base
  • Llama 3.2 1B Instruct

Metrics

  • Accuracy

Datasets

  • Global MMLU (15 languages, Singh et al. 2024)

Benchmarks

  • Global MMLU