Small wording or punctuation changes in Japanese prompts can cut LLM accuracy roughly in half; the choice of model and the amount of in-language training data also matter.

Overview

Decision Snapshot: Needs Validation

The paper gives clear zero-shot numeric comparisons across models and prompts, but it uses limited samples, covers only Japanese tasks, and releases no code, so its conclusions are directional, not definitive.

Citations: 5

Evidence Strength: 0.60

Confidence: 0.60

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 50%

Authors

Chengguang Gan, Tatsunori Mori

Why It Matters For Business

Small prompt wording or punctuation changes can drastically change accuracy on non-English tasks; companies must test prompt variants and consider light in-language adapters to avoid surprise drops in production.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

The authors test zero-shot prompt templates on Japanese text classification (MARC-ja, JNLI, JSTS). Results show large sensitivity: GPT-4 is very accurate on some prompts but can drop ~24 percentage points on others for the same task. A LLaMA-7B model adapted with a Sino-Japanese LoRA adapter matches or exceeds larger models on some prompts. Key takeaway: test multiple prompt phrasings and consider lightweight in-language adaptation before deploying LLMs for non-major languages.

Problem Statement

Prompt wording and small grammatical changes can strongly change LLM outputs. Little is known about this sensitivity for non-dominant languages like Japanese. The paper asks how robust current LLMs are to near-synonymous Japanese prompt templates in zero-shot text classification.

Main Contribution

Systematic zero-shot evaluation of multiple LLMs on three Japanese datasets (MARC-ja, JNLI, JSTS) using five near-synonymous prompt templates.

Quantified prompt sensitivity: per-model accuracy and standard deviation across templates, including cases where accuracy falls dramatically.

Key Findings

GPT-4 accuracy on the JNLI task varied from 25.44% to 49.21% across near-synonymous templates.

Numbers: range 25.44%–49.21% (SD 9.56)

Practical Use: Do not rely on a single prompt wording for critical Japanese tasks; validate multiple templates and report worst-case accuracy.

Evidence Ref: Table 3 (JNLI row, GPT-4)

On MARC-ja (binary sentiment), GPT-4 stayed high and stable overall but still varied between 86.2% and 90.1%.

Numbers: 86.2%–90.1% (SD 1.26)

Practical Use: GPT-4 can be robust on simpler binary tasks, but even small template edits change results; monitor stability in production.

Evidence Ref: Table 3 (MARC-ja row, GPT-4)
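To make the two ranges above easier to compare, the short sketch below turns each reported minimum and maximum into an absolute gap (percentage points) and a relative drop from the best template. It uses only the GPT-4 figures quoted in the findings; the helper name `spread` is illustrative, and the standard deviations are not recomputed because the per-template accuracies are not listed here.

```python
# Reported GPT-4 accuracy ranges (percent), taken from the findings above.
reported = {
    "JNLI": (25.44, 49.21),
    "MARC-ja": (86.2, 90.1),
}


def spread(lo, hi):
    """Return (gap in percentage points, relative drop from best to worst)."""
    gap = hi - lo
    return gap, gap / hi


for task, (lo, hi) in reported.items():
    gap, rel = spread(lo, hi)
    print(f"{task}: {gap:.2f} points, {rel:.1%} relative drop from the best template")
```

On these figures the JNLI gap is close to 24 points, or nearly half of the best-template accuracy, while the MARC-ja gap is under 4 points.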

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	25.44%–49.21% (SD 9.56)	—	—	JNLI (zero-shot, 1000 samples)	Table 3, JNLI row for GPT-4	Table 3
Accuracy	86.2%–90.1% (SD 1.26)	—	—	MARC-ja (zero-shot, 1000 samples)	Table 3, MARC-ja row for GPT-4	Table 3

What To Try In 7 Days

Run an A/B across 5 prompt templates for each critical Japanese classification endpoint and record worst-case accuracy.

Add a small LoRA or adapter trained on in-language data and compare performance versus base model.

Use an output-label extraction check and monitor per-template SD; block templates with high variance from automation.
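The first and third steps above could be sketched roughly as follows. This is a minimal illustration, not the paper's evaluation code: `classify` is a hypothetical wrapper around whatever model call is in use, `max_sd` is an arbitrary threshold in percentage points, the standard deviation is taken across templates as a population SD (an assumption about how the spread is measured), and the toy templates and stub classifier exist only to make the example runnable.

```python
from statistics import pstdev
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def template_report(
    templates: Dict[str, str],
    examples: List[Example],
    classify: Callable[[str], str],
    max_sd: float = 5.0,
) -> dict:
    """Score every template on the same examples and summarize the spread.

    `classify` is a hypothetical function that takes a full prompt and
    returns a label string. `max_sd` is an illustrative threshold, not a
    value from the paper. `examples` must be non-empty.
    """
    accuracy = {}
    for name, template in templates.items():
        hits = [classify(template.format(text=text)) == gold for text, gold in examples]
        accuracy[name] = 100.0 * sum(hits) / len(hits)

    scores = list(accuracy.values())
    worst = min(accuracy, key=accuracy.get)
    sd = pstdev(scores)  # population SD across templates (an assumption)
    return {
        "accuracy": accuracy,
        "worst_template": worst,
        "worst_accuracy": accuracy[worst],
        "sd": sd,
        "stable": sd <= max_sd,
    }


if __name__ == "__main__":
    # Toy stand-in for a model: it only answers correctly for one phrasing.
    def fake_classify(prompt: str) -> str:
        if prompt.startswith("Classify") and "good" in prompt:
            return "positive"
        return "negative"

    templates = {
        "plain": "Classify the sentiment: {text}",
        "polite": "Please classify the sentiment: {text}",
    }
    examples = [("good product", "positive"), ("bad product", "negative")]
    print(template_report(templates, examples, fake_classify))
```

Reporting `worst_accuracy` alongside the mean keeps a single lucky template from hiding a fragile endpoint; an endpoint whose `stable` flag is false would be held back from automation until the templates are revised.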

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

JGLUE (Kurihara et al., 2022) referenced in paper

Risks & Boundaries

Limitations

Tested only on Japanese; other minor languages not evaluated.

Only zero-shot classification was tested; no few-shot or fine-tuning experiments.

When Not To Use

Do not assume a single prompt will generalize across models or languages.

Avoid deploying without template-robustness testing for multi-class or high-stakes tasks.

Failure Modes

Large accuracy swings when changing honorifics, punctuation, or sentence ordering.

Smaller models may ignore instruction-style prompts and echo input, breaking label extraction.
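One way to guard against the echo failure mode is a strict label extractor that returns nothing when the output is ambiguous, so a broken response is counted as an error instead of being silently mapped to a label. The sketch below is an assumption about how such a check might look, not something described in the paper; `extract_label` and its parameters are hypothetical names, and the substring match assumes no label is contained in another.

```python
from typing import Optional, Sequence


def extract_label(output: str, labels: Sequence[str], prompt: str = "") -> Optional[str]:
    """Return the single label found in a model output, or None if unclear.

    Assumes labels are not substrings of one another (an illustrative
    simplification). Matching is case-insensitive.
    """
    text = output.strip()
    echoed = prompt.strip()
    # If the model repeated the prompt, drop the echoed prefix so labels
    # mentioned inside the prompt itself are not mistaken for the answer.
    if echoed and text.startswith(echoed):
        text = text[len(echoed):]

    lowered = text.lower()
    found = [label for label in labels if label.lower() in lowered]
    # Exactly one label must appear; zero or several means extraction failed.
    return found[0] if len(found) == 1 else None


if __name__ == "__main__":
    labels = ["positive", "negative"]
    prompt = "Is this review positive or negative? great product"
    print(extract_label("Answer: Negative.", labels))
    print(extract_label(prompt, labels, prompt))
    print(extract_label(prompt + " positive", labels, prompt))
```

Counting `None` results per template gives a direct measure of how often a model ignores the instruction, separate from how often it answers incorrectly.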

Core Entities

Models

T5-base-japanese, LLaMA-7B, LoRA, LLaMA-13B, GPT-3.5-Turbo, GPT-4

Metrics

Accuracy, standard deviation, absolute deviation

Datasets

MARC-ja, JNLI, JSTS, JGLUE (benchmark collection)

Benchmarks

JGLUE
