Small wording or punctuation changes in Japanese prompts can cut LLM accuracy roughly in half; the choice of model and its in-language training data also matter.

May 15, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper gives clear zero-shot numeric comparisons across models and prompts, but it uses limited samples, covers only Japanese tasks, and releases no code, so its conclusions are directional, not definitive.

Citations: 5

Evidence Strength: 0.60

Confidence: 0.60

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 50%

Authors

Chengguang Gan, Tatsunori Mori

Links

Abstract / PDF / Data

Why It Matters For Business

Small prompt wording or punctuation changes can drastically change accuracy on non-English tasks; companies must test prompt variants and consider light in-language adapters to avoid surprise drops in production.

Summary TLDR

The authors test zero-shot prompt templates on Japanese text classification (MARC-ja, JNLI, JSTS). Results show large sensitivity: GPT-4 is very accurate on some prompts but can drop ~24 percentage points on others for the same task. A LLaMA-7B model adapted with a Sino-Japanese LoRA adapter matches or exceeds larger models on some prompts. Key takeaway: test multiple prompt phrasings and consider lightweight in-language adaptation before deploying LLMs for non-major languages.

Problem Statement

Prompt wording and small grammatical changes can strongly change LLM outputs. Little is known about this sensitivity for non-dominant languages like Japanese. The paper asks how robust current LLMs are to near-synonymous Japanese prompt templates in zero-shot text classification.

Main Contribution

Systematic zero-shot evaluation of multiple LLMs on three Japanese datasets (MARC-ja, JNLI, JSTS) using five near-synonymous prompt templates.

Quantified prompt sensitivity: per-model accuracy and standard deviation across templates, including cases where accuracy falls dramatically.

Key Findings

GPT-4 accuracy on the JNLI task varied from 25.44% to 49.21% across near-synonymous templates.

Numbers: range 25.44%–49.21% (SD 9.56)

Practical Use: Do not rely on a single prompt wording for critical Japanese tasks; validate multiple templates and report worst-case accuracy.

Evidence Ref: Table 3 (JNLI row, GPT-4)

On MARC-ja (binary sentiment), GPT-4 stayed high and stable overall but still varied between 86.2% and 90.1%.

Numbers: range 86.2%–90.1% (SD 1.26)

Practical Use: GPT-4 can be robust on simpler binary tasks, but even small template edits change results; monitor stability in production.

Evidence Ref: Table 3 (MARC-ja row, GPT-4)
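
The two ranges above can be restated as an absolute gap in percentage points and as a relative drop from the best template to the worst. The sketch below uses only the reported minimum and maximum accuracies; the helper name `spread` is illustrative and does not come from the paper.

```python
def spread(lo: float, hi: float) -> tuple[float, float]:
    """Return (gap in percentage points, relative drop from best to worst)."""
    gap_pp = hi - lo
    relative_drop = gap_pp / hi
    return gap_pp, relative_drop


# GPT-4 accuracy ranges across prompt templates, as reported in Table 3.
reported = {"JNLI": (25.44, 49.21), "MARC-ja": (86.2, 90.1)}

for task, (lo, hi) in reported.items():
    gap_pp, rel = spread(lo, hi)
    print(f"{task}: gap {gap_pp:.2f} pp, relative drop {rel:.1%}")
```

For JNLI this gives a gap of 23.77 points, a relative drop of about 48%, which is where the "roughly half" framing comes from. For MARC-ja the gap is 3.90 points, about 4% in relative terms.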

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 25.44%–49.21% (SD 9.56) | - | - | JNLI (zero-shot, 1000 samples) | Table 3, JNLI row for GPT-4 | Table 3
Accuracy | 86.2%–90.1% (SD 1.26) | - | - | MARC-ja (zero-shot, 1000 samples) | Table 3, MARC-ja row for GPT-4 | Table 3

What To Try In 7 Days

Run an A/B across 5 prompt templates for each critical Japanese classification endpoint and record worst-case accuracy.

Add a small LoRA or adapter trained on in-language data and compare performance versus base model.

Add an output-label extraction check and monitor the SD of accuracy across templates; keep high-variance templates out of automated pipelines.
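
The template comparison in the steps above could be sketched roughly as follows. Several things here are assumptions: `sweep_templates`, `toy_classify`, the two templates, and the two examples are all hypothetical, the stub stands in for a real model call, and the population standard deviation is used because the paper's exact SD convention is not stated here.

```python
from statistics import pstdev
from typing import Callable


def sweep_templates(
    templates: dict[str, str],
    examples: list[tuple[str, str]],
    classify: Callable[[str], str],
) -> dict:
    """Score every template on the same labeled examples and summarize stability."""
    accuracy = {}
    for name, template in templates.items():
        correct = sum(
            classify(template.format(text=text)) == gold for text, gold in examples
        )
        accuracy[name] = 100.0 * correct / len(examples)
    worst = min(accuracy, key=accuracy.get)
    return {
        "accuracy": accuracy,
        "worst_template": worst,
        "worst_accuracy": accuracy[worst],
        # Population SD across templates (an assumption, see lead-in).
        "sd": pstdev(list(accuracy.values())),
    }


def toy_classify(prompt: str) -> str:
    # Hypothetical stand-in for a model call: it answers correctly only
    # when the prompt ends with a full-width question mark.
    if prompt.endswith("？"):
        return "positive" if "良い" in prompt else "negative"
    return "negative"


# Two near-synonymous templates that differ only in the final punctuation.
templates = {
    "question_mark": "{text}\nこのレビューは肯定的ですか？",
    "plain": "{text}\nこのレビューは肯定的ですか",
}
examples = [("とても良い", "positive"), ("最悪でした", "negative")]

report = sweep_templates(templates, examples, toy_classify)
print(report)
```

With the toy stub, the report flags `plain` as the worst template at 50% accuracy against 100% for `question_mark`. In a real run, `classify` would wrap the model endpoint and `examples` would be a held-out labeled sample.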

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

JGLUE (Kurihara et al., 2022) referenced in paper

Risks & Boundaries

Limitations

Tested only on Japanese; other non-dominant languages were not evaluated.

Only zero-shot classification was tested; no few-shot or fine-tuning experiments.

When Not To Use

Do not assume a single prompt will generalize across models or languages.

Avoid deploying without template-robustness testing for multi-class or high-stakes tasks.

Failure Modes

Large accuracy swings when changing honorifics, punctuation, or sentence ordering.

Smaller models may ignore instruction-style prompts and echo input, breaking label extraction.
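
One way to guard against the echo failure mode is a strict label extractor that refuses to guess. The sketch below is a minimal illustration, not the paper's method: `extract_label` is a hypothetical helper, and it relies on simple substring matching, which assumes no label is a substring of another.

```python
from typing import Optional


def extract_label(output: str, labels: list[str], prompt: str = "") -> Optional[str]:
    """Return the single label present in the model output, or None if unclear."""
    # Drop a verbatim echo of the prompt so labels named in the instruction
    # are not mistaken for the model's answer.
    if prompt and output.startswith(prompt):
        output = output[len(prompt):]
    lowered = output.lower()
    found = [label for label in labels if label.lower() in lowered]
    # Zero labels or several labels both count as an extraction failure.
    return found[0] if len(found) == 1 else None


labels = ["entailment", "contradiction", "neutral"]
prompt = "Choose entailment, contradiction, or neutral: A. B."

print(extract_label("Answer: neutral", labels))             # one label found
print(extract_label(prompt, labels, prompt))                 # pure echo, no answer
print(extract_label(prompt + " contradiction", labels, prompt))
```

Counting `None` results separately from wrong answers makes it possible to tell a model that misclassifies from one that never produced a usable label.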

Core Entities

Models

T5-base-japanese, LLaMA-7B, LoRA, LLaMA-13B, GPT-3.5-Turbo, GPT-4

Metrics

Accuracy, standard deviation, absolute deviation

Datasets

MARC-ja, JNLI, JSTS, JGLUE (benchmark collection)

Benchmarks

JGLUE