Small wording or punctuation changes in Japanese prompts can cut LLM accuracy by half; model and language data matter.

May 15, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.3

Citation Count

5

Authors

Chengguang Gan, Tatsunori Mori

Why It Matters For Business

Small prompt wording or punctuation changes can drastically change accuracy on non-English tasks; companies must test prompt variants and consider light in-language adapters to avoid surprise drops in production.

Summary TLDR

The authors test zero-shot prompt templates on Japanese text classification (MARC-ja, JNLI, JSTS). Results show large sensitivity: GPT-4 is highly accurate with some prompts but can drop by about 24 percentage points with others on the same task. A LLaMA-7B model with a LoRA adapter trained on Chinese and Japanese data matches or exceeds larger models on some prompts. Key takeaway: test multiple prompt phrasings and consider lightweight in-language adaptation before deploying LLMs for non-dominant languages.

Problem Statement

Prompt wording and small grammatical changes can strongly change LLM outputs. Little is known about this sensitivity for non-dominant languages like Japanese. The paper asks how robust current LLMs are to near-synonymous Japanese prompt templates in zero-shot text classification.

Main Contribution

Systematic zero-shot evaluation of multiple LLMs on three Japanese datasets (MARC-ja, JNLI, JSTS) using five near-synonymous prompt templates.

Quantified prompt sensitivity: per-model accuracy and standard deviation across templates, including cases where accuracy falls dramatically.

Showed that light in-language adaptation (LLaMA-7B with LoRA trained on Chinese/Japanese data) can substantially improve performance on Japanese prompts.

Documented practical limits and suggested that prompt robustness for minor languages needs more work.

Key Findings

GPT-4 accuracy on the JNLI task varied from 25.44% to 49.21% across near-synonymous templates.

Numbers: range 25.44%–49.21% (SD 9.56)

On MARC-ja (binary sentiment), GPT-4 stayed high and stable overall but still varied between 86.2% and 90.1%.

Numbers: 86.2%–90.1% (SD 1.26)

LLaMA-7B-LoRA (trained with Chinese/Japanese data) improved Prompt-1 accuracy over base LLaMA-7B by ~28.1 points on MARC-ja.

Numbers: LLaMA-7B 54.3% vs LLaMA-7B-LoRA 82.4% (Δ ≈ +28.1)

The small pretrained language model (T5-base) often failed to follow the Q&A prompt format, yielding near-random or very low accuracy.

Numbers: MARC-ja accuracy ranged from about 3% to 49.8% across templates, with high deviation; many other tasks scored near 0%–1.7%
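The spreads behind these findings can be recomputed from the reported figures alone. The sketch below uses only the numbers quoted above; the helper names `spread` and `relative_drop` are illustrative and do not come from the paper.

```python
# Accuracy ranges reported above, as (lowest, highest) in percent.
reported_ranges = {
    "GPT-4 / JNLI": (25.44, 49.21),
    "GPT-4 / MARC-ja": (86.2, 90.1),
}

def spread(low: float, high: float) -> float:
    """Best-minus-worst accuracy across templates, in percentage points."""
    return round(high - low, 2)

def relative_drop(low: float, high: float) -> float:
    """Fraction of the best-template accuracy that is lost on the worst template."""
    return round((high - low) / high, 3)

for name, (low, high) in reported_ranges.items():
    print(f"{name}: spread {spread(low, high)} points, "
          f"relative drop {relative_drop(low, high):.1%}")

# Gain from the LoRA adapter on MARC-ja, Prompt-1 (82.4% vs 54.3%).
lora_gain = round(82.4 - 54.3, 1)
print(f"LLaMA-7B-LoRA gain: {lora_gain} points")
```

For GPT-4 on JNLI the worst template loses roughly 48% of the best template's accuracy, which is where the "cut by half" framing comes from; on MARC-ja the relative loss is only about 4%.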

Results

Accuracy (GPT-4, JNLI)

Value: 25.44%–49.21% (SD 9.56)

Accuracy (GPT-4, MARC-ja)

Value: 86.2%–90.1% (SD 1.26)

Accuracy (LLaMA-7B vs LLaMA-7B-LoRA, MARC-ja, Prompt-1)

Value: 54.3% vs 82.4% (LoRA +28.1 points)

Baseline: LLaMA-7B 54.3%

T5-base performance on classification prompts

Value: often ≤3% to ~49.8% with high deviation

Who Should Care

What To Try In 7 Days

Compare 5 prompt templates on each critical Japanese classification endpoint and record the worst-case accuracy.

Add a small LoRA or other adapter trained on in-language data and compare its performance against the base model.

Add an output-label extraction check and monitor the accuracy SD across templates; keep templates that cause high variance out of automated pipelines.
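A minimal sketch of such a template comparison follows. Everything in it is an assumption made for illustration: the two Japanese templates (which differ only by a trailing "。") are not the paper's templates, `toy_classify` stands in for a real model call, and the `max_sd` and `min_worst` thresholds are arbitrary placeholders to tune per endpoint.

```python
import statistics
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical templates; "{text}" marks the slot for the input sentence.
TEMPLATES: Dict[str, str] = {
    "t1": "次の文の感情を分類してください。\n文: {text}\n答え:",
    "t2": "次の文の感情を分類してください\n文: {text}\n答え:",
}

def evaluate_templates(
    classify: Callable[[str], Optional[str]],
    templates: Dict[str, str],
    examples: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Accuracy (0 to 1) per template; an unparseable output (None) counts as wrong."""
    scores = {}
    for name, template in templates.items():
        correct = sum(
            classify(template.format(text=text)) == gold for text, gold in examples
        )
        scores[name] = correct / len(examples)
    return scores

def summarize(scores: Dict[str, float], max_sd: float = 0.05, min_worst: float = 0.8) -> dict:
    """Worst-case accuracy, sample SD across templates, and a simple automation gate."""
    values = list(scores.values())
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    worst = min(scores, key=scores.get)
    return {
        "worst_template": worst,
        "worst_accuracy": scores[worst],
        "sd": sd,
        "safe_to_automate": sd <= max_sd and scores[worst] >= min_worst,
    }

def toy_classify(prompt: str) -> Optional[str]:
    # Stand-in for a real model: it fails whenever the instruction lacks "。".
    if "ください。" not in prompt:
        return None
    return "positive" if "良い" in prompt else "negative"

examples = [("とても良い商品", "positive"), ("すぐ壊れた", "negative")]
scores = evaluate_templates(toy_classify, TEMPLATES, examples)
report = summarize(scores)
print(scores, report)
```

Gating on the worst template rather than the mean is a deliberate choice here: an endpoint that averages well can still fail badly on the one phrasing that ends up in production.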

Reproducibility

Data URLs

  • JGLUE (Kurihara et al., 2022) referenced in paper

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Tested only on Japanese; other minor languages not evaluated.
  • Only zero-shot classification was tested; no few-shot or fine-tuning experiments.
  • Only a subset of each dataset was used (1,000 samples each) due to resource limits.
  • Model selection is representative but not exhaustive.
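The 1,000-sample subsets put a rough floor on how small an accuracy difference can be trusted. The sketch below computes the half-width of a normal-approximation (Wald) interval. It assumes independent samples, which is a simplification introduced here and not an analysis from the paper, and `accuracy_margin` is an illustrative name.

```python
import math

def accuracy_margin(accuracy_pct: float, n: int, z: float = 1.96) -> float:
    """Half-width of a ~95% normal-approximation interval, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)

# Margins for two of the reported accuracies at n = 1000.
jnli_margin = accuracy_margin(49.21, 1000)   # GPT-4, best JNLI template
marc_margin = accuracy_margin(86.2, 1000)    # GPT-4, worst MARC-ja template
print(round(jnli_margin, 1), round(marc_margin, 1))
```

Under these assumptions the margin is about ±3 points on JNLI and about ±2 points on MARC-ja. The roughly 24-point JNLI swing is far outside that band, while the 3.9-point MARC-ja spread is close enough to it that small differences between individual templates there should be read with caution.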

When Not To Use

  • Do not assume a single prompt will generalize across models or languages.
  • Avoid deploying without template-robustness testing for multi-class or high-stakes tasks.
  • Do not extrapolate these numeric results to tasks beyond sentence classification.

Failure Modes

  • Large accuracy swings when changing honorifics, punctuation, or sentence ordering.
  • Smaller models may ignore instruction-style prompts and echo input, breaking label extraction.
  • Adapter-trained models can still fail on some prompt variants, causing unexpected drops.
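The echo failure mode is one reason label extraction should be strict. The sketch below accepts an output only when exactly one known label appears after any echoed prompt has been stripped. The Japanese label strings are hypothetical renderings of the JNLI classes (entailment, contradiction, neutral), and the function assumes no label is a substring of another.

```python
from typing import Optional, Sequence

# Hypothetical label strings: entailment, contradiction, neutral.
LABELS = ("含意", "矛盾", "中立")

def extract_label(output: str, labels: Sequence[str], prompt: str = "") -> Optional[str]:
    """Return the single label found in the model output, or None if ambiguous."""
    # Drop an echoed copy of the prompt so labels listed in the
    # instruction itself are not mistaken for the model's answer.
    if prompt and output.startswith(prompt):
        output = output[len(prompt):]
    found = [label for label in labels if label in output]
    return found[0] if len(found) == 1 else None

prompt = "含意、矛盾、中立から選んでください。\n答え:"
print(extract_label("答え: 矛盾", LABELS))                 # clean answer
print(extract_label(prompt + " 中立", LABELS, prompt))     # echoed prompt, then answer
print(extract_label(prompt, LABELS, prompt))               # pure echo, no answer
```

Returning `None` instead of guessing keeps unparseable outputs visible in the accuracy numbers, so a model that merely echoes its input scores as wrong and is not credited with the first label it happens to repeat.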

Core Entities

Models

  • T5-base-japanese
  • LLaMA-7B
  • LLaMA-7B-LoRA
  • LLaMA-13B
  • GPT-3.5-Turbo
  • GPT-4

Metrics

  • Accuracy
  • Standard deviation
  • Absolute deviation

Datasets

  • MARC-ja
  • JNLI
  • JSTS
  • JGLUE (benchmark collection)

Benchmarks

  • JGLUE