Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.3
Citation Count
5
Why It Matters For Business
Small changes to prompt wording or punctuation can swing accuracy sharply on non-English tasks. Companies should test several prompt variants and consider lightweight in-language adapters to avoid unexpected accuracy drops in production.
Summary TLDR
The authors test zero-shot prompt templates on Japanese text classification (MARC-ja, JNLI, JSTS). Results show large sensitivity: GPT-4 is accurate and fairly stable on MARC-ja, but on JNLI its accuracy swings by about 24 percentage points (25.44% to 49.21%) across near-synonymous templates. A LLaMA-7B model with a LoRA adapter trained on Chinese/Japanese data matches or exceeds larger models on some prompts. Key takeaway: test multiple prompt phrasings and consider lightweight in-language adaptation before deploying LLMs for non-dominant languages.
Problem Statement
Prompt wording and small grammatical changes can strongly change LLM outputs. Little is known about this sensitivity for non-dominant languages like Japanese. The paper asks how robust current LLMs are to near-synonymous Japanese prompt templates in zero-shot text classification.
Main Contribution
Systematic zero-shot evaluation of multiple LLMs on three Japanese datasets (MARC-ja, JNLI, JSTS) using five near-synonymous prompt templates.
Quantified prompt sensitivity: per-model accuracy and standard deviation across templates, including cases where accuracy falls dramatically.
Showed that light in-language adaptation (LLaMA-7B with LoRA trained on Chinese/Japanese data) can substantially improve performance on Japanese prompts.
Documented practical limits and suggested that prompt robustness for minor languages needs more work.
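To make "light" adaptation concrete, the sketch below counts trainable parameters for a LoRA update on a single square weight matrix. It relies on the standard LoRA formulation (a frozen weight W plus a scaled low-rank product B·A) and on LLaMA-7B's hidden size of 4096. The rank of 8 is an assumption chosen for illustration; the summary does not state the rank used in the paper.

```python
# Sketch: trainable-parameter count for a LoRA update on one weight matrix.
# LoRA keeps W (d_out x d_in) frozen and learns B (d_out x r) and A (r x d_in),
# so the effective weight is W + (alpha / r) * B @ A.

def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when fine-tuning the full matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B and A."""
    return rank * (d_out + d_in)


D = 4096   # LLaMA-7B hidden size
RANK = 8   # assumed rank for illustration, not taken from the paper

full = full_params(D, D)
lora = lora_params(D, D, RANK)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Under these assumptions the adapter trains well under 1% of the parameters of each adapted matrix, which is why such adapters are cheap to train and store per language.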
Key Findings
GPT-4 accuracy on the JNLI task varied from 25.44% to 49.21% across near-synonymous templates.
On MARC-ja (binary sentiment), GPT-4 stayed high and stable overall but still varied between 86.2% and 90.1%.
LLaMA-7B-LoRA (trained with Chinese/Japanese data) improved Prompt-1 accuracy over base LLaMA-7B by ~28.1 points on MARC-ja.
The small pretrained language model (T5-base) often failed to follow the Q&A prompt format and scored near-random or very low accuracy.
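The spreads behind these findings are simple differences between the best and worst template. The short sketch below recomputes them from the minimum and maximum accuracies quoted above; only those two endpoints are available here, so it reports a range rather than a standard deviation.

```python
# Accuracy ranges (min %, max %) across templates, as quoted in the findings above.
reported_range = {
    "GPT-4 / JNLI": (25.44, 49.21),
    "GPT-4 / MARC-ja": (86.2, 90.1),
}


def spread(lo: float, hi: float) -> float:
    """Best-minus-worst accuracy in percentage points."""
    return round(hi - lo, 2)


spreads = {name: spread(lo, hi) for name, (lo, hi) in reported_range.items()}
for name, points in spreads.items():
    print(f"{name}: {points} percentage points")
```

The JNLI spread comes out at roughly 24 points, about six times the MARC-ja spread for the same model.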
Results
Accuracy
T5-base performance on classification prompts
Who Should Care
What To Try In 7 Days
Compare 5 prompt templates on each critical Japanese classification endpoint and record the worst-case accuracy.
Add a small LoRA adapter trained on in-language data and compare its performance against the base model.
Check that a valid label can be extracted from every output, monitor the standard deviation of accuracy across templates, and keep templates that cause large accuracy swings out of automated pipelines.
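The checks in this list can be sketched as a small harness. Everything named here is hypothetical: `toy_model` stands in for a real LLM call, the English templates and the `positive`/`negative` label set are placeholders for Japanese prompts and labels, and the extraction rule (exactly one label must appear in the output) is one possible choice, not the paper's method.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List, Optional, Tuple

LABELS = ("positive", "negative")  # hypothetical label set for a binary endpoint


def extract_label(output: str, labels: Tuple[str, ...] = LABELS) -> Optional[str]:
    """Return the single label named in the output, or None if zero or several appear."""
    text = output.lower()
    hits = [label for label in labels if label in text]
    return hits[0] if len(hits) == 1 else None


def evaluate_templates(
    model: Callable[[str], str],
    templates: Dict[str, str],
    examples: List[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Per-template accuracy and extraction-failure rate, both in percent."""
    report = {}
    for name, template in templates.items():
        correct = failures = 0
        for text, gold in examples:
            pred = extract_label(model(template.format(text=text)))
            if pred is None:
                failures += 1  # counted as wrong, and tracked separately
            elif pred == gold:
                correct += 1
        n = len(examples)
        report[name] = {
            "accuracy": 100 * correct / n,
            "extraction_failures": 100 * failures / n,
        }
    return report


def summarize(report: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Worst-case, mean, and population SD of accuracy across templates."""
    accs = [row["accuracy"] for row in report.values()]
    return {"worst": min(accs), "mean": mean(accs), "sd": pstdev(accs)}


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM: answers only when the prompt ends with 'Answer:'."""
    if prompt.endswith("Answer:"):
        return "positive" if "good" in prompt else "negative"
    return prompt  # echoes the input, which breaks label extraction


templates = {
    "t1": "Review: {text}\nIs this positive or negative? Answer:",
    "t2": "Review: {text}\nPositive or negative",
}
examples = [
    ("good product", "positive"),
    ("broke in a day", "negative"),
    ("really good", "positive"),
    ("bad fit", "negative"),
]

report = evaluate_templates(toy_model, templates, examples)
summary = summarize(report)
print(report)
print(summary)
```

Tracking extraction failures separately from wrong answers helps distinguish a model that misclassifies from one that ignores the prompt format. A gate on `summary["worst"]` or `summary["sd"]` could then decide whether a template is allowed into automation; any threshold would need to be chosen per task.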
Reproducibility
Data Urls
- JGLUE (Kurihara et al., 2022), referenced in the paper
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Tested only on Japanese; other minor languages not evaluated.
- Only zero-shot classification was tested; no few-shot or fine-tuning experiments.
- Only a subset of each dataset was used (1,000 samples each) due to resource limits.
- Model selection is representative but not exhaustive.
When Not To Use
- Do not assume a single prompt will generalize across models or languages.
- Avoid deploying without template-robustness testing for multi-class or high-stakes tasks.
- Do not extrapolate these numeric results to tasks beyond sentence classification.
Failure Modes
- Large accuracy swings when changing honorifics, punctuation, or sentence ordering.
- Smaller models may ignore instruction-style prompts and echo input, breaking label extraction.
- Adapter-trained models can still fail on some prompt variants, causing unexpected drops.
Core Entities
Models
- T5-base-japanese
- LLaMA-7B
- LLaMA-7B-LoRA (LLaMA-7B with a LoRA adapter)
- LLaMA-13B
- GPT-3.5-Turbo
- GPT-4
Metrics
- Accuracy
- standard deviation
- absolute deviation
Datasets
- MARC-ja
- JNLI
- JSTS
- JGLUE (benchmark collection)
Benchmarks
- JGLUE

