Overview
The paper gives clear zero-shot numeric comparisons across models and prompts, but it uses limited samples, covers only Japanese tasks, and releases no code, so its conclusions are directional rather than definitive.
Citations: 5
Evidence Strength: 0.60
Confidence: 0.60
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
Small changes in prompt wording or punctuation can drastically change accuracy on non-English tasks; companies should test prompt variants and consider light in-language adapters to avoid surprise drops in production.
Who Should Care
Summary TLDR
The authors test zero-shot prompt templates on Japanese text classification (MARC-ja, JNLI, JSTS). Results show large sensitivity: GPT-4 stays high and stable on MARC-ja, but its JNLI accuracy swings by about 24 percentage points (25.44% to 49.21%) across near-synonymous templates. A LLaMA-7B model adapted with a Sino-Japanese LoRA adapter matches or exceeds larger models on some prompts. Key takeaway: test multiple prompt phrasings and consider lightweight in-language adaptation before deploying LLMs for non-dominant languages.
Problem Statement
Prompt wording and small grammatical changes can strongly change LLM outputs. Little is known about this sensitivity for non-dominant languages like Japanese. The paper asks how robust current LLMs are to near-synonymous Japanese prompt templates in zero-shot text classification.
Main Contribution
Systematic zero-shot evaluation of multiple LLMs on three Japanese datasets (MARC-ja, JNLI, JSTS) using five near-synonymous prompt templates.
Quantified prompt sensitivity: per-model accuracy and standard deviation across templates, including cases where accuracy falls dramatically.
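The per-template accuracy and standard deviation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the toy labels, and the choice of population standard deviation (`pstdev`) are assumptions, since the summary does not say which SD variant the paper uses.

```python
from statistics import mean, pstdev


def accuracy(gold, predicted):
    """Percentage of examples whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


def template_sensitivity(gold, predictions_by_template):
    """Accuracy per template, plus mean and SD of accuracy across templates."""
    per_template = {
        name: accuracy(gold, preds)
        for name, preds in predictions_by_template.items()
    }
    values = list(per_template.values())
    return {
        "per_template": per_template,
        "mean": mean(values),
        # Assumption: population SD; swap in statistics.stdev for sample SD.
        "sd": pstdev(values),
    }


# Hypothetical toy data: four examples scored under three templates.
gold = ["pos", "neg", "pos", "neg"]
predictions_by_template = {
    "template_a": ["pos", "neg", "pos", "neg"],  # 4 of 4 correct
    "template_b": ["pos", "pos", "pos", "neg"],  # 3 of 4 correct
    "template_c": ["neg", "pos", "pos", "neg"],  # 2 of 4 correct
}
report = template_sensitivity(gold, predictions_by_template)
print(report)
```

A high `sd` relative to `mean` signals that the model's score depends heavily on which template was used.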
Key Findings
GPT-4 accuracy on the JNLI task varied from 25.44% to 49.21% across near-synonymous templates.
On MARC-ja (binary sentiment), GPT-4 stayed high and stable overall but still varied between 86.2% and 90.1%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 25.44%–49.21% (SD 9.56) | — | — | JNLI (zero-shot, 1000 samples) | Table 3, JNLI row for GPT-4 | Table 3 |
| Accuracy | 86.2%–90.1% (SD 1.26) | — | — | MARC-ja (zero-shot, 1000 samples) | Table 3, MARC-ja row for GPT-4 | Table 3 |
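The gap between best and worst template can be read directly off the ranges in the table. The short sketch below only redoes that subtraction; the dictionary and function names are illustrative, and the numbers are the GPT-4 ranges reported above.

```python
# GPT-4 accuracy ranges (percent) as reported in the results table.
reported_ranges = {
    "JNLI": (25.44, 49.21),
    "MARC-ja": (86.2, 90.1),
}


def spread(low, high):
    """Best-minus-worst accuracy in percentage points, rounded to 2 decimals."""
    return round(high - low, 2)


spreads = {task: spread(low, high) for task, (low, high) in reported_ranges.items()}
print(spreads)  # {'JNLI': 23.77, 'MARC-ja': 3.9}
```

The JNLI spread of 23.77 points is roughly six times the MARC-ja spread of 3.9 points, which matches the much larger SD reported for JNLI.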
What To Try In 7 Days
Run a multi-variant comparison across 5 prompt templates for each critical Japanese classification endpoint and record the worst-case accuracy.
Add a small LoRA or adapter trained on in-language data and compare performance versus base model.
Use an output-label extraction check and monitor the SD of accuracy across templates; keep high-variance templates out of automated pipelines.
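One way to run the template comparison and worst-case check from the steps above is sketched here. Everything in it is hypothetical: `classify` stands in for a real model call, the templates and examples are toy English strings rather than Japanese prompts, and the thresholds passed to `gate` are arbitrary placeholders to be tuned per endpoint.

```python
from statistics import pstdev


def evaluate_templates(templates, examples, classify):
    """Return accuracy (percent) per template.

    templates: dict of name -> format string containing a {text} placeholder
    examples:  list of (text, gold_label) pairs
    classify:  callable mapping a prompt string to a label (model stand-in)
    """
    results = {}
    for name, template in templates.items():
        correct = sum(
            classify(template.format(text=text)) == gold
            for text, gold in examples
        )
        results[name] = 100.0 * correct / len(examples)
    return results


def gate(accuracies, min_worst_case, max_sd):
    """Approve only if the worst template and the spread both pass thresholds."""
    worst_template = min(accuracies, key=accuracies.get)
    sd = pstdev(list(accuracies.values()))
    worst_case = accuracies[worst_template]
    return {
        "worst_template": worst_template,
        "worst_case": worst_case,
        "sd": sd,
        "approved": worst_case >= min_worst_case and sd <= max_sd,
    }


def fake_classify(prompt):
    # Toy stand-in for a model: the polite phrasing makes it always say "negative".
    if prompt.startswith("Please"):
        return "negative"
    return "positive" if "good" in prompt else "negative"


templates = {
    "plain": "Classify the sentiment: {text}",
    "polite": "Please classify the sentiment: {text}",
}
examples = [
    ("good product", "positive"),
    ("bad product", "negative"),
    ("very good", "positive"),
    ("not working", "negative"),
]

accuracies = evaluate_templates(templates, examples, fake_classify)
decision = gate(accuracies, min_worst_case=80.0, max_sd=5.0)
print(accuracies, decision)
```

Gating on the worst template rather than the mean reflects the concern that a single unlucky phrasing may be the one that reaches production.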
Reproducibility
Data URLs
Risks & Boundaries
Limitations
Tested only on Japanese; other non-dominant languages were not evaluated.
Only zero-shot classification was tested; no few-shot or fine-tuning experiments.
When Not To Use
Do not assume a single prompt will generalize across models or languages.
Avoid deploying without template-robustness testing for multi-class or high-stakes tasks.
Failure Modes
Large accuracy swings when changing honorifics, punctuation, or sentence ordering.
Smaller models may ignore instruction-style prompts and echo input, breaking label extraction.
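The echo failure mode above can be caught before scoring with a defensive label extractor. The sketch below is an assumption about how such a check might look, not the paper's method: the function name, the English label strings, and the substring-matching rule are all illustrative.

```python
def extract_label(raw_output, labels, input_text):
    """Return the single label found in raw_output, or None if unusable.

    Returns None when the model echoes the input text, when no label
    appears, or when more than one label appears (ambiguous answer).
    """
    text = raw_output.strip()
    source = input_text.strip()
    # Echo check: an output that contains the full input is treated as
    # a failure to follow the instruction.
    if source and source in text:
        return None
    lowered = text.lower()
    # Simple substring match; label sets where one label contains another
    # would need word-boundary matching instead.
    found = [label for label in labels if label.lower() in lowered]
    return found[0] if len(found) == 1 else None


# Hypothetical label set for a three-way inference task.
LABELS = ("entailment", "contradiction", "neutral")

clean = extract_label("Answer: Entailment", LABELS, "A man is walking.")
echoed = extract_label("A man is walking. neutral", LABELS, "A man is walking.")
ambiguous = extract_label("entailment or neutral", LABELS, "A man is walking.")
print(clean, echoed, ambiguous)
```

Counting `None` results separately from wrong labels makes it easier to tell whether a low score comes from misclassification or from outputs that never contained a usable label.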

