Overview
The benchmark is ready for use as a cheap, deterministic evaluation for short Japanese Q&A. Its scores correlate very strongly with GPT-4o judging (r = 0.9896) and align reasonably well with other leaderboards.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 55%
Why It Matters For Business
You can cheaply and deterministically evaluate short Japanese Q&A outputs without repeated human or expensive LLM judging, cutting evaluation cost and speeding model iteration.
Summary TLDR
The authors build a judge-free benchmark for short, single-turn Q&A in Japanese. They create 50 curriculum-based questions, generate large reference answer sets using three high-capacity Japanese LLMs, and score models with three deterministic metrics (Fluency, Truthfulness, Helpfulness) computed from character n-grams and simple rules. The benchmark correlates very strongly with GPT-4o judging (r=0.9896) while avoiding costly LLM-as-judge runs and human labeling. The method is fast, deterministic, and best suited to short, factual Q&A in Japanese, not open creative tasks.
Problem Statement
Open-ended text evaluation typically needs humans or an LLM-as-judge, both costly and variable. The paper asks: can we evaluate short open-ended answers without judges by using distributional clues (n-grams) to detect fluent, truthful, and helpful outputs?
Main Contribution
A judge-free benchmark for short Q&A in Japanese using character n-gram statistics and rule checks.
A pipeline to build large reference answer sets (1.5B generated responses refined to 1,000 per question).
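To make the character n-gram idea concrete, here is a minimal Python sketch of an overlap score between a candidate answer and a set of reference answers. It is an illustration only, not the paper's Fluency or Truthfulness formula: the function names `char_ngrams` and `ngram_overlap`, the default `n = 3`, and the "fraction of candidate n-grams seen in any reference" definition are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_overlap(candidate: str, references: list[str], n: int = 3) -> float:
    """Fraction of the candidate's character n-grams found in any reference.

    Hypothetical scoring rule for illustration; the benchmark's real metrics
    are defined in the paper and its code.
    """
    cand = char_ngrams(candidate, n)
    total = sum(cand.values())
    if total == 0:
        # Candidate is shorter than n characters: nothing to compare.
        return 0.0

    # Pool every n-gram that appears in at least one reference answer.
    ref_vocab: set[str] = set()
    for ref in references:
        ref_vocab.update(char_ngrams(ref, n))

    matched = sum(count for gram, count in cand.items() if gram in ref_vocab)
    return matched / total


if __name__ == "__main__":
    refs = ["富士山は日本で一番高い山です。"]
    print(ngram_overlap("富士山は高い", refs, n=2))
```

Working on characters rather than words avoids the need for a Japanese tokenizer, which keeps a score like this deterministic and cheap to recompute.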
Key Findings
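Benchmark scores correlate very highly with GPT-4o judge scores (r = 0.9896 on the 50-question set).
Reference-set construction is stable across source models (r > 0.999 between the ensemble and single-model reference sets).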
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Correlation with GPT-4o LLM-as-a-judge | r = 0.9896 | — | — | 50-question pfgen-bench set | Section 5.2; Figure 4 | Section 5.2, Fig.4 |
| Reference-set stability | r > 0.999 (ensemble vs single model) | — | — | comparison of reference sets | Section 5.1; Figure 3 | Section 5.1, Fig.3 |
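Both rows in the table report Pearson correlation coefficients. As a reminder of what that figure measures, the sketch below computes Pearson's r between two per-model score lists. The function name `pearson_r` and the score values in the usage example are made up for illustration; they are not results from the paper.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Covariance numerator and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy numbers only: one benchmark score and one judge score per model.
    benchmark_scores = [0.42, 0.55, 0.61, 0.78]
    judge_scores = [5.1, 6.0, 6.8, 8.2]
    print(pearson_r(benchmark_scores, judge_scores))
```

A value near 1 means the two scoring methods rank and space the models almost identically, which is the sense in which the benchmark is said to track GPT-4o judging.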
What To Try In 7 Days
Run the pfgen-bench code on new Japanese QA models to get reproducible Fluency/Truthfulness/Helpfulness scores.
Compare your model to public scores (Table 1) to identify weak areas, e.g., helpfulness.
Use the benchmark as a cheap smoke test before costly human or GPT-judge evaluations.
Risks & Boundaries
Limitations
Designed for short, single-turn Q&A in Japanese; the character n-gram design was chosen to suit Japanese text and may not transfer unchanged to other languages.
Helpfulness relies on manually crafted keyword rules per question, which need maintenance.
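The maintenance burden of per-question keyword rules is easier to see with an example. The sketch below assumes a hypothetical rule format in which each question has a list of keyword groups, and each group lists acceptable spelling variants. This format, the function name `keyword_rule_score`, and the sample rules are assumptions for illustration; the benchmark's actual Helpfulness rules are not described in this summary.

```python
def keyword_rule_score(answer: str, keyword_groups: list[list[str]]) -> float:
    """Fraction of keyword groups with at least one variant present in `answer`.

    Hypothetical rule format: every inner list holds interchangeable variants
    (for example kanji and kana spellings) of one expected piece of content.
    """
    if not keyword_groups:
        # No rules defined for this question: nothing can be credited.
        return 0.0

    hits = sum(
        1 for variants in keyword_groups
        if any(variant in answer for variant in variants)
    )
    return hits / len(keyword_groups)


if __name__ == "__main__":
    # Made-up rules for a question about the height of Mount Fuji.
    rules = [["富士山", "ふじさん"], ["3776", "3,776"]]
    print(keyword_rule_score("富士山の標高は3776メートルです。", rules))
```

Every new question, and every new acceptable phrasing of an old answer, needs a manual update to rules like these, which is the maintenance cost noted above.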
When Not To Use
Long-form creative writing or idea generation without clear answer spaces.
Multi-turn conversational agents where context and coherence matter beyond n-grams.
Failure Modes
High overlap with common phrasing can inflate scores even if factual detail is wrong.
Very high-performing models can produce answers that fall outside the range covered by the reference answers, which complicates score interpretation.

