Overview
Production Readiness
0.7
Novelty Score
0.55
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
You can cheaply and deterministically evaluate short Japanese Q&A outputs without repeated human or expensive LLM judging, cutting evaluation cost and speeding model iteration.
Summary TLDR
The authors build a judge-free benchmark for short, single-turn Q&A in Japanese. They create 50 curriculum-based questions, generate large reference answer sets using three high-capacity Japanese LLMs, and score models with three deterministic metrics (Fluency, Truthfulness, Helpfulness) computed from character n-grams and simple rules. The benchmark correlates very strongly with GPT-4o judging (r=0.9896) while avoiding costly LLM-as-judge runs and human labeling. The method is fast, deterministic, and best suited to short, factual Q&A in Japanese, not open creative tasks.
Problem Statement
Open-ended text evaluation typically needs humans or an LLM-as-judge, both costly and variable. The paper asks: can we evaluate short open-ended answers without judges by using distributional clues (n-grams) to detect fluent, truthful, and helpful outputs?
Main Contribution
A judge-free benchmark for short Q&A in Japanese using character n-gram statistics and rule checks.
A pipeline to build large reference answer sets (1.5B generated responses refined to 1,000 per question).
Three deterministic metrics (Fluency, Truthfulness, Helpfulness) that together correlate strongly with GPT-4o judging.
Key Findings
Benchmark scores correlate very strongly with GPT-4o judge scores (r=0.9896).
Reference-set construction is stable across source models.
The benchmark aligns reasonably with existing Japanese leaderboards.
References were built by large-scale generation (1.5B generated responses) and then filtered down to 1,000 answers per question.
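The correlation finding can be checked on your own models with a few lines of code. The sketch below assumes that r is a Pearson coefficient computed over per-model scores, which this summary does not state explicitly. The function name `pearson_r` and the score lists are hypothetical placeholders, not values from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-model scores (illustration only, not from the paper):
# one benchmark score and one LLM-judge score per model, in the same order.
benchmark_scores = [0.42, 0.55, 0.61, 0.78, 0.90]
judge_scores = [3.1, 4.0, 4.6, 6.2, 7.4]

print(f"r = {pearson_r(benchmark_scores, judge_scores):.4f}")
```

A high r only says the two rankings move together across models; it does not show that individual answers are scored the same way by both methods.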
Results
Correlation with GPT-4o LLM-as-a-judge
Reference-set stability
Agreement with other Japanese benchmarks
Example model score (openai/gpt-4o)
Who Should Care
What To Try In 7 Days
Run the pfgen-bench code on new Japanese QA models to get reproducible Fluency/Truthfulness/Helpfulness scores.
Compare your model to public scores (Table 1) to identify weak areas, e.g., helpfulness.
Use the benchmark as a cheap smoke test before costly human or GPT-judge evaluations.
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Designed for short, single-turn Q&A in Japanese; character n-grams were chosen for Japanese specifics.
- Helpfulness relies on manually crafted keyword rules per question, which need maintenance.
- Reference set construction requires large up-front compute (1.5B generations) and depends on high-quality source models.
- N-gram statistics may fail to capture deep semantic correctness or long-form, multi-turn, or creative tasks.
When Not To Use
- Long-form creative writing or idea generation without clear answer spaces.
- Multi-turn conversational agents where context and coherence matter beyond n-grams.
- Non-Japanese languages without adapting token/granularity choices and reference sets.
Failure Modes
- High overlap with common phrasing can inflate scores even if factual detail is wrong.
- Very high-performing models can exceed the reference manifold, complicating interpretation.
- Short length constraints and truncation rules may reward brevity over full explanations.
- Manual helpfulness rules can miss valid alternative phrasings, lowering scores unfairly.
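The first failure mode can be illustrated with a toy example. The sketch below uses a simplified character 3-gram overlap (a stand-in, not the benchmark's actual scoring code) and a hypothetical one-sentence reference set. The candidate answer states the wrong height for Mt. Fuji but shares almost all of its phrasing with the references.

```python
def char_ngrams(text, n):
    """Return the list of character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def trigram_overlap(answer, references):
    """Share of the answer's character 3-grams that occur in any reference."""
    grams = char_ngrams(answer, 3)
    if not grams:
        return 0.0
    seen = {g for ref in references for g in char_ngrams(ref, 3)}
    return sum(g in seen for g in grams) / len(grams)

# Hypothetical reference set: the same correct sentence repeated.
references = ["富士山の高さは3776メートルです。"] * 3

# Factually wrong answer (2776 instead of 3776) with identical phrasing.
wrong_answer = "富士山の高さは2776メートルです。"

# Only the three 3-grams that contain the wrong digit are missing,
# so 13 of 16 3-grams still match: 0.8125.
print(trigram_overlap(wrong_answer, references))
```

A single wrong digit changes only a handful of n-grams, which is why overlap-based scores are a weak check on precise facts such as numbers and dates.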
Core Entities
Models
- stockmark-100b
- pfnet/plamo-100b
- tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1
- openai/gpt-4o
- openai/gpt-4
- anthropic/claude-3-5-sonnet-20240620
- tokyotech-llm/Swallow-70b-NVE-instruct-hf
- meta-llama/Meta-Llama-3.1-405B
Metrics
- Fluency (overlap of character n-grams, n = 1 to 10)
- Truthfulness (proportion of 3-grams with a reference-set frequency >= 0.5%)
- Helpfulness (keyword/rule coverage)
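A minimal sketch of how the three metrics could be computed is shown below. It is an illustration under stated assumptions, not the pfgen-bench implementation: the averaging over n in `fluency`, the reading of the 0.5% threshold as "share of reference answers containing the 3-gram", and the `keyword_groups` rule format are all assumptions, and the function names are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return the list of character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def fluency(answer, references, max_n=10):
    """Mean over n = 1..max_n of the share of answer n-grams seen in any reference.
    Assumption: unweighted mean; the real metric may weight n differently."""
    shares = []
    for n in range(1, max_n + 1):
        grams = char_ngrams(answer, n)
        if not grams:  # answer is shorter than n characters
            break
        seen = {g for ref in references for g in char_ngrams(ref, n)}
        shares.append(sum(g in seen for g in grams) / len(grams))
    return sum(shares) / len(shares) if shares else 0.0

def truthfulness(answer, references, n=3, min_share=0.005):
    """Share of answer 3-grams that occur in at least `min_share` of the references.
    Assumption: frequency is counted per reference answer."""
    grams = char_ngrams(answer, n)
    if not grams or not references:
        return 0.0
    doc_freq = Counter()
    for ref in references:
        doc_freq.update(set(char_ngrams(ref, n)))
    threshold = min_share * len(references)
    return sum(doc_freq[g] >= threshold for g in grams) / len(grams)

def helpfulness(answer, keyword_groups):
    """Share of keyword groups with at least one variant present in the answer.
    Assumption: rules are plain substring checks grouped by concept."""
    if not keyword_groups:
        return 0.0
    hits = sum(any(k in answer for k in group) for group in keyword_groups)
    return hits / len(keyword_groups)

# Hypothetical data for one question (illustration only).
refs = ["富士山は日本で一番高い山です。", "日本で最も高い山は富士山です。"]
rules = [["富士山"], ["高い", "高く"]]
answer = "富士山は日本で一番高い山です。"
print(fluency(answer, refs), truthfulness(answer, refs), helpfulness(answer, rules))
```

All three functions are pure and deterministic, so the same answer and reference set always yield the same scores, which is the property the benchmark relies on in place of a judge.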
Datasets
- pfgen-bench (50 Japanese Q&A questions, each with 1,000 reference answers)
Benchmarks
- Nejumi LLM Leaderboard
- Japanese MT-Bench
- LLM-as-a-judge (GPT-4o)

