A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Overview

Decision Snapshot: Needs Validation

The benchmark is ready as a cheap, deterministic evaluation for short Japanese Q&A; evidence shows very high correlations with GPT-4o and reasonable alignment with other leaderboards.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 55%

Authors

Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can cheaply and deterministically evaluate short Japanese Q&A outputs without repeated human or expensive LLM judging, cutting evaluation cost and speeding model iteration.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

The authors build a judge-free benchmark for short, single-turn Q&A in Japanese. They create 50 curriculum-based questions, generate large reference answer sets using three high-capacity Japanese LLMs, and score models with three deterministic metrics (Fluency, Truthfulness, Helpfulness) computed from character n-grams and simple rules. The benchmark correlates very strongly with GPT-4o judging (r=0.9896) while avoiding costly LLM-as-judge runs and human labeling. The method is fast, deterministic, and best suited to short, factual Q&A in Japanese, not open creative tasks.

Problem Statement

Open-ended text evaluation typically needs humans or an LLM-as-judge, both costly and variable. The paper asks: can we evaluate short open-ended answers without judges by using distributional clues (n-grams) to detect fluent, truthful, and helpful outputs?

Main Contribution

A judge-free benchmark for short Q&A in Japanese using character n-gram statistics and rule checks.

A pipeline to build large reference answer sets (1.5B generated responses refined to 1,000 per question).

Key Findings

Benchmark scores correlate very highly with GPT-4o judge scores.

Numbers: Pearson r = 0.9896 (Section 5.2; Fig. 4)

Practical Use: You can approximate GPT-4-style judgments for short Japanese Q&A at much lower compute cost by using this n-gram benchmark.

Evidence Ref: Section 5.2, Figure 4
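To make the correlation claim concrete, here is a minimal sketch of how a Pearson r between per-model benchmark scores and per-model judge scores could be computed. The function name and the score lists are illustrative assumptions with made-up numbers; they are not the paper's data or code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five models (made-up numbers, NOT from the paper).
benchmark_scores = [0.42, 0.55, 0.61, 0.78, 0.90]
judge_scores = [3.1, 4.0, 4.4, 6.2, 7.5]

r = pearson_r(benchmark_scores, judge_scores)
print(f"Pearson r = {r:.4f}")
```

A value close to 1 means the two scoring methods order and space the models almost identically. It does not by itself show that the scores agree on any individual answer.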

Reference-set construction is stable across source models.

Numbers: Correlation > 0.999 between ensemble and single-model reference sets (Fig. 3)

Practical Use: Using one strong Japanese LLM to build the reference set produces similar evaluation results to an ensemble.

Evidence Ref: Section 5.1, Figure 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Correlation with GPT-4o LLM-as-a-judge	r = 0.9896	—	—	50-question pfgen-bench set	Section 5.2; Figure 4	Section 5.2, Fig.4
Reference-set stability	r > 0.999 (ensemble vs single model)	—	—	comparison of reference sets	Section 5.1; Figure 3	Section 5.1, Fig.3

What To Try In 7 Days

Run the pfgen-bench code on new Japanese QA models to get reproducible Fluency/Truthfulness/Helpfulness scores.

Compare your model to public scores (Table 1) to identify weak areas, e.g., helpfulness.

Use the benchmark as a cheap smoke test before costly human or GPT-judge evaluations.
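The smoke-test idea in the steps above can be sketched as a simple gate that only escalates a model to the costly judge when no metric falls below a chosen minimum. Everything here is a hypothetical illustration: the `BenchScores` container, the assumption that scores lie between 0 and 1, and the threshold values are not taken from the pfgen-bench code or the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchScores:
    """Hypothetical container for the three benchmark metrics (assumed 0..1)."""
    fluency: float
    truthfulness: float
    helpfulness: float


def weak_areas(scores, minimums):
    """Return the names of metrics that fall below their minimum."""
    names = ("fluency", "truthfulness", "helpfulness")
    return [n for n in names if getattr(scores, n) < getattr(minimums, n)]


def should_escalate(scores, minimums):
    """Escalate to human or LLM judging only when no metric is weak."""
    return not weak_areas(scores, minimums)


# Made-up thresholds and scores, for illustration only.
MINIMUMS = BenchScores(fluency=0.6, truthfulness=0.6, helpfulness=0.5)
candidate = BenchScores(fluency=0.8, truthfulness=0.7, helpfulness=0.4)

print(weak_areas(candidate, MINIMUMS))
print(should_escalate(candidate, MINIMUMS))
```

Reporting the weak metric names rather than a single pass/fail flag keeps the gate useful for the second step above, where the goal is to identify which area needs work.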

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/pfnet-research/pfgen-bench

Data URLs

https://github.com/pfnet-research/pfgen-bench

Risks & Boundaries

Limitations

Designed for short, single-turn Q&A in Japanese; the character n-gram design is tailored to Japanese and may not transfer directly to other languages.

Helpfulness relies on manually crafted keyword rules per question, which need maintenance.
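To illustrate why per-question keyword rules carry a maintenance cost, here is a minimal sketch of one way such a rule check could look. The rule format (a list of keyword groups, each satisfied by any one of its alternatives), the question id, and the example keywords are assumptions made for illustration; the actual rule format in pfgen-bench may differ.

```python
def keyword_coverage(answer, rule_groups):
    """Fraction of rule groups with at least one keyword found in `answer`."""
    if not rule_groups:
        return 0.0
    hits = sum(any(kw in answer for kw in group) for group in rule_groups)
    return hits / len(rule_groups)


# Hypothetical rules for a made-up question ("What is the capital of Japan?").
# Each inner list is a group of acceptable alternatives.
RULES = {
    "q_capital": [["東京", "とうきょう"], ["首都"]],
}

good = "日本の首都は東京です"
wrong = "日本の首都は大阪です"

print(keyword_coverage(good, RULES["q_capital"]))
print(keyword_coverage(wrong, RULES["q_capital"]))
```

Every new question needs its own hand-written entry in a table like `RULES`, and every accepted spelling or synonym has to be listed explicitly, which is the maintenance burden the limitation describes.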

When Not To Use

Long-form creative writing or idea generation without clear answer spaces.

Multi-turn conversational agents where context and coherence matter beyond n-grams.

Failure Modes

High overlap with common phrasing can inflate scores even if factual detail is wrong.

Very high-performing models can exceed the reference manifold, complicating interpretation.

Core Entities

Models

stockmark-100b

pfnet/plamo-100b

tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1

openai/gpt-4o

openai/gpt-4

anthropic/claude-3-5-sonnet-20240620

tokyotech-llm/Swallow-70b-NVE-instruct-hf

meta-llama/Meta-Llama-3.1-405B

Metrics

Fluency (character 1-10 n-gram overlap)

Truthfulness (proportion of 3-grams >= 0.5%)

Helpfulness (keyword/rule coverage)
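The metric descriptions above are terse, so here is a minimal sketch of one possible reading of the 3-gram metric: the share of an answer's character 3-grams that occur in at least 0.5% of the reference answers. This interpretation, the function names, and the toy reference set are assumptions; the paper's exact formula may differ. The toy example uses a 50% threshold because it has only three references.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def frequent_trigrams(references, min_share=0.005):
    """3-grams that occur in at least `min_share` of the reference answers."""
    doc_freq = Counter()
    for ref in references:
        # Use a set so each reference counts a given 3-gram at most once.
        doc_freq.update(set(char_ngrams(ref, 3)))
    threshold = min_share * len(references)
    return {gram for gram, count in doc_freq.items() if count >= threshold}


def trigram_support(answer, frequent):
    """Share of the answer's 3-grams that are frequent in the references."""
    grams = char_ngrams(answer, 3)
    if not grams:
        return 0.0
    return sum(gram in frequent for gram in grams) / len(grams)


# Toy reference set (made up); a real set has 1,000 answers per question.
references = ["東京は日本の首都です", "日本の首都は東京です", "首都は東京都です"]
frequent = frequent_trigrams(references, min_share=0.5)

right = trigram_support("日本の首都は東京です", frequent)
wrong = trigram_support("日本の首都は大阪です", frequent)
print(right, wrong)
```

In this toy case the answer naming the wrong city scores lower but still well above zero, because much of its phrasing is shared with the references. That matches the failure mode noted under Risks & Boundaries, where common phrasing can inflate scores.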

Datasets

pfgen-bench (50 Japanese Q&A questions, each with 1,000 reference answers)

Benchmarks

Nejumi LLM Leaderboard

Japanese MT-Bench

LLM-as-a-judge (GPT-4o)

