A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

February 13, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

The benchmark is ready to use as a cheap, deterministic evaluation for short Japanese Q&A: its scores correlate very strongly with GPT-4o judge scores (Pearson r = 0.9896) and align reasonably with other leaderboards.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 55%

Authors

Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami

Why It Matters For Business

You can cheaply and deterministically evaluate short Japanese Q&A outputs without repeated human or expensive LLM judging, cutting evaluation cost and speeding model iteration.

Summary TLDR

The authors build a judge-free benchmark for short, single-turn Q&A in Japanese. They create 50 curriculum-based questions, generate large reference answer sets using three high-capacity Japanese LLMs, and score models with three deterministic metrics (Fluency, Truthfulness, Helpfulness) computed from character n-grams and simple rules. The benchmark correlates very strongly with GPT-4o judging (r=0.9896) while avoiding costly LLM-as-judge runs and human labeling. The method is fast, deterministic, and best suited to short, factual Q&A in Japanese, not open creative tasks.
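To make the idea of scoring from character n-grams concrete, the following is a minimal sketch of measuring how much of a candidate answer's character n-grams also occur in a set of reference answers. The function names `char_ngrams` and `ngram_overlap`, the plain overlap ratio, and the example sentences are assumptions made for illustration; this is not the scoring formula used by pfgen-bench.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous character n-gram in text (empty if text is shorter than n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_overlap(candidate: str, references: list[str], n: int) -> float:
    """Fraction of the candidate's n-grams that occur in at least one reference."""
    candidate_grams = char_ngrams(candidate, n)
    if not candidate_grams:
        return 0.0
    # Pool the n-grams of all references into one lookup set.
    reference_grams = set()
    for ref in references:
        reference_grams.update(char_ngrams(ref, n))
    matched = sum(gram in reference_grams for gram in candidate_grams)
    return matched / len(candidate_grams)


# Illustrative data only (not taken from the benchmark).
references = ["富士山は日本で一番高い山です。", "日本一高い山は富士山です。"]
candidate = "富士山は日本一高い山です。"
print(ngram_overlap(candidate, references, n=2))
```

Working on characters rather than words avoids the need for a Japanese tokenizer, since Japanese text has no whitespace word boundaries.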

Problem Statement

Open-ended text evaluation typically needs humans or an LLM-as-judge, both costly and variable. The paper asks: can we evaluate short open-ended answers without judges by using distributional clues (n-grams) to detect fluent, truthful, and helpful outputs?

Main Contribution

A judge-free benchmark for short Q&A in Japanese using character n-gram statistics and rule checks.

A pipeline to build large reference answer sets (1.5B generated responses refined to 1,000 per question).

Key Findings

Benchmark scores correlate very highly with GPT-4o judge scores.

Numbers: Pearson r = 0.9896 (Section 5.2; Fig. 4)

Practical Use: You can approximate GPT-4o-style judgments for short Japanese Q&A at much lower compute cost by using this n-gram benchmark.

Evidence Ref: Section 5.2, Figure 4

Reference-set construction is stable across source models.

Numbers: Correlation > 0.999 between ensemble and single-model reference sets (Fig. 3)

Practical Use: Using one strong Japanese LLM to build the reference set produces similar evaluation results to an ensemble.

Evidence Ref: Section 5.1, Figure 3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Correlation with GPT-4o LLM-as-a-judge | r = 0.9896 | - | - | 50-question pfgen-bench set | Section 5.2; Figure 4 | Section 5.2, Fig. 4 |
| Reference-set stability | r > 0.999 (ensemble vs single model) | - | - | comparison of reference sets | Section 5.1; Figure 3 | Section 5.1, Fig. 3 |
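Both results above are Pearson correlations between two lists of per-model scores. As a rough illustration of how such a figure could be checked on your own models, here is a minimal sketch in plain Python. The score lists are made-up placeholders, not numbers from the paper, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for five hypothetical models (one entry per model).
benchmark_scores = [0.42, 0.55, 0.61, 0.78, 0.83]
judge_scores = [3.1, 4.0, 4.6, 6.2, 6.9]
print(f"Pearson r = {pearson_r(benchmark_scores, judge_scores):.4f}")
```

Pearson r measures linear agreement only, so a high value says the two scores rank and space models similarly, not that they are on the same scale.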

What To Try In 7 Days

Run the pfgen-bench code on new Japanese QA models to get reproducible Fluency/Truthfulness/Helpfulness scores.

Compare your model to public scores (Table 1) to identify weak areas, e.g., helpfulness.

Use the benchmark as a cheap smoke test before costly human or GPT-judge evaluations.

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Designed for short, single-turn Q&A in Japanese; the character n-gram design was chosen to suit Japanese text.

Helpfulness relies on manually crafted keyword rules per question, which need maintenance.
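To show why per-question keyword rules carry a maintenance cost, here is a minimal sketch of a keyword-coverage check. The rule format (a list of groups, where any keyword in a group satisfies that group), the function name `keyword_coverage`, and the example rules are all assumptions for illustration; the benchmark's actual rule syntax may differ.

```python
def keyword_coverage(answer: str, rule_groups: list[list[str]]) -> float:
    """Fraction of rule groups satisfied by the answer.

    A group counts as satisfied when at least one of its keywords
    appears in the answer as a substring.
    """
    if not rule_groups:
        return 0.0
    satisfied = sum(
        any(keyword in answer for keyword in group) for group in rule_groups
    )
    return satisfied / len(rule_groups)


# Hypothetical rules for a question about Japan's highest mountain.
rules = [
    ["富士山"],          # must name the mountain
    ["3776", "3,776"],   # height, with or without a thousands separator
    ["静岡", "山梨"],     # at least one of the prefectures it spans
]
answer = "日本で一番高い山は富士山で、標高は3776メートルです。"
print(keyword_coverage(answer, rules))  # 2 of the 3 groups are matched
```

Every accepted spelling or synonym has to be listed by hand, so each new question or newly observed phrasing means editing the rule set.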

When Not To Use

Long-form creative writing or idea generation without clear answer spaces.

Multi-turn conversational agents where context and coherence matter beyond n-grams.

Failure Modes

High overlap with common phrasing can inflate scores even if factual detail is wrong.

Very high-performing models can exceed the reference manifold, complicating interpretation.

Core Entities

Models

stockmark-100b

pfnet/plamo-100b

tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1

openai/gpt-4o

openai/gpt-4

anthropic/claude-3-5-sonnet-20240620

tokyotech-llm/Swallow-70b-NVE-instruct-hf

meta-llama/Meta-Llama-3.1-405B

Metrics

Fluency (character 1-10 n-gram overlap)

Truthfulness (proportion of 3-grams >=0.5%)

Helpfulness (keyword/rule coverage)
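The Truthfulness description is terse, so the sketch below rests on one possible reading: the share of an answer's character 3-grams that appear in at least 0.5% of the reference answers for that question. That reading, the function name `truthfulness_sketch`, and the `min_share` parameter are assumptions, not the paper's definition.

```python
from collections import Counter


def truthfulness_sketch(
    answer: str, references: list[str], min_share: float = 0.005
) -> float:
    """Share of the answer's character 3-grams found in >= min_share of references.

    Assumed interpretation of "proportion of 3-grams >= 0.5%"; not the
    benchmark's official formula.
    """
    answer_grams = [answer[i:i + 3] for i in range(len(answer) - 2)]
    if not answer_grams or not references:
        return 0.0
    # Document frequency: count each 3-gram at most once per reference answer.
    doc_freq = Counter()
    for ref in references:
        doc_freq.update({ref[i:i + 3] for i in range(len(ref) - 2)})
    threshold = min_share * len(references)
    supported = sum(doc_freq[gram] >= threshold for gram in answer_grams)
    return supported / len(answer_grams)


# Toy reference set; a real set would hold 1,000 answers per question.
references = ["abcde", "abcde", "abcde", "xyzzy"]
print(truthfulness_sketch("abcxy", references))
```

With 1,000 reference answers per question, a 0.5% threshold means a 3-gram must occur in at least 5 references to count as supported.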

Datasets

pfgen-bench (50 Japanese Q&A questions, each with 1,000 reference answers)

Benchmarks

Nejumi LLM Leaderboard

Japanese MT-Bench

LLM-as-a-judge (GPT-4o)