A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

February 13, 2025 · 6 min

Overview

Production Readiness

0.7

Novelty Score

0.55

Cost Impact Score

0.8

Citation Count

0

Authors

Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami

Links

Abstract / PDF

Why It Matters For Business

You can cheaply and deterministically evaluate short Japanese Q&A outputs without repeated human or expensive LLM judging, cutting evaluation cost and speeding model iteration.

Summary TLDR

The authors build a judge-free benchmark for short, single-turn Q&A in Japanese. They write 50 curriculum-based questions, generate large reference answer sets with three high-capacity Japanese LLMs, and score models on three deterministic metrics (Fluency, Truthfulness, Helpfulness) computed from character n-grams and simple rules. The benchmark score correlates very strongly with GPT-4o judging (Pearson r = 0.9896) without requiring LLM-as-judge runs or human labeling at evaluation time. The method is fast and deterministic, and it is best suited to short, factual Q&A in Japanese rather than open-ended creative tasks.

Problem Statement

Open-ended text evaluation typically needs humans or an LLM-as-judge, both costly and variable. The paper asks: can we evaluate short open-ended answers without judges by using distributional clues (n-grams) to detect fluent, truthful, and helpful outputs?

Main Contribution

A judge-free benchmark for short Q&A in Japanese using character n-gram statistics and rule checks.

A pipeline to build large reference answer sets (1.5B generated responses refined to 1,000 per question).

Three deterministic metrics (Fluency, Truthfulness, Helpfulness) that together correlate strongly with GPT-4o judging.
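The idea behind scoring with character n-gram statistics can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `char_ngrams` and `ngram_overlap`, the toy reference answers, and the choice to measure the share of matched n-grams at a single n are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every contiguous run of n characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_overlap(answer: str, references: list[str], n: int) -> float:
    """Share of the answer's character n-grams found in any reference (hypothetical metric)."""
    answer_grams = char_ngrams(answer, n)
    total = sum(answer_grams.values())
    if total == 0:
        return 0.0  # answer shorter than n characters
    seen = set()
    for ref in references:
        seen.update(char_ngrams(ref, n))
    matched = sum(count for gram, count in answer_grams.items() if gram in seen)
    return matched / total


# Toy reference set; the real benchmark uses 1,000 references per question.
references = ["富士山は日本で一番高い山です。", "日本で最も高い山は富士山です。"]
print(ngram_overlap("富士山は日本一高い山です。", references, n=2))
```

Because the score depends only on string counts, the same answer always receives the same score, which is what makes this style of evaluation deterministic.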

Key Findings

Benchmark scores correlate very highly with GPT-4o judge scores.

Numbers: Pearson r = 0.9896 (Section 5.2; Fig. 4)

Reference-set construction is stable across source models.

Numbers: Correlation > 0.999 between ensemble and single-model reference sets (Fig. 3)

The benchmark aligns reasonably with existing Japanese leaderboards.

Numbers: Correlation > 0.7 with Nejumi and Japanese MT-Bench (Section 5.3)

Large-scale generation was used to build references before filtering.

Numbers: 1.5 billion generated responses reduced to 1,000 per question (Section 3.2)
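The correlation figures in these findings are Pearson coefficients computed over per-model scores. A minimal sketch of that computation follows, assuming you have one benchmark score and one judge score per model. The function name `pearson_r` and the score lists are hypothetical placeholders, not values from the paper.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for four imaginary models (illustration only).
benchmark_scores = [0.42, 0.58, 0.71, 0.86]
judge_scores = [3.1, 5.0, 6.8, 8.9]
print(round(pearson_r(benchmark_scores, judge_scores), 4))
```

A coefficient near 1 means the two evaluations rank and space models almost identically; it does not by itself show that individual answers are scored the same way.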

Results

Correlation with GPT-4o LLM-as-a-judge

Value: r = 0.9896

Reference-set stability

Value: r > 0.999 (ensemble vs. single model)

Agreement with other Japanese benchmarks

Value: r > 0.7

Example model score (openai/gpt-4o)

Value: Score = 0.8615 (Fluency 0.919, Truthfulness 0.98, Helpfulness 0.686)

Baseline: Reference answer set score = 0.8494
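The gpt-4o numbers above are consistent with the overall score being the unweighted mean of the three metrics. That is an inference from the reported figures, not something this summary states, so treat the check below as a sanity test rather than the official formula.

```python
# Sub-scores as reported for openai/gpt-4o (rounded values from this summary).
fluency, truthfulness, helpfulness = 0.919, 0.98, 0.686
reported_score = 0.8615

# Assumption: the overall score is a plain average of the three metrics.
mean_score = (fluency + truthfulness + helpfulness) / 3
print(round(mean_score, 4))  # prints 0.8617

# The small gap is within what rounding of the three sub-scores can explain.
print(abs(mean_score - reported_score) < 0.001)  # prints True
```

Under this reading, gpt-4o edges past the reference-set baseline (0.8494) mostly on Fluency and Truthfulness, while Helpfulness is its weakest component.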

What To Try In 7 Days

Run the pfgen-bench code on new Japanese QA models to get reproducible Fluency/Truthfulness/Helpfulness scores.

Compare your model to public scores (Table 1) to identify weak areas, e.g., helpfulness.

Use the benchmark as a cheap smoke test before costly human or GPT-judge evaluations.

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Designed for short, single-turn Q&A in Japanese; the choice of character n-grams is specific to Japanese text.
  • Helpfulness relies on manually crafted keyword rules per question, which need maintenance.
  • Reference set construction requires large up-front compute (1.5B generations) and depends on high-quality source models.
  • N-gram statistics may fail to capture deep semantic correctness or long-form, multi-turn, or creative tasks.

When Not To Use

  • Long-form creative writing or idea generation without clear answer spaces.
  • Multi-turn conversational agents where context and coherence matter beyond n-grams.
  • Non-Japanese languages without adapting token/granularity choices and reference sets.

Failure Modes

  • High overlap with common phrasing can inflate scores even if factual detail is wrong.
  • Very high-performing models can exceed the reference manifold, complicating interpretation.
  • Short length constraints and truncation rules may reward brevity over full explanations.
  • Manual helpfulness rules can miss valid alternative phrasings, lowering scores unfairly.
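The last failure mode is easy to reproduce with a toy keyword checker. The sketch below assumes a rule format in which each question has a list of keyword groups and any variant within a group counts as a hit. That format, the function name `keyword_coverage`, and the example rules are hypothetical; the summary does not describe the benchmark's actual rule syntax.

```python
def keyword_coverage(answer: str, keyword_groups: list[list[str]]) -> float:
    """Fraction of keyword groups with at least one variant present in the answer."""
    if not keyword_groups:
        return 0.0
    hits = sum(
        any(variant in answer for variant in group) for group in keyword_groups
    )
    return hits / len(keyword_groups)


answer = "日本一高い山は富士山で、標高は3,776メートルです。"

# A strict rule set that only knows one way to write the height.
strict_rules = [["富士山"], ["3776"]]
# The same rules with alternative spellings added by a maintainer.
lenient_rules = [["富士山"], ["3776", "3,776", "三七七六"]]

print(keyword_coverage(answer, strict_rules))   # prints 0.5
print(keyword_coverage(answer, lenient_rules))  # prints 1.0
```

The answer is correct in both cases, but the strict rules penalize it for writing the number with a thousands separator. This is why hand-written rules need ongoing maintenance.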

Core Entities

Models

  • stockmark-100b
  • pfnet/plamo-100b
  • tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1
  • openai/gpt-4o
  • openai/gpt-4
  • anthropic/claude-3-5-sonnet-20240620
  • tokyotech-llm/Swallow-70b-NVE-instruct-hf
  • meta-llama/Meta-Llama-3.1-405B

Metrics

  • Fluency (overlap of character n-grams, n = 1 to 10)
  • Truthfulness (proportion of character 3-grams at or above a 0.5% threshold)
  • Helpfulness (keyword/rule coverage)
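One plausible reading of the Truthfulness entry is "the share of an answer's character 3-grams that appear in at least 0.5% of the reference answers". The sketch below implements that reading only. The function names, the document-frequency interpretation of the threshold, and the toy data are assumptions, and the paper's exact definition may differ.

```python
from collections import Counter


def trigram_set(text: str) -> set[str]:
    """Distinct character 3-grams in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def supported_trigram_ratio(
    answer: str, references: list[str], min_share: float = 0.005
) -> float:
    """Share of the answer's 3-grams found in at least min_share of references."""
    if not references:
        raise ValueError("at least one reference answer is required")
    # Document frequency: in how many references does each 3-gram occur?
    doc_freq = Counter()
    for ref in references:
        doc_freq.update(trigram_set(ref))
    grams = [answer[i:i + 3] for i in range(len(answer) - 2)]
    if not grams:
        return 0.0  # answer shorter than three characters
    threshold = min_share * len(references)
    supported = sum(1 for gram in grams if doc_freq[gram] >= threshold)
    return supported / len(grams)


references = ["富士山は日本で一番高い山です。", "日本で最も高い山は富士山です。"]
print(supported_trigram_ratio("富士山は日本で一番高い山です。", references))
```

With 1,000 references per question, a 0.5% threshold would require a 3-gram to occur in at least five of them, which filters out phrasing that only a stray reference supports.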

Datasets

  • pfgen-bench (50 Japanese Q&A questions, each with 1,000 reference answers)

Benchmarks

  • Nejumi LLM Leaderboard
  • Japanese MT-Bench
  • LLM-as-a-judge (GPT-4o)