Fine-tuned open-source LLMs can act as fast, accurate judges for other LLMs

October 26, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

18

Authors

Lianghui Zhu, Xinggang Wang, Xinlong Wang

Links

Abstract / PDF

Why It Matters For Business

JudgeLM lets teams run fast, reproducible, and local automatic evaluations instead of slow human/API judging; this lowers cost and speeds model iteration while keeping judgments consistent.

Summary TLDR

The authors build JudgeLM: open-source LLMs (7B/13B/33B) fine-tuned on 100K GPT-4 judgments to act as automatic judges for open-ended LLM outputs. JudgeLM matches GPT-4 closely (up to ~90% agreement on their benchmark), runs orders of magnitude faster than prior open-source judges, and includes three simple fine-tuning tricks (swap augmentation, reference support, reference drop) to reduce position, knowledge, and format biases. The dataset (100K train, 5K val) and code are released for academic use.

Problem Statement

Existing metrics and benchmarks poorly capture quality in open-ended LLM outputs. Human or API-based judging is costly or non-reproducible. We need scalable, reproducible judges that mimic high-quality human-like evaluation.

Main Contribution

A large, checked judge dataset: 100K training seeds and 5K validation seeds with GPT-4 judgments and human re-checks.

JudgeLM: fine-tuned open-source judges at 7B/13B/33B that reach high agreement with GPT-4 and generalize to many judging modes.

Three practical fixes for judge fine-tuning—swap augmentation, reference support, reference drop—that cut position/knowledge/format biases.

Key Findings

Large fine-tuned JudgeLM reaches near-GPT-4 agreement on the authors' benchmark.

Numbers: Agreement 90.06% (JudgeLM-33B, 100K finetune)

Small JudgeLM is extremely fast when run in parallel.

Numbers: JudgeLM-7B judges 5K pairs in 3 minutes on 8 A100 GPUs vs. 6h40m for the PandaLM baseline
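The figures above can be turned into a rough speedup and throughput estimate. This is only back-of-the-envelope arithmetic: it assumes "5k" means exactly 5,000 pairs, and the summary does not state the baseline's hardware, so the ratio compares wall-clock time rather than per-GPU efficiency.

```python
# Figures quoted in the summary: 5k pairs, 3 minutes on 8 A100 GPUs,
# versus 6h40m for the baseline. "5k" is assumed to mean exactly 5,000.
PAIRS = 5_000
GPUS = 8
judgelm_minutes = 3
baseline_minutes = 6 * 60 + 40  # 6h40m expressed in minutes

# Wall-clock speedup (the baseline's hardware is not stated in the summary).
speedup = baseline_minutes / judgelm_minutes

# Throughput of the parallel JudgeLM-7B run, overall and per GPU.
pairs_per_second = PAIRS / (judgelm_minutes * 60)
pairs_per_second_per_gpu = pairs_per_second / GPUS

print(f"baseline: {baseline_minutes} min, speedup: {speedup:.1f}x")
print(f"throughput: {pairs_per_second:.1f} pairs/s total, "
      f"{pairs_per_second_per_gpu:.2f} pairs/s per GPU")
```

Under these assumptions the run is roughly 133 times faster in wall-clock terms, at about 28 judged pairs per second across the 8 GPUs.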

Swap augmentation reduces position bias and raises self-consistency.

Numbers: Consistency +5.44%; bias toward 1st reduced from 19.83% to 15.34%
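The idea behind swap augmentation can be sketched as follows: every training pair is also presented with the two answers (and their scores) in the opposite order, so the judge cannot learn to favor a position. This is a minimal sketch, not the authors' code; the `JudgeSample` fields and helper names are hypothetical, and any reasoning text that mentions "Answer 1" or "Answer 2" would need to be swapped as well, which is omitted here.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class JudgeSample:
    """Hypothetical training record: one question, two answers, teacher scores."""
    question: str
    answer_1: str
    answer_2: str
    score_1: float
    score_2: float


def swap(sample: JudgeSample) -> JudgeSample:
    """Return a copy with the answers and their scores exchanged."""
    return replace(
        sample,
        answer_1=sample.answer_2,
        answer_2=sample.answer_1,
        score_1=sample.score_2,
        score_2=sample.score_1,
    )


def swap_augment(samples: List[JudgeSample]) -> List[JudgeSample]:
    """Keep each original sample and add its position-swapped twin."""
    augmented: List[JudgeSample] = []
    for sample in samples:
        augmented.append(sample)
        augmented.append(swap(sample))
    return augmented


example = JudgeSample("What is 2 + 2?", "4", "5", score_1=10.0, score_2=1.0)
for item in swap_augment([example]):
    print(item.answer_1, item.score_1, "|", item.answer_2, item.score_2)
```

Because each score travels with its answer, the teacher's preference is unchanged while the position of the preferred answer alternates.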

Reference support and reference drop reduce format and knowledge mismatches.

Numbers: Matching format: agreement 80.15% (fine-tuned w/ ref); reference drop yields 80.35% on mixed formats
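One way to picture reference support and reference drop together: the prompt has an optional reference slot, and during fine-tuning the reference is randomly withheld so the judge sees both formats. The sketch below is illustrative only; the prompt layout, the function names, and the `drop_prob` parameter are assumptions, not the authors' template or settings.

```python
import random
from typing import Optional


def build_prompt(question: str, answer_1: str, answer_2: str,
                 reference: Optional[str] = None) -> str:
    """Assemble a judging prompt; the reference section is optional (hypothetical layout)."""
    parts = [f"[Question]\n{question}"]
    if reference is not None:
        parts.append(f"[Reference Answer]\n{reference}")
    parts.append(f"[Answer 1]\n{answer_1}")
    parts.append(f"[Answer 2]\n{answer_2}")
    return "\n\n".join(parts)


def maybe_drop_reference(reference: Optional[str], drop_prob: float,
                         rng: random.Random) -> Optional[str]:
    """Withhold the reference with probability drop_prob (an assumed hyperparameter)."""
    if reference is None or rng.random() < drop_prob:
        return None
    return reference


rng = random.Random(0)  # seeded so the sketch is repeatable
reference = maybe_drop_reference("Paris", drop_prob=0.5, rng=rng)
print(build_prompt("Capital of France?", "Paris", "Lyon", reference))
```

Training on a mix of prompts with and without the reference section is what lets a single judge handle both evaluation formats.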

Results

Agreement with GPT-4

Value: 90.06%

Baseline: JudgeLM-33B at 100K finetune

End-to-end evaluation latency

Value: 3 minutes for 5K pairs

Baseline: PandaLM baseline 6h40m

Zero-shot agreement on PandaLM test (w/ human labels)

Value: 75.18% (JudgeLM-33B)

Baseline: GPT-4 reported 66.47% on same split

Who Should Care

What To Try In 7 Days

Run JudgeLM-7B locally in place of the GPT-4 API for internal A/B grading to save cost.

Fine-tune a small JudgeLM with swap augmentation to reduce position bias in pairwise comparisons.

Use reference support plus reference drop to make judges robust across with/without-reference evaluations.

Agent Features

Architectures

  • LLaMA / Vicuna based

Optimization Features

Infra Optimization

  • Parallel judge across 8 A100 GPUs for high throughput

Model Optimization

  • Fine-tuned specialist judges (7B/13B/33B)

System Optimization

  • Grade-then-judge-then-reason pipeline to avoid unnecessary generation
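A scores-first output order means the verdict can be read without generating or parsing the explanation. The sketch below assumes, purely for illustration, that the judge's first output line holds two space-separated scores and that any reasoning follows on later lines; the exact output format is not given in this summary.

```python
from typing import Tuple


def parse_scores(output: str) -> Tuple[float, float]:
    """Read the two scores from the first line of a judge output (assumed format)."""
    lines = output.strip().splitlines()
    if not lines:
        raise ValueError("empty judge output")
    tokens = lines[0].split()
    if len(tokens) != 2:
        raise ValueError(f"expected two scores on the first line, got: {lines[0]!r}")
    return float(tokens[0]), float(tokens[1])


def verdict(score_1: float, score_2: float) -> str:
    """Turn a score pair into a pairwise verdict."""
    if score_1 > score_2:
        return "first"
    if score_2 > score_1:
        return "second"
    return "tie"


raw = "8 6\nAnswer 1 is more complete and cites the correct formula."
scores = parse_scores(raw)
print(scores, verdict(*scores))
```

With this ordering, decoding can stop after the first line whenever the explanation is not needed.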

Training Optimization

  • Swap augmentation
  • Reference support
  • Reference drop

Inference Optimization

  • Skip optional reasoning step to speed evaluation
  • Parallel scoring across GPUs
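Parallel scoring can be as simple as splitting the answer pairs into one shard per GPU and running an independent judge process on each. The helper below is a hypothetical sketch of the sharding step only; it does not load a model, and the round-robin split is an assumption rather than the authors' scheme.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], num_workers: int) -> List[List[T]]:
    """Round-robin split so shard sizes differ by at most one item."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [list(items[i::num_workers]) for i in range(num_workers)]


pair_ids = list(range(5_000))  # stand-in for 5K answer pairs
shards = shard(pair_ids, num_workers=8)  # one shard per GPU
print([len(s) for s in shards])
```

Each shard can then be handed to a separate worker, and the per-shard results concatenated once all workers finish.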

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training labels come from GPT-4; teacher biases may transfer into JudgeLM.
  • Dataset released only for academic use; commercial use may be restricted.
  • Performance drops on out-of-domain or knowledge-scarce tasks unless references are provided.
  • Large models and large judge datasets are still costly to produce (authors spent ≈$4k on GPT-4 data).

When Not To Use

  • High-stakes evaluations requiring certified human review (legal, medical, safety-critical).
  • Commercial deployments if dataset license forbids it.
  • Tasks with novel knowledge not covered by training without reliable references.

Failure Modes

  • Position bias favoring first-listed answers if swap augmentation is not used.
  • Knowledge bias when judge lacks domain facts and no reference is available.
  • Format bias when model is fine-tuned only with or only without references.
  • Overfitting to GPT-4 preferences that differ from specific human evaluators.

Core Entities

Models

  • JudgeLM-7B
  • JudgeLM-13B
  • JudgeLM-33B
  • Vicuna (base)
  • LLaMA / LLaMA2
  • GPT-4 (teacher)
  • GPT-3.5

Metrics

  • Agreement
  • Precision
  • Recall
  • F1
  • Consistency (swap)
  • Bias toward 1st
  • Bias toward 2nd
  • Delta bias
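The agreement and swap-based metrics listed above could be computed along these lines. The definitions here are one plausible reading of the metric names, not taken from this summary: consistency is the share of pairs whose verdict survives swapping the answer order, bias toward 1st (2nd) is the share where the judge picks the first (second) position both before and after the swap, and delta bias is the absolute gap between the two.

```python
from typing import Dict, List

# Verdicts are positional: "first", "second", or "tie".
FLIP = {"first": "second", "second": "first", "tie": "tie"}


def agreement(judge: List[str], teacher: List[str]) -> float:
    """Fraction of pairs where the judge's verdict matches the teacher's."""
    if len(judge) != len(teacher) or not judge:
        raise ValueError("need two non-empty lists of equal length")
    return sum(j == t for j, t in zip(judge, teacher)) / len(judge)


def swap_metrics(original: List[str], swapped: List[str]) -> Dict[str, float]:
    """original[i] and swapped[i] are verdicts on the same pair, before and after swapping positions."""
    if len(original) != len(swapped) or not original:
        raise ValueError("need two non-empty lists of equal length")
    n = len(original)
    pairs = list(zip(original, swapped))
    # Consistent: the same underlying answer wins in both orders.
    consistency = sum(FLIP[s] == o for o, s in pairs) / n
    # Position-biased: the same slot wins no matter which answer sits in it.
    toward_1st = sum(o == "first" and s == "first" for o, s in pairs) / n
    toward_2nd = sum(o == "second" and s == "second" for o, s in pairs) / n
    return {
        "consistency": consistency,
        "bias_toward_1st": toward_1st,
        "bias_toward_2nd": toward_2nd,
        "delta_bias": abs(toward_1st - toward_2nd),
    }


original = ["first", "first", "second", "tie"]
swapped = ["second", "first", "second", "tie"]
print(swap_metrics(original, swapped))
print(agreement(["first", "tie"], ["first", "second"]))
```

In the toy data, the first pair is consistent (the same answer wins in both orders), the second shows first-position bias, the third shows second-position bias, and the fourth is a consistent tie.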

Datasets

  • JudgeLM dataset (100K train, 5K val)
  • PandaLM test
  • MM-Vet
  • ToxicChat
  • RewardBench subsets

Benchmarks

  • JudgeLM benchmark (authors' val set)
  • PandaLM benchmark
  • MM-Vet (multimodal)
  • ToxicChat (toxicity)

Context Entities

Models

  • PandaLM-7B
  • Auto-J-13B
  • InstructScore-7B
  • OpenAI Moderation
  • LLaVA

Datasets

  • Alpaca-GPT4
  • Dolly-15K
  • GPT4All-LAION
  • ShareGPT

Benchmarks

  • MT-bench (human agreement reference)