Auto-update benchmarks with two LLM-driven strategies to reduce leakage and tune difficulty

Overview

Decision Snapshot: Needs Validation

Methods are practical and validated across many models, but cost, reliance on LLM backbones, and partial human checks limit immediate turnkey deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, Shuicheng Yan

Why It Matters For Business

Automated updates let teams keep benchmarks fresh without heavy human labor, reduce false confidence from leaked test data, and tune question difficulty to better compare model releases.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist

Summary TLDR

The paper builds an automated pipeline to refresh open evaluation datasets using two LLM-based strategies: mimicking (generate similar unseen samples) and extending (generate new questions across Bloom cognitive levels). Experiments on MMLU and BIG-Bench with 11 models show the updates are stable across runs, reduce performance overestimation caused by benchmark leakage, and let you tune dataset difficulty. Human checks report high fluency and accuracy for generated items. The code/data release is not specified.

Problem Statement

Public benchmarks get leaked or become too easy as LLMs scale. Manually re-curating test sets is slow and costly. The paper asks: can we automatically update datasets to (1) stay unseen by models, (2) remain stable across regenerations, and (3) let evaluators control difficulty?

Main Contribution

Two automated dataset-update strategies: mimicking (make similar unseen variants) and extending (generate questions at different cognitive levels using Bloom's taxonomy).

Systematic experiments on updated MMLU and BIG-Bench showing update stability and reduced overestimation from leakage.
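
The two strategies above can be sketched as prompt builders. This is a minimal illustration, not the paper's prompts: the function names, the wording of the templates, and the use of all six levels of the revised Bloom's taxonomy are assumptions, since this summary does not state which levels or prompts the authors use. No LLM call is made; the functions only return the text you would send to a generator model.

```python
# Hypothetical prompt builders for the two update strategies.
# The templates are illustrative; the paper's actual prompts are not shown in this summary.

# Six levels of the revised Bloom's taxonomy (assumed; the paper may use a subset).
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


def build_mimic_prompt(seed_question: str, seed_answer: str) -> str:
    """Mimicking: request a new, unseen item that matches the seed's style and topic."""
    return (
        "Write one new question in the same style and topic as the example, "
        "with different content, and give its answer.\n"
        f"Example question: {seed_question}\n"
        f"Example answer: {seed_answer}\n"
    )


def build_extend_prompt(seed_question: str, level: str) -> str:
    """Extending: request a question on the same topic at a chosen cognitive level."""
    if level not in BLOOM_LEVELS:
        raise ValueError(f"unknown Bloom level: {level!r}")
    return (
        f"Using the topic of the example, write one question at the '{level}' "
        "level of Bloom's taxonomy, and give a reference answer.\n"
        f"Example question: {seed_question}\n"
    )


mimic_prompt = build_mimic_prompt("What is the capital of France?", "Paris")
extend_prompt = build_extend_prompt("What is the capital of France?", "analyze")
```

Choosing higher levels such as "analyze" or "evaluate" is one plausible way to raise difficulty, which matches the summary's claim that extending lets evaluators tune how hard the dataset is.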

Key Findings

Mimicked datasets produce consistent evaluation scores across multiple regenerations.

Numbers: std dev 0–3% across four mimicked runs (zero-shot scores)

Practical Use: You can re-run mimicked updates repeatedly without changing model rankings; use mimicking when you need fast, low-cost refreshes.

Evidence Ref: Section 3.2; Table 12 & Table 13
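
To check this stability finding on your own regenerations, one option is to compute the per-model spread of scores across runs and confirm that the ranking does not change. The sketch below assumes population standard deviation (the summary does not say which variant the paper reports), and the model names and scores are made-up placeholders, not results from the paper.

```python
from statistics import pstdev


def score_spread(scores_by_model):
    """Population standard deviation of each model's scores across regeneration runs."""
    return {model: pstdev(runs) for model, runs in scores_by_model.items()}


def ranking(scores_by_model, run_index):
    """Model names ordered best-first for a single regeneration run."""
    return sorted(
        scores_by_model,
        key=lambda model: scores_by_model[model][run_index],
        reverse=True,
    )


def rankings_agree(scores_by_model):
    """True if every run orders the models the same way as the first run."""
    n_runs = len(next(iter(scores_by_model.values())))
    first = ranking(scores_by_model, 0)
    return all(ranking(scores_by_model, i) == first for i in range(1, n_runs))


# Illustrative numbers only, not results from the paper.
example = {
    "model_a": [71.0, 70.0, 72.0, 71.0],
    "model_b": [64.0, 65.0, 63.0, 64.0],
}
spread = score_spread(example)
stable = rankings_agree(example)
```

A small spread with unchanged rankings is the pattern the finding describes; a large spread would suggest the regenerated sets differ too much to be compared directly.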

Mimicking helps reduce overestimation caused by training/test leakage.

Numbers: original vs mimicked score differences avg ≈5%, max ≈18% on evaluated tasks

Practical Use: If a model was likely exposed to test items, evaluate it on mimicked sets to avoid inflated scores.

Evidence Ref: Section 3.2 & 3.3; Figures 2–3; Table 3
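
The original-versus-mimicked comparison above comes down to a per-task score gap. A minimal sketch, assuming you already have per-task scores for one model on both versions; the function names and the task names in the example are hypothetical, and the numbers are placeholders rather than figures from the paper.

```python
def leakage_gaps(original, mimicked):
    """Per-task gap (original minus mimicked) for tasks present in both score maps."""
    shared = original.keys() & mimicked.keys()
    return {task: original[task] - mimicked[task] for task in shared}


def summarize_gaps(gaps):
    """Return (mean gap, largest gap) across tasks."""
    if not gaps:
        raise ValueError("no shared tasks to compare")
    values = list(gaps.values())
    return sum(values) / len(values), max(values)


# Placeholder scores for a single model, in percentage points.
original_scores = {"task_a": 80.0, "task_b": 70.0}
mimicked_scores = {"task_a": 70.0, "task_b": 68.0}

gaps = leakage_gaps(original_scores, mimicked_scores)
mean_gap, max_gap = summarize_gaps(gaps)
```

A consistently positive gap is what one would expect if the original items leaked into training, though a gap alone does not prove leakage: the mimicked items could also simply be harder.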

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Stability of mimicked updates	std dev 0–3% across four generations (zero-shot)	original dataset single copy	—	Selected BIG-Bench & MMLU tasks	Section 3.2; Table 12 & Table 13	Table 12/13
Human quality checks (mimic)	Fluency 94.7% (agreement 95.7%); Coherence 94.4% (agreement 94.0%)	—	—	120 random mimicked samples	Section 2.4; Table 8	Table 8

What To Try In 7 Days

Generate a small mimicked variant of one internal test split and re-run CI to check for score shifts.

If leakage is suspected, run the extend pipeline on the affected subset and compare the score boost a fine-tuned model gets on the original subset versus the extended one.

Add a quick human spot-check (50 samples) for fluency and answer correctness before trusting new test results.
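
For the human spot-check step, a reproducible random draw keeps the reviewed sample auditable. This is a minimal sketch: the function name, the default seed, and the item format are assumptions, and the sample size of 50 comes from the step above.

```python
import random


def spot_check_sample(items, k=50, seed=0):
    """Draw up to k items for human review, reproducibly for a given seed."""
    if len(items) <= k:
        return list(items)
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return rng.sample(items, k)


# Placeholder generated items; in practice these would be question/answer records.
generated_items = [f"item-{i}" for i in range(200)]
review_batch = spot_check_sample(generated_items)
```

Recording the seed alongside the review results lets someone else re-draw the same batch later.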

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Filtering out generated samples with incorrect answers can bias mimicked sample difficulty.

Difficulty control is coarse; more granular difficulty metrics need research.

When Not To Use

When you need fully human-crafted, high-stakes evaluation items (e.g., legal or regulated decisions).

When domain expertise or external specialized knowledge must be embedded into each item.

Failure Modes

Residual leakage if generation accidentally reproduces public training content.

Judge bias when using the same model family to generate and to evaluate answers.
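
One way to screen for the residual-leakage failure mode is to flag generated items that share too many word n-grams with known public text. The sketch below uses plain trigram overlap as a simple stand-in; it is not the similarity measure used in the paper, and the function names and the 0.5 threshold are assumptions to be tuned on your own data.

```python
def ngrams(text, n=3):
    """Set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(generated, reference, n=3):
    """Fraction of the generated item's n-grams that also appear in the reference."""
    generated_grams = ngrams(generated, n)
    if not generated_grams:
        return 0.0
    return len(generated_grams & ngrams(reference, n)) / len(generated_grams)


def looks_reproduced(generated, reference, threshold=0.5, n=3):
    """Flag an item whose overlap with the reference meets an (assumed) threshold."""
    return overlap_ratio(generated, reference, n) >= threshold


seed = "Which planet in the solar system has the largest mass"
copy_ratio = overlap_ratio(seed, seed)
fresh_ratio = overlap_ratio("Name the gas that plants absorb during photosynthesis", seed)
```

Flagged items would still need a human look, since high overlap can also come from unavoidable domain phrasing.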

Core Entities

Models

GPT-4, ChatGPT (gpt-3.5-turbo), Claude-2, Claude-3, Gemini, Llama-2-7b-chat, Llama-2-13b-chat, Llama-3-8b-Instruction, Mistral-7B-Instruct, Mixtral-8x7B, Yi-6b-chat, Yi-34b-chat

Metrics

Exact Match (EM), Full-mark rate (LLM judgment), Accuracy, Standard deviation across regenerations

Datasets

MMLU, BIG-Bench

Benchmarks

MMLU (selected tasks), BIG-Bench (selected tasks)

Context Entities

Models

Other open-source and closed-source LLMs used for baselines and generation

Metrics

Meteor (used for contamination similarity)

Datasets

Public web corpora and instruction fine-tuning (IFT) datasets, mentioned as sources for contamination checks

Benchmarks

HellaSwag, ARC, CommonsenseQA (referenced in related work)
