Auto-update benchmarks with two LLM-driven strategies to reduce leakage and tune difficulty

February 19, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Methods are practical and validated across many models, but cost, reliance on LLM backbones, and partial human checks limit immediate turnkey deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, Shuicheng Yan

Why It Matters For Business

Automated updates let teams keep benchmarks fresh without heavy human labor, reduce false confidence from leaked test data, and tune question difficulty to better compare model releases.

Summary TLDR

The paper builds an automated pipeline to refresh open evaluation datasets using two LLM-based strategies: mimicking (generate similar unseen samples) and extending (generate new questions across Bloom cognitive levels). Experiments on MMLU and BIG-Bench with 11 models show the updates are stable across runs, reduce performance overestimation caused by benchmark leakage, and let you tune dataset difficulty. Human checks report high fluency and accuracy for generated items. The code/data release is not specified.

Problem Statement

Public benchmarks get leaked or become too easy as LLMs scale. Manually re-curating test sets is slow and costly. The paper asks: can we automatically update datasets to (1) stay unseen by models, (2) remain stable across regenerations, and (3) let evaluators control difficulty?

Main Contribution

Two automated dataset-update strategies: mimicking (make similar unseen variants) and extending (generate questions at different cognitive levels using Bloom's taxonomy).

Systematic experiments on updated MMLU and BIG-Bench showing update stability and reduced overestimation from leakage.
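The two strategies can be pictured as two prompt builders that share a seed sample. The following is a minimal sketch, not the paper's prompts: `Sample`, `build_mimic_prompt`, `build_extend_prompt`, and the prompt wording are all hypothetical, and the level list is the standard revised Bloom's taxonomy (this summary does not say which levels the paper actually uses).

```python
from dataclasses import dataclass

# Levels of the revised Bloom's taxonomy, lowest to highest.
# Assumption: the summary does not state which of these the paper uses.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


@dataclass
class Sample:
    question: str
    answer: str


def build_mimic_prompt(seed: Sample) -> str:
    """Ask for a new, unseen question in the same style and difficulty as the seed."""
    return (
        "Write one new question that matches the style, topic, and difficulty "
        "of the example, but is not a paraphrase of it. Include the answer.\n\n"
        f"Example question: {seed.question}\n"
        f"Example answer: {seed.answer}\n"
    )


def build_extend_prompt(seed: Sample, level: str) -> str:
    """Ask for a new question on the seed's topic at one cognitive level."""
    if level not in BLOOM_LEVELS:
        raise ValueError(f"unknown Bloom level: {level!r}")
    return (
        "Using the example as source material, write one new question that "
        f"targets the '{level}' level of Bloom's taxonomy. Include the answer.\n\n"
        f"Example question: {seed.question}\n"
        f"Example answer: {seed.answer}\n"
    )


seed = Sample("What is the boiling point of water at sea level in Celsius?", "100")
# One mimic prompt plus one extend prompt per cognitive level.
prompts = [build_mimic_prompt(seed)] + [build_extend_prompt(seed, lv) for lv in BLOOM_LEVELS]
print(len(prompts))
```

Each prompt would then be sent to whichever LLM backbone the team uses for generation; that call is left out because the summary does not specify it.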

Key Findings

Mimicked datasets produce consistent evaluation scores across multiple regenerations.

Numbers: std dev 03% across four mimicked runs (zero-shot scores)

Practical Use: You can re-run mimicked updates repeatedly without changing model rankings; use mimicking when you need fast, low-cost refreshes.

Evidence Ref: Section 3.2; Table 12 & Table 13
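A stability check of this kind could be reproduced on your own regenerations roughly as follows. This is a sketch under stated assumptions: the model names and scores are made-up illustration data, and the population standard deviation (`pstdev`) is one reasonable choice; the summary does not say which estimator the paper uses.

```python
from statistics import pstdev

# Hypothetical zero-shot accuracy (%) for three models on four regenerations
# of the same mimicked dataset. These numbers are illustrative only.
scores = {
    "model_a": [71.2, 70.5, 72.0, 71.1],
    "model_b": [64.8, 65.9, 64.1, 65.0],
    "model_c": [58.3, 57.6, 59.0, 58.1],
}
num_runs = 4


def ranking(run_index: int) -> list[str]:
    """Models ordered best to worst on one regeneration."""
    return sorted(scores, key=lambda m: scores[m][run_index], reverse=True)


# Spread of each model's score across regenerations.
spread = {model: pstdev(vals) for model, vals in scores.items()}

# Rankings are stable if every regeneration orders the models identically.
rankings = [ranking(i) for i in range(num_runs)]
rank_stable = all(r == rankings[0] for r in rankings)

for model, sd in spread.items():
    print(f"{model}: std dev {sd:.2f} points")
print("rankings stable:", rank_stable)
```

Checking the ranking as well as the spread matters because a small standard deviation can still flip the order of two closely matched models.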

Mimicking helps reduce overestimation caused by training/test leakage.

Numbers: original vs mimicked score differences avg ≈5%, max ≈18% on evaluated tasks

Practical Use: If a model was likely exposed to test items, evaluate it on mimicked sets to avoid inflated scores.

Evidence Ref: Section 3.2 & 3.3; Figures 2–3; Table 3
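The original-versus-mimicked comparison reduces to a per-task score gap. A minimal sketch, assuming you already have accuracy per task on both versions; the task names, scores, and the 5-point flag threshold are hypothetical.

```python
# Hypothetical per-task accuracy (%) for one model; illustrative numbers only.
original = {"task_1": 82.0, "task_2": 75.5, "task_3": 68.0}
mimicked = {"task_1": 78.5, "task_2": 74.0, "task_3": 59.5}

# Positive gap: the model scores higher on the original (possibly leaked) items.
gaps = {task: original[task] - mimicked[task] for task in original}

avg_gap = sum(gaps.values()) / len(gaps)
worst_task = max(gaps, key=gaps.get)

# Assumed threshold for flagging tasks worth a closer look.
FLAG_THRESHOLD = 5.0
flagged = [task for task, gap in gaps.items() if gap > FLAG_THRESHOLD]

print(f"average gap: {avg_gap:.1f} points")
print(f"largest gap: {worst_task} ({gaps[worst_task]:.1f} points)")
print("flagged tasks:", flagged)
```

A large gap is a signal rather than proof of leakage, since a mimicked set can also differ slightly in difficulty.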

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Stability of mimicked updates | std dev 03% across four generations (zero-shot) | original dataset single copy | | Selected BIG-Bench & MMLU tasks | Section 3.2; Table 12 & Table 13 | Table 12/13
Human quality checks (mimic) | Fluency 94.7% (agreement 95.7%); Coherence 94.4% (agreement 94.0%) | | | 120 random mimicked samples | Section 2.4; Table 8 | Table 8

What To Try In 7 Days

Generate a small mimicked variant of one internal test split and re-run CI to check for score shifts.

If leakage is suspected, run the extend pipeline on the troubled subset and compare how much fine-tuning boosts scores on the original items versus the extended ones.

Add a quick human spot-check (50 samples) for fluency and answer correctness before trusting new test results.
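The spot-check step can be made repeatable by drawing the review sample with a fixed seed. A minimal sketch; `draw_spot_check`, `pass_rate`, and the placeholder item names are hypothetical.

```python
import random


def draw_spot_check(items, k=50, seed=0):
    """Draw up to k items without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def pass_rate(labels):
    """Share of reviewed items a human marked acceptable (True)."""
    return sum(labels) / len(labels) if labels else 0.0


# Placeholder stand-ins for generated test items.
generated_items = [f"item_{i}" for i in range(200)]
to_review = draw_spot_check(generated_items, k=50, seed=0)
print(len(to_review))

# After review, record one boolean per item for each criterion,
# for example fluency and answer correctness, and compute pass_rate on each.
```

Fixing the seed lets a second reviewer pull the identical 50 items, which is what makes an agreement figure possible.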

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Filtering out generated samples with incorrect answers can bias mimicked sample difficulty.

Difficulty control is coarse; more granular difficulty metrics need research.

When Not To Use

When you need fully human-crafted, high-stakes evaluation items (e.g., legal or regulated decisions).

When domain expertise or external specialized knowledge must be embedded into each item.

Failure Modes

Residual leakage if generation accidentally reproduces public training content.

Judge bias when using the same model family to generate and to evaluate answers.
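One way to screen for the residual-leakage failure mode is to flag generated items that share too many word n-grams with known public items. This sketch uses plain trigram overlap, which is a swapped-in stand-in and not the METEOR-based similarity listed under the context metrics; the function names and the 0.5 threshold are assumptions.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(generated: str, reference: str, n: int = 3) -> float:
    """Fraction of the generated item's n-grams that also occur in the reference."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & ngrams(reference, n)) / len(gen)


def flag_near_copies(generated_items, reference_items, threshold=0.5, n=3):
    """Generated items whose overlap with any reference item meets the threshold."""
    return [
        g for g in generated_items
        if any(overlap(g, r, n) >= threshold for r in reference_items)
    ]


reference_items = ["What is the capital of France?"]
generated_items = [
    "What is the capital of France?",
    "Which river flows through Paris?",
]
print(flag_near_copies(generated_items, reference_items))
```

Surface overlap only catches near-verbatim reproduction; a paraphrased leak would pass this filter and needs a semantic similarity measure instead.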

Core Entities

Models

GPT-4, ChatGPT (gpt-3.5-turbo), Claude-2, Claude-3, Gemini, Llama-2-7b-chat, Llama-2-13b-chat, Llama-3-8b-Instruction, Mistral-7B-Instruct, Mixtral-8x7B, Yi-6b-chat, Yi-34b-chat

Metrics

Exact Match (EM), Full-mark rate (LLM judgment), Accuracy, Standard deviation across regenerations

Datasets

MMLU, BIG-Bench

Benchmarks

MMLU (selected tasks), BIG-Bench (selected tasks)

Context Entities

Models

Other open-source and closed-source LLMs used for baselines and generation

Metrics

METEOR (used for contamination similarity)

Datasets

Public web corpora, IFT datasets for contamination checks (mentioned sources)

Benchmarks

HellaSwag, ARC, CommonsenseQA (referenced in related work)