Auto-update benchmarks with two LLM-driven strategies to reduce leakage and tune difficulty

February 19, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, Shuicheng Yan

Why It Matters For Business

Automated updates let teams keep benchmarks fresh without heavy human labor, reduce false confidence from leaked test data, and tune question difficulty to better compare model releases.

Summary TLDR

The paper builds an automated pipeline to refresh open evaluation datasets using two LLM-based strategies: mimicking (generate similar unseen samples) and extending (generate new questions across Bloom cognitive levels). Experiments on MMLU and BIG-Bench with 11 models show the updates are stable across runs, reduce performance overestimation caused by benchmark leakage, and let you tune dataset difficulty. Human checks report high fluency and accuracy for generated items. The code/data release is not specified.

Problem Statement

Public benchmarks get leaked or become too easy as LLMs scale. Manually re-curating test sets is slow and costly. The paper asks: can we automatically update datasets to (1) stay unseen by models, (2) remain stable across regenerations, and (3) let evaluators control difficulty?

Main Contribution

Two automated dataset-update strategies: mimicking (make similar unseen variants) and extending (generate questions at different cognitive levels using Bloom's taxonomy).

Systematic experiments on updated MMLU and BIG-Bench showing update stability and reduced overestimation from leakage.

A controllable difficulty knob via cognitive levels and seed popularity to produce more discriminative test sets.
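The two strategies can be pictured as prompt builders. The sketch below is illustrative only and does not reproduce the paper's prompts: the function names, the prompt wording, and the choice of the six levels of the revised Bloom's taxonomy are all assumptions.

```python
# Hypothetical prompt builders. Wording, names, and the level list are
# illustrative assumptions, not the paper's actual prompts.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


def build_mimic_prompt(seed_question: str, seed_answer: str) -> str:
    """Ask for a new item that keeps the seed's skill and format but changes content."""
    return (
        "Write one new question that tests the same skill and follows the same "
        "format as the example, but does not reuse its wording or entities.\n"
        f"Example question: {seed_question}\n"
        f"Example answer: {seed_answer}\n"
        "Return the new question and its answer."
    )


def build_extend_prompts(seed_question: str) -> dict[str, str]:
    """Build one prompt per cognitive level, all anchored on the same seed."""
    return {
        level: (
            "Using the topic of the seed question, write one question at the "
            f"'{level}' level of Bloom's taxonomy, with a reference answer.\n"
            f"Seed question: {seed_question}"
        )
        for level in BLOOM_LEVELS
    }
```

Keeping the level as an explicit key makes it straightforward to report scores per cognitive level later, which is what the difficulty knob relies on.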

Key Findings

Mimicked datasets produce consistent evaluation scores across multiple regenerations.

Numbers: std dev 0–3% across four mimicked runs (zero-shot scores)
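As a rough illustration of how such a stability figure could be computed, the sketch below takes one model's scores on several regenerated test sets and reports their standard deviation. The scores are made-up placeholders, and the use of the population (rather than sample) standard deviation is an assumption; the summary does not say which one the paper uses.

```python
from statistics import pstdev

# Hypothetical zero-shot accuracies (%) for one model on four mimicked regenerations.
runs = [61.2, 60.4, 62.0, 60.8]


def regeneration_spread(scores: list[float]) -> float:
    """Population standard deviation of scores across regenerated test sets."""
    return pstdev(scores)


spread = regeneration_spread(runs)
print(f"std dev across {len(runs)} runs: {spread:.2f} points")
```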

Mimicking helps reduce overestimation caused by training/test leakage.

Numbers: original vs mimicked score differences average ≈5%, max ≈18% on evaluated tasks
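One way to read these numbers is as a per-task gap between the score on the original split and the score on its mimicked counterpart. The sketch below assumes that reading; the task names and accuracies are placeholders, not results from the paper.

```python
# Hypothetical per-task accuracies (%); names and numbers are placeholders.
original = {"task_a": 72.0, "task_b": 65.5, "task_c": 80.0}
mimicked = {"task_a": 68.0, "task_b": 64.5, "task_c": 70.0}


def score_gaps(orig: dict[str, float], mimic: dict[str, float]) -> dict[str, float]:
    """Per-task (original - mimicked) gap for tasks present in both score tables."""
    return {task: orig[task] - mimic[task] for task in orig if task in mimic}


gaps = score_gaps(original, mimicked)
avg_gap = sum(gaps.values()) / len(gaps)
worst_task = max(gaps, key=gaps.get)  # task with the largest apparent overestimation
print(f"average gap: {avg_gap:.1f} points, largest on {worst_task}")
```

A large positive gap on a task is a hint, not proof, that the original split may have been seen during training.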

Extending (Bloom-based) fixes cases where mimicking fails and enables difficulty control.

Numbers: extended splits show large performance spread (Sports 23.76%, Phys 14.04% across cognitive levels)
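The summary does not define "performance spread" precisely. A minimal sketch, assuming it means the difference between the best and worst accuracy across cognitive levels for one task, with placeholder scores:

```python
# Hypothetical accuracies (%) per cognitive level for a single task.
by_level = {"remember": 82.0, "understand": 78.5, "apply": 70.0, "analyze": 61.5}


def performance_spread(level_scores: dict[str, float]) -> float:
    """Max minus min accuracy across levels (an assumed definition of 'spread')."""
    return max(level_scores.values()) - min(level_scores.values())


print(f"spread across levels: {performance_spread(by_level):.2f} points")
```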

Generated samples pass human quality checks at high rates.

Numbers: mimic fluency 94.7% (agreement 95.7%); extend category accuracy 98.3% and evaluation consistency 90.8%

Results

Stability of mimicked updates

Value: std dev 0–3% across four generations (zero-shot)

Baseline: original dataset, single copy

Human quality checks (mimic)

Value: Fluency 94.7% (agreement 95.7%); Coherence 94.4% (agreement 94.0%)

Human quality checks (extend)

Value: Fluency 98.3% / Category accuracy 98.3% / Eval consistency 90.8%

Extended data contamination (leakage) check

Value: clean rates of Algebra 154/154, Algos 320/320, Phys 79/80, Sports 824/824
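A clean rate of this kind can be computed by flagging generated items that look too similar to known public text. The sketch below is a simplified stand-in: this page lists Meteor as the contamination similarity metric, whereas the code uses a token-level `difflib` ratio from the Python standard library so that it stays self-contained. The 0.8 threshold, the function names, and the example texts are assumptions.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1]; a stand-in for Meteor, not Meteor itself."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def is_clean(item: str, corpus: list[str], threshold: float = 0.8) -> bool:
    """An item is clean if no reference text reaches the (assumed) threshold."""
    return all(token_similarity(item, doc) < threshold for doc in corpus)


def clean_rate(items: list[str], corpus: list[str], threshold: float = 0.8) -> tuple[int, int]:
    """Return (number of clean items, total items)."""
    clean = sum(is_clean(item, corpus, threshold) for item in items)
    return clean, len(items)


# Hypothetical reference corpus and generated items.
corpus = ["what is the capital of france"]
items = ["What is the capital of France", "Name the longest river in Egypt"]
clean, total = clean_rate(items, corpus)
print(f"clean rate: {clean}/{total} = {100.0 * clean / total:.2f}%")
```

For the reported Phys split, 79 clean items out of 80 corresponds to a clean rate of 98.75%.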

Difficulty control effect

Value: performance spread across cognitive levels: Sports 23.76%, Phys 14.04%

Baseline: original (low spread)

Who Should Care

What To Try In 7 Days

Generate a small mimicked variant of one internal test split and re-run CI to check for score shifts.

If leakage is suspected, run the extend pipeline on the affected subset and compare how much fine-tuning raises scores on the original items versus the extended ones.

Add a quick human spot-check (50 samples) for fluency and answer correctness before trusting new test results.
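For the spot-check step, a reproducible random sample keeps reviewers from cherry-picking. A minimal sketch, assuming generated items are held in a list; the function name and the fixed seed are illustrative choices:

```python
import random


def sample_for_review(items: list, k: int = 50, seed: int = 0) -> list:
    """Draw up to k items without replacement, reproducibly for a given seed."""
    if len(items) <= k:
        return list(items)  # too few items to sample: review all of them
    rng = random.Random(seed)  # local generator, leaves global random state untouched
    return rng.sample(items, k)


# Hypothetical pool of 200 generated item IDs.
pool = list(range(200))
to_review = sample_for_review(pool)
print(f"{len(to_review)} items selected for human review")
```

Recording the seed alongside the review results lets someone else re-draw the same 50 items later.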

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Filtering out generated samples with incorrect answers can bias mimicked sample difficulty.
  • Difficulty control is coarse; more granular difficulty metrics need research.
  • Generation and evaluation rely on expensive LLM backbones, incurring time and cost limits.

When Not To Use

  • When you need fully human-crafted, high-stakes evaluation items (e.g., legal or regulated decisions).
  • When domain expertise or external specialized knowledge must be embedded into each item.
  • If you cannot afford LLM generation or manual spot-check costs.

Failure Modes

  • Residual leakage if generation accidentally reproduces public training content.
  • Judge bias when using the same model family to generate and to evaluate answers.
  • Label-distribution artifacts introduced by generation prompts (e.g., uniform label ranges).
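The label-distribution failure mode can be monitored with a simple drift check between the answer labels of the original split and those of the generated one. The sketch below uses total variation distance; the metric choice, the function names, and the example labels are assumptions, not something the paper prescribes.

```python
from collections import Counter


def label_distribution(labels: list[str]) -> dict[str, float]:
    """Relative frequency of each answer label."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0 means identical, 1 means disjoint distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical gold labels for multiple-choice items.
original_labels = ["A", "B", "B", "C", "D", "B", "A", "C"]
generated_labels = ["A", "A", "A", "A", "B", "C", "D", "A"]

drift = total_variation(
    label_distribution(original_labels), label_distribution(generated_labels)
)
print(f"label drift (total variation): {drift:.3f}")
```

A high drift value suggests the generation prompt is steering answers toward particular labels and that scores on the new split may partly reflect that artifact.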

Core Entities

Models

  • GPT-4
  • ChatGPT (gpt-3.5-turbo)
  • Claude-2
  • Claude-3
  • Gemini
  • Llama-2-7b-chat
  • Llama-2-13b-chat
  • Llama-3-8b-Instruct
  • Mistral-7B-Instruct
  • Mixtral-8x7B
  • Yi-6b-chat
  • Yi-34b-chat

Metrics

  • Exact Match (EM)
  • Full-mark rate (LLM judgment)
  • Accuracy
  • Standard deviation across regenerations

Datasets

  • MMLU
  • BIG-Bench

Benchmarks

  • MMLU (selected tasks)
  • BIG-Bench (selected tasks)

Context Entities

Models

  • Other open-source and closed-source LLMs used for baselines and generation

Metrics

  • Meteor (used for contamination similarity)

Datasets

  • Public web corpora, IFT datasets for contamination checks (mentioned sources)

Benchmarks

  • HellaSwag, ARC, CommonsenseQA (referenced in related work)