Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Automated updates let teams keep benchmarks fresh without heavy human labor, reduce false confidence from leaked test data, and tune question difficulty to better compare model releases.
Summary TLDR
The paper builds an automated pipeline to refresh open evaluation datasets using two LLM-based strategies: mimicking (generate similar unseen samples) and extending (generate new questions across Bloom cognitive levels). Experiments on MMLU and BIG-Bench with 11 models show the updates are stable across runs, reduce performance overestimation caused by benchmark leakage, and let you tune dataset difficulty. Human checks report high fluency and accuracy for generated items. The code/data release is not specified.
Problem Statement
Public benchmarks get leaked or become too easy as LLMs scale. Manually re-curating test sets is slow and costly. The paper asks: can we automatically update datasets to (1) stay unseen by models, (2) remain stable across regenerations, and (3) let evaluators control difficulty?
Main Contribution
Two automated dataset-update strategies: mimicking (make similar unseen variants) and extending (generate questions at different cognitive levels using Bloom's taxonomy).
Systematic experiments on updated MMLU and BIG-Bench showing update stability and reduced overestimation from leakage.
A controllable difficulty knob via cognitive levels and seed popularity to produce more discriminative test sets.
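The two strategies can be pictured as two prompt templates. The sketch below is illustrative only: the function names and prompt wording are hypothetical, not the paper's prompts. The level list is the six-level revised Bloom's taxonomy, and this summary does not say which levels the paper actually uses.

```python
# Hypothetical prompt builders for the two update strategies.
# The wording is an assumption for illustration, not the paper's prompts.

# Six levels of the revised Bloom's taxonomy, lowest to highest.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


def build_mimic_prompt(seed_question: str, seed_answer: str) -> str:
    """Ask for a new, unseen item that keeps the seed's style and difficulty."""
    return (
        "Write one new question that tests the same skill as the example and "
        "matches its style and difficulty, but is not a paraphrase of it.\n"
        f"Example question: {seed_question}\n"
        f"Example answer: {seed_answer}\n"
        "Return the new question and its answer."
    )


def build_extend_prompt(seed_question: str, level: str) -> str:
    """Ask for a new item on the seed's topic at a chosen cognitive level."""
    if level not in BLOOM_LEVELS:
        raise ValueError(f"unknown cognitive level: {level!r}")
    return (
        f"Using the topic of the example, write one new question at the "
        f"'{level}' level of Bloom's taxonomy.\n"
        f"Example question: {seed_question}\n"
        "Return the new question and its answer."
    )
```

Passing a higher level such as "evaluate" instead of "remember" is one way to picture the difficulty knob. The returned string would be sent to whichever generator model you use.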
Key Findings
Mimicked datasets produce consistent evaluation scores across multiple regenerations.
Mimicking helps reduce overestimation caused by training/test leakage.
Extending (Bloom-based) fixes cases where mimicking fails and enables difficulty control.
Generated samples pass human quality checks at high rates.
Results
Stability of mimicked updates
Human quality checks (mimic)
Human quality checks (extend)
Extended data contamination (leakage) check
Difficulty control effect
Who Should Care
What To Try In 7 Days
Generate a small mimicked variant of one internal test split and re-run CI to check for score shifts.
If leakage is suspected, run the extend pipeline on the affected subset and compare the score gain that fine-tuned models show on the original items versus the extended items.
Add a quick human spot-check (50 samples) for fluency and answer correctness before trusting new test results.
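The first and third steps can be sketched as two small helpers. This is a minimal sketch under assumptions: the function names are hypothetical, per-item correctness is assumed to be available as a list of booleans, and the sample size of 50 comes from the step above.

```python
import random


def spot_check_sample(items, k=50, seed=0):
    """Draw up to k items for human review, reproducibly."""
    rng = random.Random(seed)  # fixed seed so reviewers see the same sample
    return rng.sample(items, min(k, len(items)))


def score_shift(original_correct, updated_correct):
    """Accuracy on the updated split minus accuracy on the original split.

    Each argument is a list of booleans, one per test item.
    """
    if not original_correct or not updated_correct:
        raise ValueError("both splits need at least one item")
    original_acc = sum(original_correct) / len(original_correct)
    updated_acc = sum(updated_correct) / len(updated_correct)
    return updated_acc - original_acc
```

A large negative shift on the mimicked split is a reason to look more closely at possible leakage, not proof of it. The generated items may simply be harder or noisier, which is what the human spot-check is meant to catch.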
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Filtering out generated samples with incorrect answers can bias mimicked sample difficulty.
- Difficulty control is coarse; more granular difficulty metrics need research.
- Generation and evaluation rely on expensive LLM backbones, which adds time and cost to every update.
When Not To Use
- When you need fully human-crafted, high-stakes evaluation items (e.g., legal or regulated decisions).
- When domain expertise or external specialized knowledge must be embedded into each item.
- If you cannot afford LLM generation or manual spot-check costs.
Failure Modes
- Residual leakage if generation accidentally reproduces public training content.
- Judge bias when using the same model family to generate and to evaluate answers.
- Label-distribution artifacts introduced by generation prompts (e.g., uniform label ranges).
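Two of these failure modes can be screened cheaply before any model is evaluated. The sketch below is an assumption-laden illustration, not the paper's procedure: `label_shares` and `ngram_overlap` are hypothetical helpers, and the word n-gram overlap is a crude stand-in for a real similarity metric.

```python
from collections import Counter


def label_shares(labels):
    """Fraction of items carrying each answer label (e.g. 'A', 'B', 'C', 'D')."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def ngram_overlap(candidate, reference, n=3):
    """Fraction of the candidate's word n-grams that also occur in the reference.

    Values near 1.0 suggest the generated item largely copies the reference.
    """
    def ngrams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    candidate_grams = ngrams(candidate)
    if not candidate_grams:  # candidate shorter than n words
        return 0.0
    return len(candidate_grams & ngrams(reference)) / len(candidate_grams)
```

Comparing `label_shares` for the original and generated splits shows whether generation has skewed the answer distribution. Running `ngram_overlap` between each generated question and its seed flags near-copies for manual review.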
Core Entities
Models
- GPT-4
- ChatGPT (gpt-3.5-turbo)
- Claude-2
- Claude-3
- Gemini
- Llama-2-7b-chat
- Llama-2-13b-chat
- Llama-3-8b-Instruct
- Mistral-7B-Instruct
- Mixtral-8x7B
- Yi-6b-chat
- Yi-34b-chat
Metrics
- Exact Match (EM)
- Full-mark rate (LLM judgment)
- Accuracy
- Standard deviation across regenerations
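For reference, Exact Match and the spread across regenerations can be computed as below. This is a sketch under assumptions: the normalization (lowercasing and whitespace collapsing) and the choice of the sample standard deviation are illustrative and may differ from the paper's setup.

```python
import statistics


def normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def exact_match(predictions, references):
    """Share of predictions that equal their reference after normalization."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equal-length, non-empty lists")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def regeneration_spread(scores):
    """Mean and sample standard deviation of one model's scores across
    independently regenerated versions of the same dataset (needs >= 2 scores)."""
    return statistics.mean(scores), statistics.stdev(scores)
```

A small standard deviation relative to the gaps between models is what "stable across regenerations" means in practice.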
Datasets
- MMLU
- BIG-Bench
Benchmarks
- MMLU (selected tasks)
- BIG-Bench (selected tasks)
Context Entities
Models
- Other open-source and closed-source LLMs used for baselines and generation
Metrics
- METEOR (used as the similarity measure in contamination checks)
Datasets
- Public web corpora, IFT datasets for contamination checks (mentioned sources)
Benchmarks
- HellaSwag, ARC, CommonsenseQA (referenced in related work)

