SCORE: report accuracy ranges and consistency, not just one score

February 28, 2025 · 6 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Grigor Nalbandyan, Rima Shahbazyan, Evelina Bakhturina

Why It Matters For Business

Models that report a single accuracy number can fail unpredictably in production; measuring accuracy ranges and consistency reveals reliability risks before deployment.

Summary TLDR

SCORE is an open evaluation framework that measures how stable LLM answers are when prompts, answer order, or sampling seeds change. The paper runs multi-prompt, choice-order, and non-greedy-seed tests on three factual benchmarks (MMLU‑Pro, AGIEval, MATH) and many open models. Key takeaway: single-point accuracy hides real-world fragility. Accuracy can swing by up to 15.2% across prompts on MMLU‑Pro, and consistency rates are often well below 100%. The authors release code and a public robustness leaderboard.

Problem Statement

Standard LLM reports quote one tuned accuracy per dataset. That hides how much real outputs change when prompts are paraphrased, choices are reordered, or sampling seeds vary. Practitioners need a simple, repeatable way to measure both accuracy range and prediction consistency so they can judge reliability in production.

Main Contribution

Design and release SCORE, an open framework to measure non-adversarial robustness by repeating evaluations across prompts, choice orders, and random seeds.

Run SCORE on MMLU‑Pro, AGIEval, and MATH across many open LLMs and show that accuracy ranges and consistency rates vary widely.

Show that higher accuracy or larger model size does not always mean more stable predictions; publish code and a public leaderboard to track robustness.
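A minimal sketch of how a consistency rate could be computed from repeated runs. This summary does not give the paper's exact formula, so the definition used here (mean pairwise agreement between the answers a question receives across runs, averaged over questions) is an assumption, and the function name `consistency_rate` is hypothetical.

```python
from itertools import combinations


def consistency_rate(predictions_per_question):
    """Mean pairwise agreement across repeated runs, averaged over questions.

    predictions_per_question: list of lists; each inner list holds the answers
    one question received across runs (prompts, choice orders, or seeds).
    ASSUMPTION: pairwise agreement is one plausible definition of CR, not
    necessarily the one used by the SCORE authors.
    """
    per_question = []
    for preds in predictions_per_question:
        pairs = list(combinations(preds, 2))
        if not pairs:  # fewer than two runs: agreement is undefined, skip
            continue
        agree = sum(a == b for a, b in pairs)
        per_question.append(agree / len(pairs))
    if not per_question:
        return float("nan")
    return sum(per_question) / len(per_question)


# Toy data: two questions, three runs each.
runs = [
    ["A", "A", "A"],  # all three pairs agree
    ["B", "C", "B"],  # one of three pairs agrees
]
cr = consistency_rate(runs)
print(f"consistency rate: {cr:.3f}")
```

Note that this measure ignores the gold label on purpose: it asks only whether the model gives the same answer each time, which is why it can be low even when accuracy looks stable.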

Key Findings

Single reported accuracy hides instability.

Numbers: MMLU‑Pro accuracy range up to 15.2% across prompts

Prompt wording changes accuracy noticeably.

Numbers: the abstract cites accuracy drops of up to ~10% from prompt paraphrases; the largest observed range is 15.2%

Changing choice order shifts accuracy.

Numbers: Choice order causes 4–13.5% swings on MMLU‑Pro and 2–7.5% on AGIEval (up to 29.2% for Mistral‑12B)

Sampling randomness affects outputs even when accuracy holds.

Numbers: With non-greedy seeds, accuracy stays stable but the consistency rate (CR) can be low (e.g., Llama‑3.1 8B has a CR of 54.4% on MMLU‑Pro)

Model size is not a reliable proxy for stability.

Numbers: Mistral Large 123B and Llama‑3.1 70B show similar CR (~74%) despite size differences
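The choice-order finding above depends on presenting the same question with its options rearranged while keeping track of where the correct answer ends up. The sketch below shows one way to do that. Seeded random shuffling is an assumption; this summary does not say which permutation scheme the authors use, and `shuffle_choices` is a hypothetical helper.

```python
import random


def shuffle_choices(choices, answer_index, seed):
    """Return (shuffled_choices, new_answer_index) for one multiple-choice item.

    ASSUMPTION: a seeded random permutation; the paper may use a fixed set of
    orderings instead.
    """
    rng = random.Random(seed)          # local RNG keeps each variant reproducible
    order = list(range(len(choices)))  # order[k] = original index shown at position k
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_answer_index = order.index(answer_index)  # where the gold option landed
    return shuffled, new_answer_index


choices = ["Paris", "Rome", "Berlin", "Madrid"]
variants = [shuffle_choices(choices, 0, seed) for seed in range(3)]
for shuffled, gold_pos in variants:
    print(shuffled, "-> gold at position", gold_pos)
```

Scoring each variant against its remapped gold position, then comparing accuracy across variants, gives the kind of choice-order swing reported here.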

Results

Accuracy

Value: 77.0% [75.3, 77.9]

Aggregated consistency rate (Llama-3.1 405B)

Value: 87.3%

Accuracy range across prompts (MMLU‑Pro)

Value: up to 15.2%

Accuracy swing from choice order

Value: 4–13.5% on MMLU‑Pro; 2–7.5% on AGIEval

MATH maximum consistency

Value: 69.8% CR (prompt robustness)

What To Try In 7 Days

Run SCORE-style tests on your model: 8–10 prompt paraphrases, a few choice-order permutations, and 3–5 non-greedy seeds.

Report min/max accuracy and consistency rate alongside mean accuracy in internal model cards.

If consistency is low, add deterministic decoding or ensemble multiple responses for critical factual outputs.
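The reporting and ensembling steps above can be sketched as follows. This is an illustration under stated assumptions, not the SCORE codebase: the helper names (`accuracy_report`, `majority_vote`) are hypothetical, the data is a toy example, and plain majority voting is only one possible way to ensemble responses.

```python
from collections import Counter
from statistics import mean


def accuracy(preds, gold):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def accuracy_report(runs, gold):
    """Mean, min, and max accuracy over repeated runs (prompts, orders, or seeds)."""
    accs = [accuracy(run, gold) for run in runs]
    return {"mean": mean(accs), "min": min(accs), "max": max(accs)}


def majority_vote(runs):
    """Per-question most common answer across runs (ties go to the first seen)."""
    return [Counter(answers).most_common(1)[0][0] for answers in zip(*runs)]


# Toy data: four questions, three runs of the same model.
gold = ["A", "B", "C", "D"]
runs = [
    ["A", "B", "D", "A"],
    ["A", "C", "C", "D"],
    ["A", "B", "D", "D"],
]
report = accuracy_report(runs, gold)
ensemble = majority_vote(runs)
print(report)
print("ensemble accuracy:", accuracy(ensemble, gold))
```

Majority voting multiplies inference cost by the number of runs, so it is best reserved for outputs where a wrong answer is expensive.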

Optimization Features

Infra Optimization

  • uses NVIDIA A100 80GB nodes; models are converted to TRT-LLM (TensorRT-LLM)

Inference Optimization

  • converts models to TRT-LLM for faster evaluation

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmark limited to three factual datasets (MMLU‑Pro, AGIEval, MATH); creative or subjective tasks not covered.
  • Heavy focus on multiple-choice questions (MCQs) makes answer parsing and stability measurement easier but narrows scope.
  • Possible dataset contamination risks remain; authors note public datasets may be in training data.
  • Computational cost grows with more repeats and larger models.

When Not To Use

  • When you need adversarial robustness (SCORE uses non-adversarial perturbations).
  • For subjective or creative tasks where 'consistency' is ill-defined.
  • If you cannot afford repeated evaluations across prompts and seeds.

Failure Modes

  • Models can change predictions without score change (switching between incorrect answers).
  • Position bias in MCQs can inflate or deflate scores depending on option layout.
  • Sampling randomness (seed) can produce different answers even with similar accuracy.
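To make the first failure mode concrete, the sketch below compares two runs of the same model and counts how each prediction moved. The category names and the `flip_breakdown` helper are hypothetical and the data is a toy example; the point is only that two runs can have identical accuracy while most individual answers differ.

```python
from collections import Counter


def flip_breakdown(run_a, run_b, gold):
    """Count how predictions changed between two runs of the same questions."""
    counts = Counter()
    for a, b, g in zip(run_a, run_b, gold):
        if a == b:
            counts["unchanged"] += 1
        elif a == g:
            counts["correct_to_wrong"] += 1
        elif b == g:
            counts["wrong_to_correct"] += 1
        else:
            counts["wrong_to_other_wrong"] += 1  # invisible to accuracy alone
    return counts


def accuracy(preds, gold):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


gold = ["A", "B", "C", "D"]
run_a = ["A", "C", "C", "A"]
run_b = ["A", "D", "B", "D"]
flips = flip_breakdown(run_a, run_b, gold)
print(dict(flips))
print("accuracy:", accuracy(run_a, gold), "vs", accuracy(run_b, gold))
```

In this toy case both runs score 50%, yet three of the four predictions changed, which is the gap a consistency metric is meant to expose.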

Core Entities

Models

  • Llama-3.1-405B
  • Llama-3.1-70B
  • Llama-3.1-8B
  • Llama-3-70B
  • Mistral-Large-123B
  • Mistral-Nemo-12B
  • Qwen-2-72B
  • Qwen-2-7B
  • Yi-1.5-34B

Metrics

  • Accuracy
  • consistency rate (CR)

Datasets

  • MMLU-Pro
  • AGIEval
  • MATH

Benchmarks

  • SCORE
  • MMLU-Pro
  • AGIEval
  • MATH