SCORE: report accuracy ranges and consistency, not just one score

Overview

Decision Snapshot: Ready For Pilot

SCORE gives practical, repeatable checks (prompts, order, seeds) that teams can run quickly; evidence is strong on three factual datasets but not on creative tasks.

Citations: 0

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Grigor Nalbandyan, Rima Shahbazyan, Evelina Bakhturina

Links

Abstract / PDF / Code

Why It Matters For Business

Models that report a single accuracy number can fail unpredictably in production; measuring accuracy ranges and consistency reveals reliability risks before deployment.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Data Scientist, CTO

Summary TLDR

SCORE is an open evaluation framework that measures how stable LLM answers are when prompts, answer order, or sampling seeds change. The paper runs multi-prompt, choice-order, and non-greedy-seed tests on three factual benchmarks (MMLU‑Pro, AGIEval, MATH) and many open models. Key takeaway: single-point accuracy hides real-world fragility — accuracy can swing by several percentage points and consistency rates are often well below 100%. The authors release code and a public robustness leaderboard.

Problem Statement

Standard LLM reports quote one tuned accuracy per dataset. That hides how much real outputs change when prompts are paraphrased, choices are reordered, or sampling seeds vary. Practitioners need a simple, repeatable way to measure both accuracy range and prediction consistency so they can judge reliability in production.

Main Contribution

Design and release SCORE, an open framework to measure non-adversarial robustness by repeating evaluations across prompts, choice orders, and random seeds.

Run SCORE on MMLU‑Pro, AGIEval, and MATH across many open LLMs and show that accuracy ranges and consistency rates vary widely.

Key Findings

Single reported accuracy hides instability.

Numbers: MMLU‑Pro accuracy range up to 15.2% across prompts

Practical Use: Report accuracy ranges and a consistency metric; do not compare models using only one tuned number.

Evidence Ref: Section 4.1; Fig. 3; Table 4
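The two quantities named in this finding can be written down explicitly. The following is a sketch under stated assumptions, not the paper's own notation: there are N evaluation runs (different prompts, choice orders, or seeds) over a question set Q, the gold answer for question q is y_q, and the prediction in run r is ŷ_{q,r}. The pairwise-agreement form of the consistency rate is a common definition and is assumed here; the paper's exact formula may differ.

```latex
\[
\mathrm{Acc}_r = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\big[\hat{y}_{q,r} = y_q\big],
\qquad
\text{Range} = \Big[\min_{1 \le r \le N} \mathrm{Acc}_r,\ \max_{1 \le r \le N} \mathrm{Acc}_r\Big]
\]
\[
\mathrm{CR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\binom{N}{2}}
\sum_{1 \le i < j \le N} \mathbb{1}\big[\hat{y}_{q,i} = \hat{y}_{q,j}\big]
\]
```

Under this definition CR compares predictions only with each other, never with the gold answer, so it can fall below 1 even when every run reports the same accuracy.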

Prompt wording changes accuracy noticeably.

Numbers: Prompt paraphrases cause accuracy drops of up to ~10% (abstract); the observed accuracy range reaches 15.2% on MMLU‑Pro.

Practical Use: Validate models with multiple semantically equivalent prompts before deploying; prefer models with narrow accuracy ranges.

Evidence Ref: Abstract; Section 4.1; Fig. 3
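A multi-prompt check of this kind might be sketched as follows. This is a minimal illustration, not the SCORE implementation: the templates in `PROMPT_TEMPLATES` are hypothetical paraphrases (not the paper's prompts), `evaluate_across_prompts` is a made-up helper name, and `toy_model` is a stand-in that should be replaced by a real model call.

```python
PROMPT_TEMPLATES = [  # hypothetical paraphrases, not taken from the paper
    "Answer the following question.\n{question}\nAnswer:",
    "Read the question below and give the correct answer.\n{question}\nAnswer:",
    "{question}\nRespond with the final answer only.\nAnswer:",
]


def evaluate_across_prompts(model_fn, dataset, templates):
    """Return one accuracy per template; model_fn maps a prompt str to an answer str."""
    accuracies = []
    for template in templates:
        correct = 0
        for question, gold in dataset:
            prediction = model_fn(template.format(question=question))
            correct += prediction.strip() == gold
        accuracies.append(correct / len(dataset))
    return accuracies


def toy_model(prompt):
    """Stand-in model that is deliberately sensitive to prompt wording."""
    if "2 + 2" in prompt:
        return "4"
    return "Paris" if prompt.startswith("Answer") else "Lyon"


dataset = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
accs = evaluate_across_prompts(toy_model, dataset, PROMPT_TEMPLATES)
print(f"accuracy range: {min(accs):.2f} to {max(accs):.2f}")
```

Reporting `min(accs)` and `max(accs)` rather than the best template's score is the point of the exercise: the spread is what a single tuned number hides.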

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	77.0% [75.3, 77.9]	—	—	AGIEval (aggregated)	Table 7 aggregated results	Table 7
Aggregated consistency rate (Llama-3.1 405B)	87.3%	—	—	AGIEval (aggregated)	Table 7 aggregated results	Table 7

What To Try In 7 Days

Run SCORE-style tests on your model: 8–10 prompt paraphrases, a few choice-order permutations, and 3–5 non-greedy seeds.

Report min/max accuracy and consistency rate alongside mean accuracy in internal model cards.

If consistency is low, add deterministic decoding or ensemble multiple responses for critical factual outputs.
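The reporting step above could be computed along these lines. This is a sketch under assumptions: predictions are already parsed into comparable strings, each run covers the same questions in the same order, and the consistency rate is taken as mean pairwise agreement per question (the paper's exact definition may differ). The function names are illustrative and do not come from the SCORE code.

```python
from itertools import combinations
from statistics import mean


def accuracy(preds, gold):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def accuracy_range(runs, gold):
    """Return (min, mean, max) accuracy across runs."""
    accs = [accuracy(run, gold) for run in runs]
    return min(accs), mean(accs), max(accs)


def consistency_rate(runs):
    """Mean pairwise agreement of predictions per question (needs >= 2 runs).

    Correctness is ignored: only agreement between runs counts.
    """
    per_question = []
    for answers in zip(*runs):  # all predictions for one question
        pairs = list(combinations(answers, 2))
        per_question.append(sum(a == b for a, b in pairs) / len(pairs))
    return mean(per_question)


# Toy data: 3 runs (for example, 3 seeds) over 4 questions.
gold = ["A", "B", "C", "D"]
runs = [
    ["A", "B", "C", "A"],
    ["A", "B", "D", "B"],
    ["A", "C", "C", "C"],
]

lo, avg, hi = accuracy_range(runs, gold)
cr = consistency_rate(runs)
print(f"accuracy min={lo:.2f} mean={avg:.2f} max={hi:.2f}, CR={cr:.2f}")
```

In the toy data the last question is answered incorrectly in all three runs, each time with a different option, so it lowers the consistency rate without changing any run's accuracy on that question.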

Optimization Features

Infra Optimization

Uses NVIDIA A100 80GB nodes; models are converted to TRT-LLM.

Inference Optimization

Converts models to TRT-LLM for faster evaluation.

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/score

https://github.com/NVIDIA/TensorRT-LLM

Risks & Boundaries

Limitations

Benchmark limited to three factual datasets (MMLU‑Pro, AGIEval, MATH); creative or subjective tasks not covered.

Heavy focus on MCQs makes parsing and stability easier but narrows scope.

When Not To Use

When you need adversarial robustness (SCORE uses non-adversarial perturbations).

For subjective or creative tasks where 'consistency' is ill-defined.

Failure Modes

Models can change predictions without score change (switching between incorrect answers).

Position bias in MCQs can inflate or deflate scores depending on option layout.
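Position bias can be probed by re-asking the same question with the options in a different order while tracking where the correct answer lands. The sketch below uses a seeded random permutation, which is an assumption for illustration; the paper's own reordering scheme may differ (for example, placing the correct answer in each position in turn). `shuffle_choices` and `format_mcq` are hypothetical helper names.

```python
import random


def shuffle_choices(choices, answer_idx, rng):
    """Shuffle options and return (shuffled_choices, new_answer_idx)."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # order[j] is the original index now at position j, so the gold
    # answer sits wherever answer_idx appears in order.
    return shuffled, order.index(answer_idx)


def format_mcq(question, choices):
    """Render a question with lettered options (A, B, C, ...)."""
    lines = [question]
    lines += [f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices)]
    return "\n".join(lines)


rng = random.Random(0)  # fixed seed so the permutations are reproducible
question = "Which planet is closest to the Sun?"
choices = ["Venus", "Mercury", "Earth", "Mars"]
answer_idx = 1  # "Mercury"

variants = [shuffle_choices(choices, answer_idx, rng) for _ in range(3)]
for shuffled, new_idx in variants:
    print(format_mcq(question, shuffled), "->", chr(ord("A") + new_idx))
```

Scoring each variant against its remapped answer letter, then comparing accuracy across variants, separates what the model knows from which slot it prefers.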

Core Entities

Models

Llama-3.1-405B, Llama-3.1-70B, Llama-3.1-8B, Llama-3-70B, Mistral-Large-123B, Mistral-Nemo-12B, Qwen-2-72B, Qwen-2-7B, Yi-1.5-34B

Metrics

Accuracy, consistency rate (CR)

Datasets

MMLU-Pro, AGIEval, MATH

Benchmarks

SCORE, MMLU-Pro, AGIEval, MATH
