SCORE: report accuracy ranges and consistency, not just one score

February 28, 2025 · 6 min

Overview

Decision Snapshot: Ready For Pilot

SCORE gives practical, repeatable checks (prompts, order, seeds) that teams can run quickly; evidence is strong on three factual datasets but not on creative tasks.

Citations: 0

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Grigor Nalbandyan, Rima Shahbazyan, Evelina Bakhturina

Links

Abstract / PDF / Code

Why It Matters For Business

Models that report a single accuracy number can fail unpredictably in production; measuring accuracy ranges and consistency reveals reliability risks before deployment.

Who Should Care

Summary TLDR

SCORE is an open evaluation framework that measures how stable LLM answers are when prompts, answer order, or sampling seeds change. The paper runs multi-prompt, choice-order, and non-greedy-seed tests on three factual benchmarks (MMLU‑Pro, AGIEval, MATH) across many open models. Key takeaway: single-point accuracy hides real-world fragility. Accuracy can swing by several percentage points, and consistency rates are often well below 100%. The authors release code and a public robustness leaderboard.

Problem Statement

Standard LLM reports quote one tuned accuracy per dataset. That hides how much real outputs change when prompts are paraphrased, choices are reordered, or sampling seeds vary. Practitioners need a simple, repeatable way to measure both accuracy range and prediction consistency so they can judge reliability in production.

Main Contribution

Design and release SCORE, an open framework to measure non-adversarial robustness by repeating evaluations across prompts, choice orders, and random seeds.

Run SCORE on MMLU‑Pro, AGIEval, and MATH across many open LLMs and show that accuracy ranges and consistency rates vary widely.
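This summary does not spell out how the consistency rate is computed. A minimal sketch, assuming it is the share of agreeing prediction pairs per question, averaged over questions; the function name and input layout are illustrative and not taken from the SCORE code:

```python
from itertools import combinations


def consistency_rate(predictions_per_question):
    """Mean pairwise agreement across repeated runs.

    predictions_per_question: list of lists; each inner list holds the
    answers one model gave to the same question under different prompts,
    choice orders, or seeds. Assumed definition, not confirmed by the source.
    """
    per_question = []
    for preds in predictions_per_question:
        pairs = list(combinations(preds, 2))
        if not pairs:  # a single run has nothing to compare against
            continue
        agree = sum(1 for a, b in pairs if a == b)
        per_question.append(agree / len(pairs))
    if not per_question:
        raise ValueError("need at least one question with two or more runs")
    return sum(per_question) / len(per_question)


runs = [
    ["A", "A", "A"],  # fully consistent: 3 of 3 pairs agree
    ["B", "C", "B"],  # 1 of 3 pairs agrees
]
print(round(consistency_rate(runs), 3))  # 0.667
```

Under this definition a model scores 1.0 only if every run gives the same answer to every question, whether or not that answer is correct.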

Key Findings

Single reported accuracy hides instability.

Numbers: MMLU‑Pro accuracy range up to 15.2% across prompts

Practical Use: Report accuracy ranges and a consistency metric; do not compare models using only one tuned number.

Evidence Ref: Section 4.1; Fig. 3; Table 4

Prompt wording changes accuracy noticeably.

Numbers: Prompt paraphrases cause up to ~10% accuracy drop (per the abstract); up to 15.2% observed

Practical Use: Validate models with multiple semantically equivalent prompts before deploying; prefer models with narrow accuracy ranges.

Evidence Ref: Abstract; Section 4.1; Fig. 3
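Reporting a range over paraphrased prompts can be sketched as follows. The data layout (one prediction list per prompt variant) and the helper name `accuracy_range` are assumptions for illustration, and the toy numbers are made up rather than taken from the paper:

```python
def accuracy_range(preds_by_prompt, gold):
    """Return (mean, min, max) accuracy over prompt variants.

    preds_by_prompt: dict mapping a prompt id to a list of predicted
    answers aligned with `gold`.
    """
    if not preds_by_prompt:
        raise ValueError("no prompt variants given")
    accs = []
    for prompt_id, preds in preds_by_prompt.items():
        if len(preds) != len(gold):
            raise ValueError(f"length mismatch for prompt {prompt_id!r}")
        correct = sum(p == g for p, g in zip(preds, gold))
        accs.append(correct / len(gold))
    return sum(accs) / len(accs), min(accs), max(accs)


gold = ["A", "C", "B", "D"]
preds_by_prompt = {
    "p0": ["A", "C", "B", "D"],  # 4/4 correct
    "p1": ["A", "C", "A", "D"],  # 3/4 correct
    "p2": ["A", "B", "A", "D"],  # 2/4 correct
}
mean_acc, lo, hi = accuracy_range(preds_by_prompt, gold)
print(f"accuracy {mean_acc:.1%} [{lo:.1%}, {hi:.1%}]")
# accuracy 75.0% [50.0%, 100.0%]
```

Printing the mean together with the bracketed min and max keeps the spread visible wherever the score is quoted.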

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 77.0% [75.3, 77.9] | - | - | AGIEval (aggregated) | Table 7 aggregated results | Table 7
Aggregated consistency rate (Llama-3.1 405B) | 87.3% | - | - | AGIEval (aggregated) | Table 7 aggregated results | Table 7

What To Try In 7 Days

Run SCORE-style tests on your model: 8–10 prompt paraphrases, a few choice-order permutations, and 3–5 non-greedy seeds.

Report min/max accuracy and consistency rate alongside mean accuracy in internal model cards.

If consistency is low, add deterministic decoding or ensemble multiple responses for critical factual outputs.
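For the last step, one common way to ensemble multiple responses is a majority vote over sampled answers. The sketch below assumes answers are already parsed to short strings; `majority_vote` is a hypothetical helper, and the tie-breaking rule (the first answer seen wins) is a design choice rather than something the paper prescribes:

```python
from collections import Counter


def majority_vote(answers):
    """Return (winner, agreement); agreement is the winner's share of votes."""
    if not answers:
        raise ValueError("no answers to vote on")
    counts = Counter(answers)
    # Counter.most_common orders equal counts by first occurrence,
    # so ties go to the answer that was seen first.
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


samples = ["B", "B", "C", "B", "A"]  # e.g. answers from five non-greedy seeds
answer, agreement = majority_vote(samples)
print(answer, agreement)  # B 0.6
```

The agreement share can double as a cheap per-question reliability signal: a low value suggests the output may deserve review before it is used.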

Optimization Features

Infra Optimization: uses NVIDIA A100 80GB nodes; TRT-LLM conversion
Inference Optimization: converts models to TRT-LLM for faster evaluation

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmark limited to three factual datasets (MMLU‑Pro, AGIEval, MATH); creative or subjective tasks not covered.

Heavy focus on multiple-choice questions (MCQs) simplifies answer parsing and stability measurement but narrows the scope.

When Not To Use

When you need adversarial robustness (SCORE uses non-adversarial perturbations).

For subjective or creative tasks where 'consistency' is ill-defined.

Failure Modes

Models can change predictions without changing the score (by switching between incorrect answers).

Position bias in MCQs can inflate or deflate scores depending on option layout.
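One way to probe position effects is to shuffle the options reproducibly and track the gold answer by its content rather than its letter. A minimal sketch, assuming options are plain strings; `shuffle_choices` and the single-letter labels are illustrative and not part of the SCORE code:

```python
import random
import string


def shuffle_choices(options, gold_index, seed):
    """Shuffle MCQ options reproducibly and return the new gold letter.

    Returns (labeled_options, gold_letter), where labeled_options is a
    list of (letter, text) pairs in the shuffled order.
    """
    if len(options) > len(string.ascii_uppercase):
        raise ValueError("too many options for single-letter labels")
    rng = random.Random(seed)  # local RNG, global random state is untouched
    order = list(range(len(options)))
    rng.shuffle(order)
    labeled = [
        (string.ascii_uppercase[pos], options[src])
        for pos, src in enumerate(order)
    ]
    # The gold option now sits wherever its original index landed.
    gold_letter = string.ascii_uppercase[order.index(gold_index)]
    return labeled, gold_letter


options = ["Paris", "Rome", "Madrid", "Berlin"]
labeled, gold_letter = shuffle_choices(options, gold_index=0, seed=1)
# The gold letter still points at the original correct text.
assert dict(labeled)[gold_letter] == "Paris"
```

Mapping a predicted letter back to its option text before comparing runs lets a model that picks the same content in a different position count as consistent, and comparing predictions directly (not only scores) exposes switches between incorrect answers.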

Core Entities

Models

Llama-3.1-405B, Llama-3.1-70B, Llama-3.1-8B, Llama-3-70B, Mistral-Large-123B, Mistral-Nemo-12B, Qwen-2-72B, Qwen-2-7B, Yi-1.5-34B

Metrics

Accuracy, consistency rate (CR)

Datasets

MMLU-Pro, AGIEval, MATH

Benchmarks

SCORE, MMLU-Pro, AGIEval, MATH