Overview
SCORE provides practical, repeatable robustness checks (prompt paraphrases, answer-choice order, sampling seeds) that teams can run quickly. The evidence covers three factual benchmarks (MMLU‑Pro, AGIEval, MATH); creative tasks are not evaluated.
Citations: 0
Evidence Strength: 0.90
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
Models evaluated with a single accuracy number can fail unpredictably in production. Measuring accuracy ranges and prediction consistency exposes reliability risks before deployment.
Summary TLDR
SCORE is an open evaluation framework that measures how stable LLM answers are when prompts, answer order, or sampling seeds change. The paper runs multi-prompt, choice-order, and non-greedy-seed tests on three factual benchmarks (MMLU‑Pro, AGIEval, MATH) across many open models. Key takeaway: single-point accuracy hides real-world fragility. Accuracy can swing by several percentage points across runs, and consistency rates are often well below 100%. The authors release code and a public robustness leaderboard.
Problem Statement
Standard LLM reports quote one tuned accuracy per dataset. That hides how much real outputs change when prompts are paraphrased, choices are reordered, or sampling seeds vary. Practitioners need a simple, repeatable way to measure both accuracy range and prediction consistency so they can judge reliability in production.
Main Contribution
Design and release SCORE, an open framework to measure non-adversarial robustness by repeating evaluations across prompts, choice orders, and random seeds.
Run SCORE on MMLU‑Pro, AGIEval, and MATH across many open LLMs and show that accuracy ranges and consistency rates vary widely.
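As a rough illustration of what a consistency rate can look like in code, here is a minimal sketch that scores each question by the fraction of run pairs giving the same answer, then averages over questions. This pairwise-agreement definition is an assumption for illustration; the paper's exact formula may differ. The function name `consistency_rate` and the nested-list input layout are hypothetical.

```python
from itertools import combinations


def consistency_rate(predictions_per_question):
    """Mean pairwise agreement across repeated runs (assumed definition).

    predictions_per_question: list of lists, where each inner list holds
    the answers a single question received across runs.
    """
    per_question = []
    for answers in predictions_per_question:
        pairs = list(combinations(answers, 2))
        if not pairs:
            # Fewer than two runs: agreement is undefined, so skip.
            continue
        agreeing = sum(a == b for a, b in pairs)
        per_question.append(agreeing / len(pairs))
    if not per_question:
        raise ValueError("need at least one question with two or more runs")
    return sum(per_question) / len(per_question)


runs = [
    ["A", "A", "A"],  # all three pairs agree
    ["B", "C", "B"],  # one of three pairs agrees
]
cr = consistency_rate(runs)
print(f"consistency rate: {cr:.3f}")  # prints "consistency rate: 0.667"
```

Because the score compares predictions with each other rather than with the gold label, it can be computed without an answer key.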
Key Findings
Single reported accuracy hides instability.
Prompt wording changes accuracy noticeably.
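To make the first finding concrete, the sketch below summarizes accuracy across several runs as a mean plus a min/max range, instead of a single number. The helper names (`accuracy`, `accuracy_summary`), the prompt labels, and the toy predictions are all hypothetical and do not come from the paper.

```python
def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    if len(preds) != len(gold):
        raise ValueError("preds and gold must have the same length")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def accuracy_summary(runs, gold):
    """Mean, min, and max accuracy over a list of prediction runs."""
    accs = [accuracy(run, gold) for run in runs]
    return {"mean": sum(accs) / len(accs), "min": min(accs), "max": max(accs)}


gold = ["A", "B", "C", "D"]
# Toy data: one run per prompt paraphrase.
runs_by_prompt = {
    "prompt_v1": ["A", "B", "C", "D"],  # 4/4 correct
    "prompt_v2": ["A", "B", "D", "D"],  # 3/4 correct
    "prompt_v3": ["A", "C", "D", "D"],  # 2/4 correct
}
summary = accuracy_summary(list(runs_by_prompt.values()), gold)
print(summary)  # {'mean': 0.75, 'min': 0.5, 'max': 1.0}
```

Reporting only the mean here (0.75) would hide a 50-point spread between the best and worst prompt.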
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 77.0% [75.3, 77.9] | — | — | AGIEval (aggregated) | Table 7 aggregated results | Table 7 |
| Aggregated consistency rate (Llama-3.1 405B) | 87.3% | — | — | AGIEval (aggregated) | Table 7 aggregated results | Table 7 |
What To Try In 7 Days
Run SCORE-style tests on your model: 8–10 prompt paraphrases, a few choice-order permutations, and 3–5 non-greedy seeds.
Report min/max accuracy and consistency rate alongside mean accuracy in internal model cards.
If consistency is low, switch to deterministic (greedy) decoding or ensemble multiple responses (for example, by majority vote) for critical factual outputs.
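Two of the steps above can be sketched in a few lines: shuffling the answer choices while tracking where the correct option ends up, and taking a majority vote over several sampled responses. This is an illustrative sketch, not SCORE's implementation; `shuffle_choices`, `majority_vote`, and the example question are hypothetical.

```python
import random
from collections import Counter


def shuffle_choices(choices, gold_index, rng):
    """Return the choices in a random order plus the new gold position."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # The gold option moved to wherever its old index landed in `order`.
    new_gold_index = order.index(gold_index)
    return shuffled, new_gold_index


def majority_vote(answers):
    """Most frequent answer; ties go to the earliest answer in the list."""
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer


rng = random.Random(0)  # fixed seed so the permutation is repeatable
choices = ["Paris", "Rome", "Berlin", "Madrid"]
shuffled, new_gold = shuffle_choices(choices, gold_index=0, rng=rng)
print(shuffled, "gold is now at position", new_gold)

voted = majority_vote(["B", "A", "B", "C", "B"])
print(voted)  # prints "B"
```

When the choice order differs between runs, vote on the answer text rather than the option letter, since the same letter can point to different options.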
Risks & Boundaries
Limitations
Benchmark limited to three factual datasets (MMLU‑Pro, AGIEval, MATH); creative or subjective tasks not covered.
Heavy focus on multiple-choice questions (MCQs) makes answer parsing and stability measurement easier but narrows the scope.
When Not To Use
When you need adversarial robustness (SCORE uses non-adversarial perturbations).
For subjective or creative tasks where 'consistency' is ill-defined.
Failure Modes
Models can change predictions without score change (switching between incorrect answers).
Position bias in MCQs can inflate or deflate scores depending on option layout.
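Both failure modes can be checked with simple counts. The sketch below counts questions where two runs are both wrong but disagree with each other, and computes how often each option letter is picked as a rough position-bias signal. The function names and toy data are hypothetical, and the code assumes options are labeled with consecutive letters starting at "A".

```python
from collections import Counter


def wrong_to_wrong_flips(run_a, run_b, gold):
    """Count questions where both runs are wrong but give different answers."""
    return sum(
        a != g and b != g and a != b
        for a, b, g in zip(run_a, run_b, gold)
    )


def position_pick_rate(preds, num_options):
    """Share of predictions landing on each option letter (A, B, ...)."""
    if not preds:
        raise ValueError("preds must not be empty")
    letters = [chr(ord("A") + i) for i in range(num_options)]
    counts = Counter(preds)
    return {letter: counts[letter] / len(preds) for letter in letters}


gold = ["A", "B", "C", "D"]
run_a = ["A", "C", "C", "A"]  # correct on questions 1 and 3
run_b = ["A", "D", "C", "B"]  # correct on questions 1 and 3

flips = wrong_to_wrong_flips(run_a, run_b, gold)
print(flips)  # prints 2: same accuracy in both runs, yet two answers changed

rates = position_pick_rate(run_a, num_options=4)
print(rates)  # {'A': 0.5, 'B': 0.0, 'C': 0.5, 'D': 0.0}
```

A pick rate far from the gold-label distribution across positions suggests the model favors certain slots, which is worth checking against shuffled-choice runs.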

