Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Models that report a single accuracy number can fail unpredictably in production; measuring accuracy ranges and consistency reveals reliability risks before deployment.
Summary TLDR
SCORE is an open evaluation framework that measures how stable LLM answers are when prompts, answer order, or sampling seeds change. The paper runs multi-prompt, choice-order, and non-greedy-seed tests on three factual benchmarks (MMLU‑Pro, AGIEval, MATH) and many open models. Key takeaway: single-point accuracy hides real-world fragility, because accuracy can swing by several percentage points and consistency rates are often well below 100%. The authors release code and a public robustness leaderboard.
Problem Statement
Standard LLM reports quote one tuned accuracy per dataset. That hides how much real outputs change when prompts are paraphrased, choices are reordered, or sampling seeds vary. Practitioners need a simple, repeatable way to measure both accuracy range and prediction consistency so they can judge reliability in production.
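As a rough illustration of the two quantities named above, the sketch below computes an accuracy range and a consistency rate from repeated runs. The pairwise-agreement definition of consistency, the function names, and the toy data are assumptions made for illustration; this summary does not state the exact formula SCORE uses.

```python
from itertools import combinations


def accuracy_range(runs, gold):
    """Return (min, mean, max) accuracy; runs[r][q] is the answer to question q in run r."""
    accs = [sum(p == g for p, g in zip(run, gold)) / len(gold) for run in runs]
    return min(accs), sum(accs) / len(accs), max(accs)


def consistency_rate(runs):
    """Assumed definition: mean over questions of the share of run pairs that agree."""
    pairs = list(combinations(range(len(runs)), 2))
    n_questions = len(runs[0])
    total = 0.0
    for q in range(n_questions):
        agree = sum(runs[i][q] == runs[j][q] for i, j in pairs)
        total += agree / len(pairs)
    return total / n_questions


# Toy data (hypothetical): 3 runs over 4 questions.
gold = ["A", "B", "C", "D"]
runs = [
    ["A", "B", "C", "A"],
    ["A", "B", "D", "B"],
    ["A", "C", "C", "C"],
]

lo, mean, hi = accuracy_range(runs, gold)
print(f"accuracy min={lo:.2f} max={hi:.2f}")          # accuracy min=0.50 max=0.75
print(f"consistency rate={consistency_rate(runs):.3f}")  # consistency rate=0.417
```

In the toy data the last question is answered incorrectly in every run, each time with a different option, so it never affects accuracy yet contributes zero agreement to the consistency rate.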
Main Contribution
Design and release SCORE, an open framework to measure non-adversarial robustness by repeating evaluations across prompts, choice orders, and random seeds.
Run SCORE on MMLU‑Pro, AGIEval, and MATH across many open LLMs and show that accuracy ranges and consistency rates vary widely.
Show that higher accuracy or larger model size does not always mean more stable predictions; publish code and a public leaderboard to track robustness.
Key Findings
Single reported accuracy hides instability.
Prompt wording changes accuracy noticeably.
Changing choice order shifts accuracy.
Sampling randomness affects outputs even when accuracy holds.
Model size is not a reliable proxy for stability.
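A choice-order test of the kind described in these findings can be sketched as follows. This is a minimal sketch, not the released SCORE code: `permute_choices` and `format_prompt` are hypothetical helpers, and the lettered prompt layout is an assumption.

```python
import random


def permute_choices(choices, gold_index, seed):
    """Shuffle the answer options and return (shuffled_choices, new_gold_index)."""
    rng = random.Random(seed)          # seeded so each permutation is reproducible
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # The gold answer moves to wherever its original index landed.
    return shuffled, order.index(gold_index)


def format_prompt(question, choices):
    """Render a multiple-choice prompt with lettered options (supports up to 10)."""
    letters = "ABCDEFGHIJ"
    lines = [question] + [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    return "\n".join(lines)


# Usage: build several reordered variants of the same question.
choices = ["red", "green", "blue", "yellow"]
for seed in range(3):
    shuffled, new_gold = permute_choices(choices, gold_index=2, seed=seed)
    prompt = format_prompt("Which color has the shortest wavelength?", shuffled)
    assert shuffled[new_gold] == "blue"  # the gold label follows the option text
```

Scoring each variant against its remapped gold index keeps accuracy comparable across orderings, so any difference between variants reflects sensitivity to layout rather than a labeling mismatch.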
Results
Accuracy
Aggregated consistency rate (Llama-3.1 405B)
MATH maximum consistency
Who Should Care
What To Try In 7 Days
Run SCORE-style tests on your model: 8–10 prompt paraphrases, a few choice-order permutations, and 3–5 non-greedy seeds.
Report min/max accuracy and consistency rate alongside mean accuracy in internal model cards.
If consistency is low, switch to deterministic (greedy) decoding or ensemble multiple responses for critical factual outputs.
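One simple way to ensemble multiple responses is a majority vote over sampled answers. The sketch below assumes answers have already been parsed into short labels; `majority_vote` and the 0.6 agreement threshold are illustrative choices, not something prescribed by the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return (most frequent answer, its vote share); ties go to the answer seen first."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Usage: five sampled responses to the same question (hypothetical labels).
samples = ["B", "A", "B", "C", "B"]
answer, share = majority_vote(samples)
print(answer, share)  # B 0.6

# Assumed policy: flag weakly supported answers for review instead of trusting them.
needs_review = share < 0.6
```

The vote share doubles as a cheap per-question stability signal: a low share means the samples disagree, which is the same symptom a low consistency rate reports in aggregate.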
Optimization Features
Infra Optimization
- Runs evaluations on NVIDIA A100 80GB nodes, with models converted to TRT-LLM.
Inference Optimization
- Converts models to TRT-LLM (TensorRT-LLM) for faster evaluation.
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benchmark limited to three factual datasets (MMLU‑Pro, AGIEval, MATH); creative or subjective tasks not covered.
- Heavy focus on multiple-choice questions (MCQs) makes answer parsing and stability measurement easier but narrows scope.
- Possible dataset contamination risks remain; authors note public datasets may be in training data.
- Computational cost grows with more repeats and larger models.
When Not To Use
- When you need adversarial robustness (SCORE uses non-adversarial perturbations).
- For subjective or creative tasks where 'consistency' is ill-defined.
- If you cannot afford repeated evaluations across prompts and seeds.
Failure Modes
- Models can change predictions without any change in accuracy, for example by switching between different incorrect answers.
- Position bias in MCQs can inflate or deflate scores depending on option layout.
- Sampling randomness (seed) can produce different answers even with similar accuracy.
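Position bias can be checked roughly by counting which option slot a model picks across shuffled runs. The sketch below is an assumption-laden diagnostic, not part of SCORE as described here: `position_pick_rates` is a hypothetical helper, and it presumes predictions are recorded as 0-based option positions.

```python
from collections import Counter


def position_pick_rates(predicted_positions, n_options):
    """Fraction of predictions that landed on each option position (0-based)."""
    counts = Counter(predicted_positions)
    total = len(predicted_positions)
    return [counts.get(pos, 0) / total for pos in range(n_options)]


# Usage: positions chosen across 8 shuffled prompts with 4 options each (toy data).
picks = [0, 0, 0, 1, 2, 3, 0, 0]
rates = position_pick_rates(picks, n_options=4)
print(rates)  # [0.625, 0.125, 0.125, 0.125]
```

When options are shuffled uniformly, the correct answer is spread evenly over positions, so pick rates far from uniform (as in the toy data, where the first slot dominates) suggest the model favors a position rather than the content.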
Core Entities
Models
- Llama-3.1-405B
- Llama-3.1-70B
- Llama-3.1-8B
- Llama-3-70B
- Mistral-Large-123B
- Mistral-Nemo-12B
- Qwen-2-72B
- Qwen-2-7B
- Yi-1.5-34B
Metrics
- Accuracy
- consistency rate (CR)
Datasets
- MMLU-Pro
- AGIEval
- MATH
Benchmarks
- SCORE
- MMLU-Pro
- AGIEval
- MATH

