Overview
Production Readiness
0.6
Novelty Score
0.3
Cost Impact Score
0.5
Citation Count
19
Why It Matters For Business
Choosing the right metrics avoids misleading conclusions about model quality and reduces costly deployment mistakes; adding uncertainty and bias checks makes model comparisons actionable and safer.
Summary TLDR
This paper surveys the metrics used to evaluate large language models (LLMs). It groups metrics into three practical families: Multiple-Classification (labels), Token-Similarity (text overlap/semantic), and Question-Answering (span/rank). For each metric it gives formulas, simple statistical interpretations, common Python implementations, and examples from biomedical LLM papers. The authors flag two major gaps: imperfect gold standards (noisy references) and missing statistical inference (no confidence intervals). They recommend combining metrics, borrowing bias-correction ideas from diagnostic studies, and adding uncertainty estimates for robust comparisons.
Problem Statement
LLM research uses many metrics but often ignores their statistical meaning, blind spots, and practical limits. Researchers need clear guidance on which metrics measure what, how to interpret scores, and how to handle noisy reference data and uncertainty when comparing models.
Main Contribution
Organizes LLM metrics into three clear types: Multiple-Classification, Token-Similarity, and Question-Answering.
Derives mathematical formulas and plain statistical interpretations for common metrics (e.g., accuracy, F1, BLEU, ROUGE, perplexity, MRR).
Lists open-source implementations and repositories for these metrics in Python.
Shows how these metrics are used across recent biomedical LLMs and highlights common evaluation gaps.
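As an illustration of the kind of formulas the survey covers, here is a minimal Python sketch of three of the listed metrics using their standard textbook definitions: precision/recall/F1 for a binary label, perplexity from per-token log-probabilities, and mean reciprocal rank (MRR). The function names and input shapes are assumptions made for this example and are not taken from the paper or from any specific library.

```python
import math
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int],
                        positive: int = 1) -> Tuple[float, float, float]:
    """Binary precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]],
                         gold: Sequence[str]) -> float:
    """Average of 1/rank of the first correct answer (0 if it is absent)."""
    if not gold:
        return 0.0
    total = 0.0
    for candidates, answer in zip(ranked, gold):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == answer:
                total += 1.0 / rank
                break
    return total / len(gold)
```

For example, a model that assigns probability 0.25 to every token has a perplexity of 4, which reads as "as uncertain as a uniform choice among four tokens".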
Key Findings
Most LLM papers rely heavily on Multiple-Classification (MC) metrics like accuracy, precision, recall and F1.
Token-similarity metrics (ROUGE, METEOR, BERTScore) are common for generation tasks, but each has blind spots: ROUGE treats all tokens equally, while METEOR and BERTScore rely on external tools and parameters.
Question-answering evaluation often uses rank-based metrics (Strict/Lenient Accuracy, MRR) because answers are identified by span and by rank.
Imperfect gold standards (noisy or ambiguous references) are a widespread and under-addressed problem in LLM evaluation.
Most LLM evaluations report point metrics without uncertainty; the paper calls out the absence of statistical inference and confidence intervals.
Biomedical literature volume motivates efficient LLM evaluation: over 3,000 new peer-reviewed articles are published daily, increasing demand for reliable metric pipelines.
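The ROUGE blind spot noted above can be seen in a small sketch. The code below is a simplified ROUGE-1 (clipped unigram overlap with lowercase whitespace tokenization), written for illustration only; production implementations add stemming and other preprocessing, and the example sentences are invented.

```python
from collections import Counter
from typing import Tuple


def rouge1(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Simplified ROUGE-1: precision, recall, F1 over clipped unigram counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


reference = "the drug reduces blood pressure"
wrong_term = "the drug reduces blood sugar"     # changes the clinical meaning
dropped_article = "drug reduces blood pressure"  # only loses "the"

# Both candidates miss exactly one reference token, so recall is identical
# even though only the first one is factually wrong.
print(rouge1(wrong_term, reference)[1], rouge1(dropped_article, reference)[1])
```

Because every token carries the same weight, replacing a key term costs exactly as much recall as dropping an article, which is one reason to pair ROUGE with a semantic metric.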
Who Should Care
What To Try In 7 Days
Inventory the metrics used in your stack and note which families are missing: Multiple-Classification (MC), Token-Similarity (TS), or Question-Answering (QA).
Add bootstrap confidence intervals to reported metrics for two key tasks.
For generation tasks, report ROUGE alongside BERTScore or METEOR to capture both lexical and semantic quality.
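The bootstrap step above could look roughly like the following. This is a minimal percentile-bootstrap sketch using only the standard library; the function names, the default of 1,000 resamples, and the 95% level are assumptions chosen for illustration, not recommendations from the paper.

```python
import random
from typing import Callable, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true: Sequence[int], y_pred: Sequence[int],
                 metric: Callable[[Sequence[int], Sequence[int]], float],
                 n_resamples: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for `metric` at level 1 - alpha."""
    n = len(y_true)
    if n == 0:
        raise ValueError("need at least one example")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    scores = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Calling `bootstrap_ci(y_true, y_pred, accuracy)` returns a `(lower, upper)` pair to report next to the point estimate. When two models' intervals overlap heavily, a small difference in the point metric is weak evidence for preferring one over the other.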
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Not exhaustive; focuses on the most common metrics only.
- Does not prescribe fixed metric-dataset pairings because datasets can be evaluated in multiple ways.
- No new metrics or automated pipelines provided; mostly interpretive guidance.
When Not To Use
- If you need an automated end-to-end evaluation pipeline (the paper is a survey).
- When labels are highly ambiguous, or when free-form dialogue answers need semantic judgment and have no reference spans.
Failure Modes
- Relying solely on F1 or accuracy on imbalanced data can hide poor performance on minority classes.
- Using LLM-generated gold standards may inject hallucination bias into evaluations.
- Reporting point estimates without uncertainty can lead to overconfident model selection.
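The first failure mode can be reproduced with a toy example. The sketch below uses invented data (95 negatives, 5 positives) and a degenerate model that always predicts the majority class; the helper names are hypothetical. For single-label tasks micro-F1 equals accuracy, so it hides the problem in the same way, whereas per-class and macro-averaged F1 expose it.

```python
from typing import Dict, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[int, float]:
    """F1 for each label, using F1 = 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented imbalanced data: the model never predicts the minority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("per-class F1:", per_class_f1(y_true, y_pred))
print("macro-F1:", macro_f1(y_true, y_pred))
```

Accuracy is 0.95 here while the minority-class F1 is 0, so reporting per-class scores alongside the aggregate makes the gap visible.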
Core Entities
Models
- BioBERT
- BioGPT
- BioLinkBERT
- BioMegatron
- ClinicalBERT
- MedPaLM
- MedPaLM2
- PubMedBERT
- SciBERT
- SciFive
- RoBERTa
Metrics
- Accuracy
- Precision
- Recall
- F1 / micro-F1 / macro-F1
- AUC / PRAUC
- Perplexity
- BLEU
- ROUGE-1/2/L
- METEOR
- BERTScore
- SaCC (Strict Accuracy)
- LaCC (Lenient Accuracy)
- MRR
Datasets
- PubMed
- PMC
- BC5CDR
- NCBI Disease
- MedNLI
- CHEMPROT
- BLURB
- SQuAD
Benchmarks
- NER
- RE
- QA
- Text Summarization (TS)

