Automatic Evaluation Papers — Parsed & Scored for Practitioners

Finetune a 65B LLM on a single 48GB GPU by training 4-bit models with adapters

0.80

0.90

485

QLoRA drastically lowers hardware cost and complexity for finetuning large LLMs, enabling teams to build custom chatbots and models on single consumer or pro GPUs and therefore speed development, lower cloud spend, and protect data privacy.

Key finding

QLoRA reduces the memory needed to finetune a 65B model from more than 780 GB to under 48 GB

Numbers: >780 GB -> <48 GB
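A back-of-the-envelope check of these figures can be sketched as follows. The per-parameter byte counts are an assumption (a common accounting for 16-bit finetuning with Adam), not a breakdown given in this summary, and the sketch ignores activations, adapter weights, and quantization overhead.

```python
# Rough memory estimate for finetuning a 65B-parameter model.
# ASSUMPTION: 2 bytes weights + 2 bytes gradients + 8 bytes Adam state
# per parameter for full 16-bit finetuning; not taken from the summary.
PARAMS = 65e9
BYTES_PER_PARAM_FULL = 2 + 2 + 8
BYTES_PER_PARAM_4BIT = 0.5  # 4 bits per frozen base weight


def gigabytes(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


full_gb = gigabytes(PARAMS * BYTES_PER_PARAM_FULL)
base_4bit_gb = gigabytes(PARAMS * BYTES_PER_PARAM_4BIT)

print(f"Full 16-bit finetuning (weights, grads, optimizer): ~{full_gb:.0f} GB")
print(f"Frozen 4-bit base weights only: ~{base_4bit_gb:.1f} GB")
```

Under these assumptions the 4-bit base weights leave headroom below 48 GB for adapters and activations, which is consistent with the reported reduction.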

ChatGPT can match commercial translators for well-resourced languages; GPT-4 and 'pivot prompting' fix many weaknesses.

0.60

0.20

0.60

313

Large LMs like ChatGPT can replace or augment translation stacks for many high-resource language needs. Using a stronger engine (GPT-4) or pivoting through a major language improves coverage for low-resource and distant pairs. This lowers integration time for prototyping and can cut reliance on commercial APIs for some

Key finding

Prompt wording matters but has only modest effect.

Numbers: Best prompt (TP3) BLEU=24.73 vs TP1=23.25 (Table 3).

A practical survey of how, where and what to test in large language models

0.70

0.40

0.60

195

Evaluation decides whether an LLM is fit for purpose: pick task‑specific tests, measure robustness and safety, and combine automated and human checks before deployment.

Key finding

No single benchmark or protocol reliably ranks all LLM capabilities.

Numbers: 46 popular benchmarks compiled (Sec.4, Table 7)

EvalPlus: auto-generated tests reveal up to ~29% lower pass rates and 11% bad 'ground-truth' in HumanEval

0.70

0.60

171

Small test suites give falsely high confidence in AI-generated code; automated, larger testing exposes real failure rates and helps select safer models for production.

Key finding

Automated augmentation increases tests per task from single-digit to hundreds.

Numbers: HumanEval avg tests 9.6 → HUMANEVAL+ avg 764.1
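The effect of test augmentation can be illustrated with a toy example. This is a minimal sketch, not EvalPlus itself: the task, the buggy candidate, and the random input generator are all hypothetical stand-ins, and random lists replace whatever generation strategy the paper uses.

```python
import random


def reference_is_sorted(xs):
    """Trusted ground-truth solution for the toy task."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def candidate_is_sorted(xs):
    """Hypothetical model-written solution with a bug:
    it only compares the first and last elements."""
    return len(xs) < 2 or xs[0] <= xs[-1]


def passes(candidate, reference, inputs):
    """A candidate passes if it agrees with the reference on every input."""
    return all(candidate(x) == reference(x) for x in inputs)


# A small hand-written suite, similar in size to a single-digit test set.
small_suite = [[], [1], [1, 2, 3], [3, 2, 1]]

# Augmented suite: the same tests plus many generated inputs.
rng = random.Random(0)
augmented_suite = small_suite + [
    [rng.randint(0, 9) for _ in range(rng.randint(0, 6))] for _ in range(500)
]

passes_small = passes(candidate_is_sorted, reference_is_sorted, small_suite)
passes_augmented = passes(candidate_is_sorted, reference_is_sorted, augmented_suite)
print("passes small suite:", passes_small)
print("passes augmented suite:", passes_augmented)
```

The buggy candidate looks correct on the small suite and is only exposed once the reference solution is compared against it on many more inputs.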

MEDITRON: open-source 7B and 70B medical LLMs trained on a 48B-token curated medical corpus

0.30

0.60

0.50

117

MEDITRON offers a strong, open-source medical LLM that rivals much larger closed models on standard benchmarks, enabling in-house finetuning, auditing, and deployment experiments while avoiding vendor lock-in—though it is not yet production-ready for clinical use.

Key finding

MEDITRON obtains consistent accuracy gains on medical benchmarks over open baselines.

Numbers: Avg accuracy +6% vs best public baseline in class; +3% vs finetuned Llama-2 (reported)

ChatGPT often matches fine-tuned models on query/aspect summarization using zero-shot prompts

0.60

0.30

0.70

89

You can often skip costly fine-tuning and get usable aspect/query summaries by prompting ChatGPT zero-shot, but expect issues with very short target summaries and long documents unless you add retrieval or truncation.

Key finding

Zero-shot ChatGPT achieves comparable ROUGE scores to fine-tuned models on several aspect/query datasets.

Numbers: NEWTS R-1: ChatGPT 32.54 vs FT 31.78 (Table 2)

Ragas: reference-free checks for RAG faithfulness, relevance, and context focus

0.60

0.50

65

Provides fast, automated checks to catch ungrounded answers and noisy retrieval, reducing time spent on manual labeling and lowering hallucination risk in RAG deployments.

Key finding

Ragas matches human judgements on faithfulness with very high accuracy.

Numbers: Faithfulness accuracy 0.95 on WikiEval (Table 1)
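A reference-free faithfulness score can be sketched as the fraction of answer statements that the retrieved context supports. This is only an illustration of the idea: the statement list and the `naive_is_supported` check below are hypothetical stand-ins (a real pipeline would typically use an LLM for both steps), and nothing here is the Ragas library's API.

```python
def faithfulness(statements, context, is_supported):
    """Fraction of answer statements judged to be supported by the context."""
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if is_supported(s, context))
    return supported / len(statements)


def naive_is_supported(statement, context):
    """HYPOTHETICAL verifier: case-insensitive substring match.
    A stand-in for a model-based entailment check."""
    return statement.lower() in context.lower()


context = "Paris is the capital of France. It lies on the Seine."
answer_statements = [
    "Paris is the capital of France",
    "It lies on the Seine",
    "Paris has 10 million residents",  # not grounded in the context
]

score = faithfulness(answer_statements, context, naive_is_supported)
print(f"faithfulness = {score:.2f}")
```

Two of the three statements are grounded, so the ungrounded claim lowers the score without needing a reference answer.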

Instruction tuning, not model size, drives LLM zero-shot news summarization; benchmark references are often worse than generated summaries.

0.60

0.35

0.40

64

If you want usable zero-shot news summaries, use instruction-tuned LLMs rather than simply the largest models; validate or replace public benchmark references before trusting automatic metrics.

Key finding

Instruction tuning yields much stronger zero-shot summarization than model scale.

Numbers: Zero-shot Instruct Davinci faithfulness 0.99 vs GPT-3 175B faithfulness 0.76 on CNN/DM (Table 2)

SEED-Bench: a 19K, 12-dimension multiple-choice benchmark for testing image and video LLM comprehension

0.40

0.45

0.30

52

SEED-Bench gives a large, objective test to reveal real weaknesses in multimodal models (OCR, spatial relations, temporal reasoning), so businesses should validate models on similar slices before deploying image/video features.

Key finding

SEED-Bench contains 19K human-verified multiple-choice questions across 12 dimensions.

Numbers: 19,242 questions; 12 dimensions

ChatGPT can judge summary factuality zero‑shot but shows lexical bias, false reasoning, and prompt sensitivity

0.60

0.40

0.70

50

ChatGPT offers a ready-to-use, zero-shot factuality evaluator that can reduce annotation and training costs and often aligns better with human judgments, but it needs calibration for paraphrase-heavy or domain-specific text.

Key finding

ChatGPT (zero-shot + CoT) often matches or beats prior factuality metrics on multiple benchmarks.

Numbers: CoGenSumm BA 74.3% vs SummaC ZS 70.4%; SummEval 83.3% vs 78.7% (Table 2)

Black-box prompts plus sampling help, but LLMs stay overconfident and struggle to predict failures

0.40

0.45

0.55

49

When deploying LLMs, naive verbalized confidence is unsafe: models often claim 80–100% confidence even when wrong, so use sampling + aggregation and validate calibration before trusting outputs.

Key finding

LLMs output verbalized confidences heavily skewed to high values (80–100%), causing overconfidence.

Numbers: confidence values mostly in 80–100% range; many expressed in multiples of 5
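The "sampling + aggregation" idea and a basic calibration check can be sketched as follows. The function names and toy data are assumptions for illustration; the paper's exact aggregation rules are not given in this summary, so majority agreement is used here as one simple option.

```python
from collections import Counter


def consistency_confidence(sampled_answers):
    """Return the majority answer and the share of samples that agree with it."""
    answer, count = Counter(sampled_answers).most_common(1)[0]
    return answer, count / len(sampled_answers)


def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus observed accuracy (positive = overconfident)."""
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Toy data: five sampled answers to one question.
answer, agreement = consistency_confidence(["42", "42", "41", "42", "40"])
print(answer, agreement)

# Toy data: verbalized confidences clustered in the 80-100% range,
# while only half of the answers are actually correct.
gap = overconfidence_gap([0.90, 0.95, 0.85, 0.90], [1, 0, 1, 0])
print(f"overconfidence gap = {gap:.2f}")
```

A large positive gap on a held-out validation set is a signal that verbalized confidence should not be used directly as a decision threshold.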

Systematic test shows current detectors fail to reliably spot ChatGPT text

1.00

36

Current off-the-shelf detectors miss most ChatGPT outputs while rarely mislabeling human text; companies cannot depend on these tools alone for content safety or compliance.

Key finding

No evaluated detector consistently detects ChatGPT-generated text.

Numbers: Best observed TPR ≤ 47.3% on the paper's Table I
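For readers who want to run the same kind of check on their own detector, the true-positive and false-positive rates can be computed as below. The labels and predictions are made-up toy values, not data from the paper.

```python
def detection_rates(labels, predictions):
    """Return (TPR, FPR), where label 1 = machine-generated, 0 = human-written."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


# Toy example: the detector flags only one of four generated texts
# and never flags human text.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 0, 0, 0]

tpr, fpr = detection_rates(labels, predictions)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

A low TPR combined with a low FPR matches the pattern described above: human text is rarely mislabeled, but most generated text slips through.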

Clinical Camel: an open medical LLM fine-tuned with dialogue synthesis and single‑GPU QLoRA

0.20

0.60

0.40

35

An open, high-performing medical LLM reduces vendor lock-in, enables internal validation, and can be reproduced with modest compute, letting institutions experiment safely before any clinical adoption.

Key finding

Clinical Camel-70B beats GPT-3.5 on several medical QA benchmarks in five-shot tests.

Numbers: USMLE 64.3% vs GPT-3.5 58.5%; PubMedQA 77.9% vs 60.2%

Psy-LLM: fine-tuned Chinese LLM for scalable online mental-health Q&A

0.40

0.35

0.60

34

An LLM-based Chinese Q&A assistant can reduce counsellor load, provide fast triage and scale service availability cheaply, but should be used under human supervision.

Key finding

PanGu 350M produced lower perplexity than WenZhong on the evaluation data.

Numbers: Perplexity: PanGu 34.56 vs WenZhong 38.40

Human judges prefer LLM summaries; reference summaries often contain more hallucinations.

0.70

0.40

0.60

32

Zero-shot LLMs can produce higher-quality, more factual summaries than many human references and fine-tuned models, so businesses can often deploy LLM summarization directly and shift effort to dataset curation and verification.

Key finding

Human judges prefer LLM summaries over human-written and fine-tuned model summaries in pairwise comparisons.

Numbers: Human preference scores for LLMs exceed 50% across tasks (Figure 4).

LLMs (text‑davinci‑003, ChatGPT) can mimic expert human ratings on story quality and adversarial text, cheaply and reproducibly.

0.60

0.40

0.80

31

LLMs can provide fast, cheap, and reproducible quality checks during model development, reducing reliance on costly expert rounds while preserving relative system comparisons.

Key finding

A strong InstructGPT (text‑davinci‑003) correlates positively with expert teacher ratings on individual stories.

Numbers: Kendall's τ up to 0.38 (relevance) vs teachers
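Kendall's τ measures how often two raters order pairs of items the same way. The sketch below implements the simple tau-a variant on invented ratings; which variant the paper reports is not stated in this summary, so treat the choice as an assumption.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    Tied pairs count as neither concordant nor discordant."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented ratings for five stories.
llm_ratings = [1, 2, 3, 4, 5]
teacher_ratings = [2, 1, 3, 5, 4]

tau = kendall_tau_a(llm_ratings, teacher_ratings)
print(f"tau = {tau:.2f}")
```

Here 8 of 10 pairs are ordered the same way and 2 are reversed, giving τ = 0.6; a value of 0.38 therefore indicates positive but far from perfect agreement.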

Practical review: how large language models can help — and where they fall short — in language teaching and automated assessment

0.40

0.35

0.60

31

LLMs let EdTech scale content creation and interactive features quickly, but they add compute cost and require human oversight to avoid quality, bias and calibration problems.

Key finding

LLMs produce better open-ended text generation than prior small models, enabling plausible content generation for reading and chat practice.

LLM graders are biased by answer position; simple calibration and a little human help fix it

0.60

0.45

0.60

29

If you auto-grade or compare models with LLMs, order effects can flip results and mislead decisions; applying Multiple Evidence Calibration (MEC) and Balanced Position Calibration (BPC), plus targeted human checks, improves reliability and cuts annotation cost.

Key finding

LLM evaluators frequently conflict when candidate order is swapped.

Numbers: GPT-4 conflict rate 46.3% (Vicuna vs ChatGPT); ChatGPT 82.5% (Table 2)
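The conflict rate and a position-balanced verdict can be sketched with a toy judge. Everything here is hypothetical: `biased_score` simulates an evaluator that adds a fixed bonus to whichever answer appears first, and answers are represented only by a numeric quality. Averaging scores over both orders is shown as one way to balance positions; evidence-based calibration is not covered.

```python
def biased_score(quality, position):
    """HYPOTHETICAL judge score: true quality plus a bonus for position 0."""
    return quality + (1.0 if position == 0 else 0.0)


def judge(first, second):
    """Return which slot wins when answers are shown in the given order."""
    return "first" if biased_score(first, 0) >= biased_score(second, 1) else "second"


def conflict_rate(pairs):
    """Share of pairs whose winner changes when the order is swapped."""
    conflicts = 0
    for a, b in pairs:
        winner_original = a if judge(a, b) == "first" else b
        winner_swapped = b if judge(b, a) == "first" else a
        conflicts += winner_original != winner_swapped
    return conflicts / len(pairs)


def balanced_winner(a, b):
    """Average each answer's score over both positions before comparing."""
    score_a = (biased_score(a, 0) + biased_score(a, 1)) / 2
    score_b = (biased_score(b, 0) + biased_score(b, 1)) / 2
    return a if score_a >= score_b else b


pairs = [(5.0, 5.5), (3.0, 8.0), (7.0, 6.5), (2.0, 9.0)]
rate = conflict_rate(pairs)
print(f"conflict rate = {rate:.2f}")
print("balanced winner of (5.0, 5.5):", balanced_winner(5.0, 5.5))
```

In this toy setup the two close pairs flip when swapped, while the balanced comparison returns the higher-quality answer regardless of order.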

Train a lightweight judge-model (PandaLM) to pick better hyperparameters for instruction-tuned LLMs, reducing human/API cost while matching/

0.60

0.55

0.70

28

PandaLM reduces the cost and privacy risk of hyperparameter tuning by replacing paid API or large-scale human evaluation with a runnable judge model that selects better tuning settings.

Key finding

PandaLM-70B matches or slightly exceeds GPT-4 on a human-aligned test set.

Numbers: PandaLM-70B accuracy 0.6687 vs GPT-4 0.6647 (Table 2)

How LLMs are reshaping healthcare: capabilities, data needs, risks, and where to start

0.40

0.30

0.70

28

LLMs now match or approach clinician-level performance on some exam-style tasks and can speed documentation, triage, and literature review—but risk, privacy, and integration costs mean businesses must plan governance and hybrid human+AI workflows.

Key finding

Top LLMs approach human performance on exam-style medical questions.

Numbers: USMLE: GPT-4 86.7%, Med‑PaLM 2 86.5%, Human 87.0%

LLMs (GPT-3.5 / GPT-4) can handle document translation and often beat commercial MT by human judgment

0.60

0.40

0.50

27

LLMs (especially GPT-4) can produce more coherent, human-preferred document translations; firms should test LLMs for end-user quality, not just automatic scores.

Key finding

Human raters prefer GPT-4 outputs over commercial MT systems on document translation.

Numbers: Human average (general/discourse): GPT-4 3.0/3.1 vs Google 1.7/1.8 (Table 4)

IFEval: an automatic benchmark that checks whether LLMs obey concrete, machine-checkable instructions

0.60

0.40

0.50

26

IFEval gives a fast, repeatable way to measure whether models obey concrete user constraints, so product and engineering teams can track regressions and prioritize fixes.

Key finding

IFEval defines 25 instruction types and provides 541 prompts.

Numbers: 25 types; 541 prompts
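"Machine-checkable" means each instruction can be verified by a small deterministic function. The checkers below are hypothetical examples written for illustration; they are not taken from the benchmark's code, and their names are assumptions.

```python
def at_least_n_words(n):
    """Build a checker for 'answer with at least n words'."""
    return lambda response: len(response.split()) >= n


def no_commas(response):
    """Checker for 'do not use any commas'."""
    return "," not in response


def ends_with(phrase):
    """Build a checker for 'finish your response with this exact phrase'."""
    return lambda response: response.rstrip().endswith(phrase)


def follows_all(response, checks):
    """A response complies only if every attached check passes."""
    return all(check(response) for check in checks)


checks = [at_least_n_words(5), no_commas, ends_with("Thank you.")]

good = "Here is the short summary you requested. Thank you."
bad = "Sure, here it is."

print(follows_all(good, checks))
print(follows_all(bad, checks))
```

Because every check is deterministic, the same prompts can be re-run after each model update to track regressions without human grading.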

ChartLlama: a multimodal LLM trained on GPT‑4‑synthesized chart data for chart understanding and generation

0.60

0.70

0.50

25

Companies that need automated reading, generation, or editing of charts can improve accuracy and add code-generation features by training multimodal models on synthetic, code‑paired chart datasets.

Key finding

ChartLlama improves ChartQA accuracy versus prior open models on evaluated splits.

Numbers: ChartQA average: ChartLlama 69.66 vs UniChart 66.24 (Table 2/5)

Do LLMs write like a personality? GPT-3.5 and GPT-4 can be prompted to express Big Five traits and people often recognize them

0.40

0.60

0.30

24

Prompted LLMs can reliably take on personality-like profiles and produce believable narratives; this matters for products that personalize voice, human simulation, or content moderation and suggests disclosure policies are needed.

Key finding

LLM personas' self-reported BFI scores match their prompted traits with very large effects.

Numbers: GPT-4 Cohen's d: EXT 5.47; AGR 4.22; CON 4.39; NEU 5.17; OPN 6.30 (p<.001)
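Cohen's d expresses the difference between two group means in units of their pooled standard deviation, so values above 4 indicate almost no overlap between groups. A minimal sketch with invented trait scores (not data from the paper) follows.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    n_a, n_b = len(group_a), len(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Invented questionnaire scores for personas prompted high vs low on one trait.
prompted_high = [4.5, 4.8, 4.6, 4.7]
prompted_low = [1.5, 1.8, 1.6, 1.7]

d = cohens_d(prompted_high, prompted_low)
print(f"Cohen's d = {d:.2f}")
```

For comparison, d = 0.8 is conventionally called a large effect, which is why the reported values are described as very large.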