LLM Evaluation Papers — Parsed & Scored for Practitioners

Llama 2: open release of 7B–70B pretrained models and RLHF‑tuned chat models competitive in human evaluations

0.70

0.30

0.60

2,595

Llama 2 provides openly available pretrained and RLHF‑tuned chat models that are competitive with closed models on many human-evaluated tasks, lowering the entry cost for companies that need high-quality chat AI while allowing customization and internal safety tuning.

Key finding

Llama 2 pretrained on ~2 trillion tokens; models range 7B–70B parameters.

Numbers: 2.0T tokens; sizes 7B, 13B, 34B, 70B

GPT-4 exceeds USMLE pass threshold and outperforms prior models on medical benchmarks

0.30

0.60

0.70

497

GPT-4 can reliably answer medical multiple-choice questions and gives better-calibrated confidence scores than earlier models, making it useful for education, drafting clinical notes, and decision support prototypes, provided its outputs receive human oversight and validation.

Key finding

GPT-4 strongly outperforms GPT-3.5 on USMLE-style multiple-choice tests.

Numbers: USMLE Self Assessment overall: GPT-4 83.76% (zero-shot) vs GPT-3.5 49.1%

Finetune a 65B LLM on a single 48GB GPU by training 4-bit models with adapters

0.80

0.90

485

QLoRA drastically lowers the hardware cost and complexity of finetuning LLMs, enabling teams to build custom chatbots and models on a single consumer or professional GPU, which speeds development, lowers cloud spend, and protects data privacy.

Key finding

QLoRA reduces the memory needed to finetune a 65B model from more than 780 GB to under 48 GB

Numbers: >780 GB -> <48 GB
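A rough back-of-the-envelope check on these figures, as a sketch only. It assumes (the card does not state this breakdown) that regular 16-bit finetuning with an Adam-style optimizer costs about 12 bytes per parameter: 2 for weights, 2 for gradients, and 8 for two 32-bit optimizer states. It also assumes 4-bit storage costs 0.5 bytes per parameter. Activations, adapter weights, and quantization constants are ignored, which is why the quoted figures are bounds rather than exact values.

```python
PARAMS = 65e9  # 65B parameters, from the card above


def gigabytes(params: float, bytes_per_param: float) -> float:
    """Memory in GB (1 GB = 1e9 bytes) for a given per-parameter cost."""
    return params * bytes_per_param / 1e9


# Assumed breakdown: 2 B weights + 2 B gradients + 8 B optimizer states.
full_16bit = gigabytes(PARAMS, 2 + 2 + 8)

# Assumed: frozen base weights stored at 4 bits = 0.5 bytes per parameter.
frozen_4bit = gigabytes(PARAMS, 0.5)

print(f"16-bit full finetuning: ~{full_16bit:.0f} GB before activations")
print(f"4-bit frozen base model: ~{frozen_4bit:.1f} GB before adapters")
```

Under these assumptions the 4-bit base model leaves headroom on a 48 GB card for adapters, their optimizer states, and activations.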

Use strong LLMs (e.g., GPT-4) as scalable judges for human preference with checks for bias and math errors

0.70

0.60

0.70

433

High-quality LLMs (e.g., GPT-4) can automate preference labeling at ~80–85% human agreement, drastically cutting the time and cost of human evaluations for product iterations while remaining explainable.

Key finding

GPT-4 judgments align with human experts on non-tied MT-bench votes.

Numbers: 85% agreement (MT-bench non-tie, Table 5)
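A minimal sketch of how a non-tie agreement rate of this kind could be computed. It assumes that "non-tie" means dropping every comparison where either the judge or the human voted for a tie; the exact protocol behind the 85% figure may differ. The vote labels and the sample data are made up for illustration.

```python
from typing import List, Tuple


def non_tie_agreement(votes: List[Tuple[str, str]]) -> float:
    """Share of (judge_vote, human_vote) pairs that match, ignoring ties.

    Each vote is 'A', 'B', or 'tie' (hypothetical labels).
    """
    decided = [(j, h) for j, h in votes if j != "tie" and h != "tie"]
    if not decided:
        raise ValueError("no non-tied votes to compare")
    return sum(j == h for j, h in decided) / len(decided)


# Made-up votes: (LLM judge, human expert)
votes = [("A", "A"), ("B", "B"), ("A", "B"), ("tie", "A"), ("B", "B")]
rate = non_tie_agreement(votes)
print(f"non-tie agreement: {rate:.0%}")
```

Comparing this rate against the agreement between two human annotators on the same items gives a more meaningful baseline than 100%.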

A broad third-party benchmark shows ChatGPT is a strong zero-shot performer but an unreliable reasoner and prone to hallucination

0.70

0.25

0.75

352

ChatGPT is a practical zero-shot workhorse: it saves time on many tasks and can replace some fine-tuned models for quick proofs of concept, but its factual and reasoning errors mean you must validate outputs before customer-facing or safety-critical use.

Key finding

ChatGPT often outperforms prior zero-shot LLMs.

Numbers: 9/13 evaluated datasets (zero-shot comparisons)

ChatGPT can match commercial translators for well-resourced languages; GPT-4 and 'pivot prompting' fix many weaknesses.

0.60

0.20

0.60

313

Large LMs like ChatGPT can replace or augment translation stacks for many high-resource language needs. Using a stronger engine (GPT-4) or pivoting through a major language improves coverage for low-resource and distant pairs. This lowers integration time for prototyping and can cut reliance on commercial APIs for some

Key finding

Prompt wording matters but has only modest effect.

Numbers: Best prompt (TP3) BLEU=24.73 vs TP1=23.25 (Table 3).

A 50B-parameter LLM trained on ~700B tokens, specialized for financial NLP

0.60

0.45

0.80

299

A mid-size LLM trained with a large curated finance corpus yields big real-world gains on finance tasks while staying useful on general tasks, so firms can get domain accuracy without running huge models.

Key finding

Mixed training (curated finance + public data) yields strong finance performance without losing general abilities

Numbers: Training corpus: 363B financial + 345B public ≈ 709B tokens; trained on 569B tokens

Practical survey of why LLMs hallucinate, how we measure it, and what fixes work today

0.70

0.50

0.60

233

Hallucinations create real risks (misinformation, legal/medical errors, loss of trust). Businesses should treat factuality as a first-class metric when deploying LLMs in production.

Key finding

Hallucination is multi-dimensional: input-, context-, and fact-conflicting types require different tests and fixes.

Practical survey: taxonomy, causes, detection, benchmarks, and fixes for hallucination in LLMs

0.70

0.40

0.60

207

Hallucinations make LLM outputs untrustworthy for decisions or customer-facing answers; mapping causes to fixes helps reduce risk in search, chatbots, and recommendations.

Key finding

The paper redefines hallucination for LLMs into two main types: factuality (real-world fact mismatch) and faithfulness (deviation from instructions or context).

A practical survey of how, where and what to test in large language models

0.70

0.40

0.60

195

Evaluation decides whether an LLM is fit for purpose: pick task‑specific tests, measure robustness and safety, and combine automated and human checks before deployment.

Key finding

No single benchmark or protocol reliably ranks all LLM capabilities.

Numbers: 46 popular benchmarks compiled (Sec.4, Table 7)

EvalPlus: auto-generated tests reveal up to ~29% lower pass rates and 11% bad 'ground-truth' in HumanEval

0.70

0.60

171

Small test suites give falsely high confidence in AI-generated code; automated, larger testing exposes real failure rates and helps select safer models for production.

Key finding

Automated augmentation increases tests per task from single-digit to hundreds.

Numbers: HumanEval avg tests 9.6 → HUMANEVAL+ avg 764.1
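The effect described above can be illustrated with a toy example: a buggy solution passes a handful of hand-written tests but fails once many generated inputs are checked against a trusted reference. This is only a sketch of the general idea (random inputs plus a reference implementation); it is not EvalPlus's actual test generator, and all function names here are hypothetical.

```python
import random


def reference_max(xs):
    """Trusted reference solution used to compute expected outputs."""
    return max(xs)


def candidate_max(xs):
    """Buggy candidate: wrong whenever every value is negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def passes(candidate, inputs):
    """True if the candidate matches the reference on every input."""
    return all(candidate(xs) == reference_max(xs) for xs in inputs)


# A small hand-written suite, similar in size to a single-digit test set.
base_tests = [[1, 2, 3], [5, 4], [7]]

# Hundreds of generated inputs, including negative numbers.
rng = random.Random(0)
extra_tests = [
    [rng.randint(-50, 50) for _ in range(rng.randint(1, 5))]
    for _ in range(500)
]

passes_base = passes(candidate_max, base_tests)
passes_plus = passes(candidate_max, base_tests + extra_tests)
print(passes_base, passes_plus)
```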

Augment ChatGPT with retrieved evidence and automated feedback to cut hallucinations

0.60

0.55

0.45

144

You can keep using a black-box LLM while reducing harmful hallucinations by adding retrieval, evidence consolidation, and automated feedback—improving factuality with modest engineering instead of costly fine-tuning.

Key finding

Retrieving consolidated evidence raises knowledge grounding (KF1) by about +10 points on news dialog.

Numbers: KF1: 26.71 -> 36.41 (ChatGPT -> LLM-AUGMENTER, News Chat, Table 1)
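The retrieve, generate, check, and retry loop summarized above might look roughly like the sketch below. Every callable is a stub: `retrieve`, `generate`, and `utility` are hypothetical stand-ins for a retriever, a black-box LLM call, and a grounding score, and the word-overlap score is a crude illustration rather than the KF1 metric.

```python
def word_overlap(answer, evidence):
    """Crude grounding score: share of answer words found in the evidence."""
    evidence_words = set(" ".join(evidence).lower().split())
    words = answer.lower().split()
    return sum(w in evidence_words for w in words) / len(words)


def augmented_answer(query, retrieve, generate, utility,
                     threshold=0.9, max_tries=3):
    """Regenerate with feedback until the answer is grounded enough."""
    evidence = retrieve(query)
    feedback = ""
    answer = ""
    for _ in range(max_tries):
        answer = generate(query, evidence, feedback)
        if utility(answer, evidence) >= threshold:
            break
        feedback = "The answer is not supported by the evidence; revise it."
    return answer


# Stubs standing in for a real retriever and a real LLM call.
def stub_retrieve(query):
    return ["The match ended 2-1."]


def stub_generate(query, evidence, feedback):
    # First attempt invents a score; with feedback it sticks to the evidence.
    return evidence[0] if feedback else "The match ended 3-0."


final = augmented_answer("What was the score?", stub_retrieve,
                         stub_generate, word_overlap)
print(final)
```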

Reprogram frozen LLMs to forecast time series using text prototypes and Prompt-as-Prefix

0.70

0.60

0.70

127

You can add time series forecasting to an existing LLM deployment with little extra training and often better accuracy in low-data and cross-domain cases.

Key finding

TIME-LLM improves average long-term MSE over a fine-tuned LLM baseline (GPT4TS).

Numbers: ≈12% average MSE reduction vs GPT4TS on evaluated long-term benchmarks

MEDITRON: open-source 7B and 70B medical LLMs trained on a 48B-token curated medical corpus

0.30

0.60

0.50

117

MEDITRON offers a strong, open-source medical LLM that rivals much larger closed models on standard benchmarks, enabling in-house finetuning, auditing, and deployment experiments while avoiding vendor lock-in—though it is not yet production-ready for clinical use.

Key finding

MEDITRON obtains consistent accuracy gains on medical benchmarks over open baselines.

Numbers: Avg accuracy +6% vs best public baseline in class; +3% vs finetuned Llama-2 (reported)

Two-stage multimodal chain-of-thought lets sub‑1B models reason with images and text

0.60

0.45

96

You can get near-state-of-the-art multimodal reasoning with lightweight models by fine-tuning in two stages and fusing image features—this reduces hallucination and lowers compute cost versus running large multimodal LLMs.

Key finding

Predicting a rationale before the answer hurts small-model accuracy in one-stage text-only training.

Numbers: No-CoT 81.63% vs Reasoning 69.32% (↓12.31pp)

A systematic benchmark showing where GPT-style LLMs help — and where they fail — on practical chemistry tasks

0.40

0.35

0.50

91

LLMs can speed up human-in-the-loop chemistry tasks (text descriptions, candidate generation, reagent ranking) with few-shot prompts, but they are not yet reliable drop-in replacements for specialized models or automation pipelines where exact SMILES or reaction outcomes are needed.

Key finding

GPT-4 ranks best across the eight chemistry tasks.

Numbers: Average rank: GPT-4 = 1.25 (Table 2).

C-EVAL: 13.9k Chinese multiple-choice exam questions across 52 subjects, plus a HARD subset for advanced reasoning

0.70

0.50

0.60

90

C-EVAL exposes where LLMs built for Chinese-language users fail on domain knowledge and hard reasoning; testing with it reveals gaps you need to fix before product launch.

Key finding

Only GPT-4 exceeds 60% average accuracy on C-EVAL.

Numbers: GPT-4 average accuracy 66.4% (zero-shot AO, Table 3)

ChatGPT often matches fine-tuned models on query/aspect summarization using zero-shot prompts

0.60

0.30

0.70

89

You can often skip costly fine-tuning and get usable aspect/query summaries by prompting ChatGPT zero-shot, but expect issues with very short target summaries and long documents unless you add retrieval or truncation.

Key finding

Zero-shot ChatGPT achieves comparable ROUGE scores to fine-tuned models on several aspect/query datasets.

Numbers: NEWTS R-1: ChatGPT 32.54 vs FT 31.78 (Table 2)

Let multiple copies of an LLM debate to improve reasoning and reduce hallucinations

0.50

0.60

0.40

85

If accuracy matters more than latency, running several LLM copies that debate can materially reduce wrong answers and hallucinations, producing higher-quality outputs for QA, math, and plan generation.

Key finding

Multiagent debate raises arithmetic accuracy from 67.0% to 81.8% on their test set.

Numbers: Arithmetic: 67.0% → 81.8% (Table 1)
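A minimal sketch of a debate loop of this kind. The agents here are deterministic stubs instead of LLM calls, and two details are assumptions not stated in the card: every agent sees all peers' previous answers each round, and the final answer is chosen by majority vote.

```python
from collections import Counter
from typing import Callable, List

# An agent maps (question, peers' latest answers) to its own answer.
Agent = Callable[[str, List[str]], str]


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run several rounds of answer exchange, then take a majority vote."""
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # All agents update simultaneously from the previous round's answers.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return Counter(answers).most_common(1)[0][0]


def make_stub(first_guess: str) -> Agent:
    """Toy agent: switches only when all of its peers agree with each other."""
    def agent(question: str, peer_answers: List[str]) -> str:
        if not peer_answers:
            return first_guess
        top, count = Counter(peer_answers).most_common(1)[0]
        return top if count == len(peer_answers) else first_guess
    return agent


agents = [make_stub("12"), make_stub("12"), make_stub("15")]
final = debate("What is 3 * 4?", agents)
print(final)
```

With real models, each round costs one extra call per agent, which is the latency and cost trade-off the summary above refers to.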

A concise roadmap to multimodal LLMs: architectures, training recipes, evaluation, hallucination, and extensions

0.60

0.40

0.60

85

MLLMs let products combine vision and language: build image-aware assistants, document parsers, or multimodal agents. Focus on data quality, connector design, and safe alignment to reduce hallucinations before shipping.

Key finding

MLLMs are typically built from three modules: a pre-trained modality encoder, a pre-trained LLM, and a connector between them.

Let LLMs translate problems and a classical planner find correct, often optimal, plans

0.70

0.60

0.70

84

LLM+P turns LLMs into reliable natural-language front ends for proven symbolic planners. That reduces execution risk and often lowers real-world costs (e.g., fewer extra robot trips). It avoids expensive LLM fine-tuning by delegating correctness to existing planners.

Key finding

LLM+P produced correct or optimal plans in most evaluated domains while LLM-only methods usually failed.

Numbers: BLOCKSWORLD 90% (LLM 15–20%); GRIPPERS 95% (LLM 35%); STORAGE 85% (LLM 0%)
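The division of labor described above (the LLM translates, a classical planner searches) can be sketched on a toy domain. Everything below is a stand-in: `stub_llm_translate` replaces the LLM with a regular expression, the number puzzle replaces a real planning domain, and breadth-first search replaces an off-the-shelf planner. The point is only that the search, not the language model, guarantees a correct and shortest plan.

```python
import re
from collections import deque


def stub_llm_translate(text):
    """Stand-in for the LLM: natural language -> formal problem."""
    start, goal = (int(n) for n in re.findall(r"\d+", text)[:2])
    return {"start": start, "goal": goal}


# Toy domain: two actions on an integer state.
ACTIONS = {"add1": lambda x: x + 1, "double": lambda x: x * 2}


def bfs_planner(problem, limit=1000):
    """Stand-in for the classical planner: returns a shortest plan or None."""
    start, goal = problem["start"], problem["goal"]
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        for name, apply_action in ACTIONS.items():
            nxt = apply_action(state)
            if nxt <= limit and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [name]))
    return None


def llm_plus_planner(text):
    """Translate the request, plan symbolically, and render the plan as text."""
    plan = bfs_planner(stub_llm_translate(text))
    return "no plan found" if plan is None else " -> ".join(plan)


result = llm_plus_planner("Get from 2 to 9")
print(result)
```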

DeepSeek: scaling recipes and a 2T‑token bilingual pretraining run that yields 7B and 67B models competitive on code, math, and chat

0.70

0.60

0.70

82

The paper gives practical scaling recipes and hyperparameter fits so teams can plan compute, model size, and data investments more predictably; it shows a 67B open model can match or beat larger baselines on code/math when paired with curated bilingual data and alignment.

Key finding

Optimal batch size grows and optimal learning rate falls with compute; fitted power‑law relations give near‑optimal hyperparameters across budgets.

Numbers: near‑optimal region defined as ≤0.25% above min loss; fitted across 1e17–2e19 FLOPs
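A power-law relation of the form y = a * C^b is a straight line in log-log space, so it can be fitted with ordinary least squares on the logarithms. The sketch below shows that fit on synthetic data; the coefficients (2.0, 0.3, 5.0, -0.1) are made up for illustration and are not the paper's fitted values.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b


# Synthetic compute budgets (FLOPs) spanning the range quoted above.
compute = [1e17, 1e18, 1e19]

# Made-up "optimal" values that follow exact power laws.
batch_size = [2.0 * c ** 0.3 for c in compute]      # grows with compute
learning_rate = [5.0 * c ** -0.1 for c in compute]  # falls with compute

a_batch, b_batch = fit_power_law(compute, batch_size)
a_lr, b_lr = fit_power_law(compute, learning_rate)
print(f"batch size ~ {a_batch:.2f} * C^{b_batch:.2f}")
print(f"learning rate ~ {a_lr:.2f} * C^{b_lr:.2f}")
```

With real sweep results, `batch_size` and `learning_rate` would be the best values found at each compute budget, and the fitted exponents would be used to extrapolate to larger budgets.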

Small, irrelevant changes to Theory-of-Mind vignettes make GPT-3.5 fail

1.00

79

Relying on LLMs' apparent commonsense reasoning can be risky: models may fail on small, realistic changes and produce misleading outputs in user-facing scenarios.

Key finding

Making an opaque container transparent causes GPT-3.5 to predict the agent believes the wrong content.

Numbers: Variation 1A: P(chocolate)=95% vs P(popcorn)=0%

OpenAGI: an open platform that lets LLMs plan and call specialist models to solve multi-step tasks

0.50

0.60

0.45

76

OpenAGI shows you can compose existing specialist models under LLM control and use RL-style tuning to make smaller, cheaper models competitive—useful for building product workflows that call vision, text, or web tools.

Key finding

A large, general LLM (GPT-4) achieves the highest overall OpenAGI scores in both zero-shot and few-shot settings.

Numbers: GPT-4 overall: 0.2378 (zero-shot) -> 0.5281 (few-shot)