Benchmark Construction Papers — Parsed & Scored for Practitioners

Use strong LLMs (e.g., GPT-4) as scalable judges for human preference with checks for bias and math errors

0.70

0.60

0.70

433

High-quality LLMs (e.g., GPT-4) can automate preference labeling at ~80–85% human agreement, drastically cutting the time and cost of human evaluations for product iterations while remaining explainable.

Key finding

GPT-4 judgments align with human experts on non-tied MT-bench votes.

Numbers: 85% agreement (MT-bench non-tie, Table 5)
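The agreement figure above is computed on non-tied votes. A minimal sketch of that kind of calculation follows, assuming a comparison is dropped whenever either side calls a tie; the function name, the "tie" label, and the vote lists are hypothetical, and the paper's exact protocol may differ.

```python
from typing import Sequence


def non_tie_agreement(judge: Sequence[str], human: Sequence[str],
                      tie_label: str = "tie") -> float:
    """Share of comparisons where judge and human pick the same winner.

    Assumption: a comparison is skipped if either side labels it a tie.
    """
    if len(judge) != len(human):
        raise ValueError("judge and human vote lists must have equal length")
    # Keep only comparisons where both sides named a winner.
    pairs = [(j, h) for j, h in zip(judge, human)
             if j != tie_label and h != tie_label]
    if not pairs:
        raise ValueError("no non-tie comparisons to score")
    return sum(j == h for j, h in pairs) / len(pairs)


# Made-up votes for illustration only (not data from the paper).
judge_votes = ["A", "B", "A", "tie", "B", "A"]
human_votes = ["A", "B", "B", "A", "B", "tie"]
print(non_tie_agreement(judge_votes, human_votes))  # 3 of 4 kept pairs agree -> 0.75
```

Excluding ties raises the bar for what counts as a comparison, so an agreement rate reported this way is not directly comparable to one computed over all votes.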

A 50B-parameter LLM built on a ~700B-token corpus, specialized for financial NLP

0.60

0.45

0.80

299

A mid-size LLM trained with a large curated finance corpus yields big real-world gains on finance tasks while staying useful on general tasks, so firms can get domain accuracy without running huge models.

Key finding

Mixed training (curated finance + public data) yields strong finance performance without losing general abilities

Numbers: Training corpus: 363B financial + 345B public ≈ 708B tokens; trained on 569B tokens

A practical survey of how, where and what to test in large language models

0.70

0.40

0.60

195

Evaluation decides whether an LLM is fit for purpose: pick task‑specific tests, measure robustness and safety, and combine automated and human checks before deployment.

Key finding

No single benchmark or protocol reliably ranks all LLM capabilities.

Numbers: 46 popular benchmarks compiled (Sec.4, Table 7)

A systematic benchmark showing where GPT-style LLMs help — and where they fail — on practical chemistry tasks

0.40

0.35

0.50

91

LLMs can speed up human-in-the-loop chemistry tasks (text descriptions, candidate generation, reagent ranking) with few-shot prompts, but they are not yet reliable drop-in replacements for specialized models or automation pipelines where exact SMILES or reaction outcomes are needed.

Key finding

GPT-4 ranks best across the eight chemistry tasks.

Numbers: Average rank: GPT-4 = 1.25 (Table 2).
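An average rank of 1.25 over eight tasks means the per-task ranks sum to 10, for example six first places and two second places. The sketch below shows one way such an average could be computed; the model names and scores are made up, and the tie handling (rank = 1 + number of strictly better models) is an assumption, not necessarily the paper's rule.

```python
from typing import Dict, List


def average_ranks(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean per-task rank for each model (1 = best, higher score is better).

    Assumption: a model's rank on a task is 1 plus the number of models
    with a strictly higher score on that task.
    """
    models = list(scores)
    n_tasks = len(scores[models[0]])
    if any(len(scores[m]) != n_tasks for m in models):
        raise ValueError("every model needs one score per task")
    totals = {m: 0 for m in models}
    for t in range(n_tasks):
        for m in models:
            totals[m] += 1 + sum(scores[o][t] > scores[m][t] for o in models)
    return {m: totals[m] / n_tasks for m in models}


# Toy scores on eight hypothetical tasks (not the benchmark's data).
toy_scores = {
    "model_a": [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.5, 0.5],
    "model_b": [0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.6, 0.6],
    "model_c": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
}
print(average_ranks(toy_scores))
# {'model_a': 1.25, 'model_b': 1.75, 'model_c': 3.0}
```

Average rank hides the size of the gaps between models, so it is worth reading alongside the per-task scores.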

C-EVAL: 13.9k Chinese multiple-choice exam questions across 52 subjects, plus a HARD subset for advanced reasoning

0.70

0.50

0.60

90

C-EVAL exposes where LLMs serving Chinese-language users fail on domain knowledge and hard reasoning; testing with it reveals gaps to fix before product launch.

Key finding

Only GPT-4 exceeds 60% average accuracy on C-EVAL.

Numbers: GPT-4 average accuracy 66.4% (zero-shot AO, Table 3)

A concise roadmap to multimodal LLMs: architectures, training recipes, evaluation, hallucination, and extensions

0.60

0.40

0.60

85

MLLMs let products combine vision and language: build image-aware assistants, document parsers, or multimodal agents. Focus on data quality, connector design, and safe alignment to reduce hallucinations before shipping.

Key finding

MLLMs are typically built from three modules: a pre-trained modality encoder, a pre-trained LLM, and a connector between them.
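The three-module layout can be sketched as a simple composition. Everything below is a toy stand-in: the class name, the patch-splitting encoder, the fixed linear projection used as the connector, and the string-returning "LLM" are assumptions for illustration, not any real model's API.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class ToyMLLM:
    encoder: Callable[[List[float]], List[Vector]]  # modality encoder: raw input -> feature vectors
    connector: Callable[[Vector], Vector]           # maps each feature into the LLM's input space
    llm: Callable[[List[Vector], str], str]         # consumes projected features plus a text prompt

    def generate(self, image: List[float], prompt: str) -> str:
        features = self.encoder(image)
        visual_tokens = [self.connector(f) for f in features]
        return self.llm(visual_tokens, prompt)


def toy_encoder(pixels: List[float]) -> List[Vector]:
    # Split the flat pixel list into 2-value "patches".
    return [pixels[i:i + 2] for i in range(0, len(pixels), 2)]


# Fixed 3x2 weight matrix: a linear projection from width 2 to width 3.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def toy_connector(feature: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, feature)) for row in W]


def toy_llm(visual_tokens: List[Vector], prompt: str) -> str:
    # Stand-in for a language model: just reports what it received.
    return f"{prompt} [{len(visual_tokens)} visual tokens of width {len(visual_tokens[0])}]"


model = ToyMLLM(toy_encoder, toy_connector, toy_llm)
print(model.generate([0.1, 0.2, 0.3, 0.4], "Describe the image."))
# Describe the image. [2 visual tokens of width 3]
```

The point of the split is that the encoder and the LLM can stay pre-trained while the comparatively small connector carries the job of aligning the two representations.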

OpenAGI: an open platform that lets LLMs plan and call specialist models to solve multi-step tasks

0.50

0.60

0.45

76

OpenAGI shows you can compose existing specialist models under LLM control and use RL-style tuning to make smaller, cheaper models competitive—useful for building product workflows that call vision, text, or web tools.

Key finding

A large, general LLM (GPT-4) achieves the highest overall OpenAGI scores in zero/few-shot.

Numbers: GPT-4 overall: 0.2378 (zero) -> 0.5281 (few)

Instruction tuning, not model size, drives LLM zero-shot news summarization; benchmark references are often worse than generated summaries.

0.60

0.35

0.40

64

If you want usable zero-shot news summaries, use instruction-tuned LLMs rather than the largest-parameter models; validate or replace public benchmark references before trusting automatic metrics.

Key finding

Instruction tuning yields much stronger zero-shot summarization than model scale.

Numbers: Zero-shot Instruct Davinci faithfulness 0.99 vs GPT-3 175B faithfulness 0.76 on CNN/DM (Table 2)

ToolBench + DFSDT + retriever teach LLaMA-2 to use 16k+ real REST APIs with ChatGPT-based annotation and evaluation

0.70

63

If you build assistants that call external services, training on many real APIs plus a retriever and multi-path planning dramatically reduces manual engineering and makes open-source models practically competitive with closed systems.

Key finding

ToolBench covers 16,464 real REST APIs and 126,486 labeled instruction→solution pairs.

Numbers: 16,464 APIs; 126,486 instances; 469,585 real API calls

A single-source survey of how we test LLMs: benchmarks, gaps, and practical directions

0.60

0.40

0.60

61

LLM evaluations show accuracy alone is insufficient: businesses must test truthfulness, bias, tool use, and robustness to avoid legal risks, bad UX, or harmful outputs.

Key finding

Public adoption exploded: ChatGPT reached 100 million users within two months of launch.

Numbers: 100M users in two months

A practical benchmark and playbook showing LLMs can speed social-science labeling and generate useful explanations, but not fully replace experts

0.60

0.50

0.60

61

LLMs can cut labeling time by producing reasonable draft labels and high-quality draft explanations; use them to scale annotation and speed exploratory analysis while keeping humans in the loop to validate and correct outputs.

Key finding

Zero-shot LLMs sometimes match or exceed human agreement on specific classification tasks.

Numbers: Misinfo F1=77.4, κ=0.55 vs human κ=0.51
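The κ values above are chance-corrected agreement scores. As a rough illustration, the sketch below computes Cohen's kappa for two label sequences; it assumes the two-rater Cohen variant, which may not be the exact statistic the paper reports, and the labels are made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each rater labeled at random with their own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Made-up binary labels for ten items (not data from the paper).
model_labels = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
human_labels = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
print(round(cohens_kappa(model_labels, human_labels), 3))
```

In this toy case the raters agree on 8 of 10 items, but half of that agreement is expected by chance, so kappa comes out near 0.6 rather than 0.8.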

LLMs excel at simple sentiment tasks but struggle with fine-grained, structured sentiment extraction

0.60

0.40

0.60

55

Off-the-shelf LLMs can replace expensive labeling for basic sentiment tasks and speed up pilot projects, but structured extraction and safety-sensitive detection still need specialist models or human review.

Key finding

LLMs match fine-tuned small models on simple sentiment classification in zero-shot.

Numbers: ChatGPT ≈97% of T5 performance on SC tasks (paper text).

Practical review of data, training, and evaluation methods to align LLMs with human preferences

0.60

0.40

0.70

54

Aligning LLMs reduces risky outputs and increases usefulness; using parameter-efficient tuning cuts compute costs and enables faster iteration.

Key finding

Small sets of high-quality instructions can suffice to produce alignment effects.

Numbers: LLaMA needs ~8K instructions (IFS); other work reports ~6K high-quality instructions

SEED-Bench: a 19K, 12-dimension multiple-choice benchmark for testing image and video LLM comprehension

0.40

0.45

0.30

52

SEED-Bench gives a large, objective test to reveal real weaknesses in multimodal models (OCR, spatial relations, temporal reasoning), so businesses should validate models on similar slices before deploying image/video features.

Key finding

SEED-Bench contains 19K human-verified multiple-choice questions across 12 dimensions.

Numbers: 19,242 questions; 12 dimensions

RGB: a bilingual benchmark diagnosing how LLMs fail when using retrieved evidence

0.40

0.30

0.25

52

RAG can improve factuality, but retrieved noise and false facts cause wrong outputs and missed refusals, risking user trust and legal/brand exposure in production.

Key finding

Adding noisy retrieved documents lowers answer accuracy for all tested LLMs.

Numbers: ChatGPT accuracy 96.33% → 76.00% (noise ratio 0→0.8)
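The reported drop is 20.33 percentage points as the noise ratio goes from 0 to 0.8. To run a similar stress test on your own RAG pipeline, one option is to build each context with a fixed share of irrelevant documents. The sketch below is a hypothetical helper under that assumption, not the benchmark's own construction code; the function name and document strings are invented.

```python
import random
from typing import List


def mix_documents(relevant: List[str], noisy: List[str], k: int,
                  noise_ratio: float, seed: int = 0) -> List[str]:
    """Return k documents of which about noise_ratio * k are noise."""
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be between 0 and 1")
    # Nearest whole number of noisy documents (Python rounds halves to even).
    n_noisy = round(k * noise_ratio)
    n_relevant = k - n_noisy
    if n_relevant > len(relevant) or n_noisy > len(noisy):
        raise ValueError("not enough documents to sample from")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    docs = rng.sample(relevant, n_relevant) + rng.sample(noisy, n_noisy)
    rng.shuffle(docs)  # avoid always placing relevant documents first
    return docs


# Hypothetical document pools for illustration.
relevant_docs = ["rel_1", "rel_2", "rel_3"]
noisy_docs = ["noise_1", "noise_2", "noise_3", "noise_4", "noise_5"]
context = mix_documents(relevant_docs, noisy_docs, k=5, noise_ratio=0.8)
print(context)
```

Sweeping `noise_ratio` over a few values and plotting accuracy against it gives a quick picture of how gracefully a model degrades as retrieval quality falls.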

PaperQA: an agentic RAG that retrieves full-text papers, cites sources, and matches experts on a new LitQA benchmark

0.70

0.55

0.80

51

PaperQA shows agentic retrieval plus LLMs can deliver near-expert literature answers with reliable citations and low cost per query, making it practical for automated literature triage, fast reviews, and decision support.

Key finding

PaperQA achieves 69.5% accuracy on LitQA, slightly above human experts.

Numbers: PaperQA 69.5% vs Human 66.8% (LitQA, Table 2)

Large multilingual evaluation shows ChatGPT is strong at grammar but weak at multilingual semantic tasks

0.30

0.40

0.60

51

ChatGPT zero-shot is good for quick grammar-level tasks (like POS tagging) but not reliable for production semantic tasks across many languages; invest in task- and language-specific models for higher accuracy and lower operational risk.

Key finding

ChatGPT generally underperforms supervised task-specific models on semantic multilingual tasks.

Numbers: XNLI avg acc: ChatGPT (en) 57.0% vs mT5-XXL 87.1%

GPT-4 can pass Japan's medical licensing exam but shows costly localization and safety gaps

0.50

0.40

0.70

50

LLMs can meet exam-level MCQ performance in non-English, specialized domains but need localization, safety filters, and higher budget due to tokenization and legal differences.

Key finding

GPT-4 passes all six years of the Japanese medical licensing exam (2018–2023) in closed-book multiple-choice format.

Numbers: 2018: required 161, general 221 (passing 160/208); Table 1

PIXIU: open financial LLM + 136K instruction examples and FLARE benchmark

0.60

0.50

43

Open domain-tuned models and labeled instruction data lower the bar to build finance-specific AI: cheaper customization, reproducible evaluation, and better performance on common text tasks; numeric QA and trading signals still need extra work.

Key finding

They built FIT with 136,609 instruction‑tuning examples across 5 tasks and 9 datasets.

Numbers: 136,609 samples; 5 tasks; 9 datasets

Survey of 23 LLM benchmarks finds widespread blind spots; recommends behavioral profiling and audits

0.40

0.60

42

Benchmark scores can mislead product decisions if they reflect memorization, prompt sensitivity, or English-only tests; firms should test models under realistic prompts, languages, and safety scenarios.

Key finding

Response variability breaks standardized tests

Numbers: 22/23 benchmarks showed sensitivity

Systematic benchmark: GPT-series and LLaMA variants vs. fine-tuned BioNLP models across 12 biomedical tasks

0.60

0.40

0.70

41

If you need high-accuracy extraction or classification in biomedical text, fine-tuned domain models remain the practical choice; use GPT-4 for reasoning or prototyping high-level QA but budget for much higher inference costs and add output validation.

Key finding

Fine-tuned, domain-specific models still outperform zero- and few-shot LLMs on most BioNLP tasks.

Numbers: Macro-average: SOTA fine-tuned 0.6536 vs. best LLM zero/few-shot ~0.51
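A macro-average weights every task equally, regardless of how many examples each task contains. The sketch below contrasts it with an example-weighted average; the scores and task sizes are made up, and the function names are hypothetical.

```python
from typing import Sequence


def macro_average(task_scores: Sequence[float]) -> float:
    """Unweighted mean of per-task scores: each task counts once."""
    if not task_scores:
        raise ValueError("need at least one task score")
    return sum(task_scores) / len(task_scores)


def example_weighted_average(task_scores: Sequence[float],
                             task_sizes: Sequence[int]) -> float:
    """Mean of per-task scores weighted by the number of examples per task."""
    if len(task_scores) != len(task_sizes) or not task_scores:
        raise ValueError("need one size per task score")
    total = sum(task_sizes)
    return sum(s * n for s, n in zip(task_scores, task_sizes)) / total


# Toy numbers: one large easy task and two small hard ones.
toy_task_scores = [0.9, 0.5, 0.4]
toy_task_sizes = [1000, 100, 100]
print(round(macro_average(toy_task_scores), 3))
print(round(example_weighted_average(toy_task_scores, toy_task_sizes), 3))
```

Here the macro-average is about 0.6 while the example-weighted average is about 0.825, which shows why a macro-average is the more conservative summary when a few large tasks dominate the test set.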

ChemLLM: a 7B chemistry-tuned LLM with ChemData (7M Q&A) and ChemBench (4.1k MCQs), matching GPT-4 on core chemical tasks

0.60

0.70

40

A domain-tuned 7B model can match or beat much larger closed models on key chemistry tasks, enabling lower-cost deployment of chemistry assistants and search tools for R&D teams.

Key finding

ChemData provides 7M instruction-style Q&A pairs for chemistry tuning.

Numbers: 7M instruction Q&A (authors' dataset summary)

ToolQA — a benchmark that forces LLMs to use external tools, not memorized facts

0.30

0.40

0.20

39

If your product must use live or private data, you need tested tool integration and source selection; relying on a base LLM risks wrong or outdated answers.

Key finding

Standard LLMs that do not use external tools fail on ToolQA.

Numbers: ChatGPT avg success: 5.6% (easy), ~2% (hard)

Open-source toolkit, benchmark, and a retrieval-augmented LLM that proves Lean theorems using one GPU-week of compute

0.60

0.65

0.70

38

LeanDojo lowers the entry cost for ML research on formal proofs: open data and code let teams reproduce and iterate on provers with a single GPU-week instead of thousands of GPU-days.

Key finding

Retrieval improves end-to-end proving rates.

Numbers: ReProver Pass@1 51.2% vs non-retrieval baseline 47.6% (random split)