ControlBench: evaluating GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra on 147 undergraduate control problems

Overview

Decision Snapshot: Needs Validation

The paper builds a realistic benchmark and human-graded evaluation, but results are limited to three closed commercial models and rely on expert scoring rather than automated ground truth for all items.

Citations: 19

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Darioush Kevian, Usman Syed, Xingang Guo, Aaron Havens, Geir Dullerud, Peter Seiler, Lianhui Qin, Bin Hu

Links

Abstract / PDF / Data

Why It Matters For Business

Text LLMs can help generate control designs and explanations quickly, but they commonly make calculation and plot-reading errors, so use them for drafts and human-in-the-loop workflows, not final safety-critical designs.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, Product Manager, CTO

Summary TLDR

The authors introduce ControlBench, a 147-question dataset of undergraduate control problems (26 with plots), and test GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra. Claude 3 Opus performs best (58.5% accuracy, rising to 68.7% with a self-check prompt), GPT-4 follows at 45.6%, and Gemini 1.0 Ultra trails at 34.0%. All models struggle with reading Bode, Nyquist, and root-locus plots and make arithmetic and symbolic mistakes. Simple self-check prompts and tool-based arithmetic can improve results, but these LLMs are not yet reliable for unsupervised, high-stakes control work.

Problem Statement

Can state-of-the-art LLMs solve typical undergraduate control-engineering problems? The paper builds ControlBench (147 problems, 26 of them with plots), then measures model accuracy, failure modes, and how much self-checking helps.

Main Contribution

Created ControlBench: 147 undergraduate control problems covering stability, time response, design, and plots.

Measured GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra on ControlBench with human expert grading.

Key Findings

Claude 3 Opus outperforms GPT-4 and Gemini on ControlBench.

Numbers: ACC 58.5% (86/147), ACC-s 68.7% (101/147)

Practical Use: Use Claude 3 Opus as the best off-the-shelf LLM tested for textbook control tasks, but validate outputs with tools or experts.

Evidence Ref: Table 2

All models struggle on problems that require reading plots (Bode, Nyquist, root-locus).

Numbers: 26 visual problems total; Bode ACC is low (e.g., Claude 13.3% (2/15))

Practical Use: Do not rely on these LLMs to interpret control plots; add a vision-focused pipeline or numerical descriptors for plots.

Evidence Ref: Table 1 and Table 2 (Bode Analysis row)
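One way to read "numerical descriptors for plots" is to hand the model a few sampled frequency-response values as text instead of a Bode image. The sketch below is only an illustration of that idea, not the paper's method; the helper names (`polyval`, `bode_point`, `describe`) and the plant G(s) = 1/(s + 1) are assumptions made for the example.

```python
import cmath
import math

def polyval(coeffs, s):
    """Evaluate a polynomial (highest power first) at s using Horner's rule."""
    result = 0
    for c in coeffs:
        result = result * s + c
    return result

def bode_point(num, den, omega):
    """Return (magnitude in dB, phase in degrees) of G(j*omega) = num/den."""
    g = polyval(num, 1j * omega) / polyval(den, 1j * omega)
    return 20 * math.log10(abs(g)), math.degrees(cmath.phase(g))

def describe(num, den, omegas):
    """Format frequency-response samples as plain text suitable for a prompt."""
    lines = []
    for w in omegas:
        mag_db, phase_deg = bode_point(num, den, w)
        lines.append(f"omega={w:g} rad/s: |G|={mag_db:.2f} dB, phase={phase_deg:.1f} deg")
    return "\n".join(lines)

# Hypothetical plant G(s) = 1 / (s + 1), chosen only for illustration.
print(describe([1], [1, 1], [0.1, 1, 10]))
```

Note that `cmath.phase` returns the principal value in [-180, 180] degrees, so higher-order systems would need phase unwrapping before the samples match a conventional Bode plot.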

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ControlBench ACC	Claude 58.5% (86/147); GPT-4 45.6% (67/147); Gemini 34.0% (50/147)	—	—	ControlBench (147 problems)	Table 2 overall totals	Table 2
ControlBench ACC-s (after self-check)	Claude 68.7% (101/147); GPT-4 47.6% (70/147); Gemini 38.8% (57/147)	—	—	ControlBench (self-checked)	Table 2 overall totals	Table 2
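
The Delta column above is empty, but the self-check gain can be recomputed from the reported counts. The following is a small sketch of that arithmetic; the dictionary layout and function names are illustrative choices, and the counts are copied from the two rows above.

```python
# Correct-answer counts out of 147, copied from the Results rows above.
TOTAL = 147
correct = {
    "Claude 3 Opus": {"acc": 86, "acc_s": 101},
    "GPT-4": {"acc": 67, "acc_s": 70},
    "Gemini 1.0 Ultra": {"acc": 50, "acc_s": 57},
}

def pct(count, total=TOTAL):
    """Percentage of correct answers, rounded to one decimal place."""
    return round(100 * count / total, 1)

def self_check_gain(counts, total=TOTAL):
    """Percentage-point gain from ACC to ACC-s, computed from raw counts."""
    return round(100 * (counts["acc_s"] - counts["acc"]) / total, 1)

for model, counts in correct.items():
    print(model, pct(counts["acc"]), pct(counts["acc_s"]), self_check_gain(counts))
```

Computed this way, the self-check gain is about 10.2 percentage points for Claude 3 Opus, 2.0 for GPT-4, and 4.8 for Gemini 1.0 Ultra.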

What To Try In 7 Days

Run Claude 3 Opus on a small set of textbook control tasks and compare to domain experts.

Add a 'check your answer' follow-up prompt to see immediate accuracy gains.

Pipe numeric steps to an external calculator or small Python/Matlab verifier for arithmetic/symbolic checks.
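
As a rough sketch of what a "small Python verifier" could look like, the code below checks whether poles claimed by a model actually satisfy a characteristic polynomial, and whether they imply continuous-time stability. It is an assumption-laden illustration rather than anything from the paper: the function names, the tolerance, and the example polynomial s^2 + 3s + 2 are all hypothetical.

```python
def polyval(coeffs, s):
    """Evaluate a polynomial (highest power first) at s using Horner's rule."""
    result = 0
    for c in coeffs:
        result = result * s + c
    return result

def check_claimed_poles(char_poly, claimed_poles, tol=1e-6):
    """Return the claimed poles that do NOT satisfy char_poly(s) = 0 within tol."""
    return [p for p in claimed_poles if abs(polyval(char_poly, p)) > tol]

def all_stable(poles):
    """Continuous-time stability test: every pole has a negative real part."""
    return all(complex(p).real < 0 for p in poles)

# Hypothetical example: s^2 + 3s + 2 = (s + 1)(s + 2).
char_poly = [1, 3, 2]
good_answer = [-1, -2]
bad_answer = [-1, -3]  # simulated arithmetic slip in a model's answer

print(check_claimed_poles(char_poly, good_answer))  # []
print(check_claimed_poles(char_poly, bad_answer))   # [-3]
print(all_stable(good_answer))                      # True
```

Any rejected pole could then be quoted back to the model in a 'check your answer' follow-up. The check only confirms that each claimed value is a root; it does not confirm that every root was listed or that multiplicities are right.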

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://agi4engineering.github.io/LLM4Control/

Risks & Boundaries

Limitations

Dataset focuses on classical undergraduate problems; not exhaustive for advanced/robust/nonlinear control.

Models evaluated are closed-source commercial systems; internal prompts and exact versions vary.

When Not To Use

For unsupervised design of safety-critical controllers.

Where accurate interpretation of plots (Bode/Nyquist/root-locus) is required.

Failure Modes

Misreading graphical plots (Bode, Nyquist, root-locus)

Arithmetic and symbolic calculation errors

Core Entities

Models

GPT-4, Claude 3 Opus, Gemini 1.0 Ultra

Metrics

Accuracy

Datasets

ControlBench, ControlBench-C
