ControlBench: evaluate GPT-4, Claude 3 Opus, Gemini on 147 undergraduate control problems

April 4, 20246 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

19

Authors

Darioush Kevian, Usman Syed, Xingang Guo, Aaron Havens, Geir Dullerud, Peter Seiler, Lianhui Qin, Bin Hu

Links

Abstract / PDF

Why It Matters For Business

Text LLMs can help generate control designs and explanations quickly, but they commonly make calculation and plot-reading errors, so use them for drafts and human-in-the-loop workflows, not final safety-critical designs.

Summary TLDR

The authors introduce ControlBench, a 147-question dataset of undergraduate control problems (26 with plots), and test GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra. Claude 3 Opus performs best (≈58.5% accuracy, rising with a self-check prompt), GPT-4 is middling, and Gemini trails. All models struggle with reading Bode/Nyquist/root-locus plots and make arithmetic/symbolic mistakes. Simple self-check prompts and tool-based arithmetic can improve results, but these LLMs are not yet reliable for unsupervised, high-stakes control work.

Problem Statement

Can state-of-the-art LLMs solve typical undergraduate control-engineering problems? The paper builds ControlBench (147 problems, many with plots), then measures model accuracy, failure modes, and how much self-checking helps.

Main Contribution

Created ControlBench: 147 undergraduate control problems covering stability, time response, design, and plots.

Measured GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra on ControlBench with human expert grading.

Introduced ControlBench-C: 100 multiple-choice items for fast automatic testing by non-experts.

Analyzed failure modes: calculation errors, reasoning errors, misreading plots, prompt sensitivity, and self-correction effects.

Key Findings

Claude 3 Opus outperforms GPT-4 and Gemini on ControlBench.

NumbersACC 58.5% (86/147), ACC-s 68.7% (101/147)

All models struggle on problems that require reading plots (Bode, Nyquist, root-locus).

Numbers26 visual problems total; Bode ACC low (e.g., Claude 13.3% (2/15))

Self-check prompts materially improve performance, especially for Claude 3 Opus.

NumbersPaper reports a 13.6% accuracy improvement for Claude after self-correction (sec. 4.4)

Raw accuracy across models is far from perfect on ControlBench.

NumbersGPT-4 ACC 45.6% (67/147); Gemini ACC 34.0% (50/147)

Results

ControlBench ACC

ValueClaude 58.5% (86/147); GPT-4 45.6% (67/147); Gemini 34.0% (50/147)

ControlBench ACC-s (after self-check)

ValueClaude 68.7% (101/147); GPT-4 47.6% (70/147); Gemini 38.8% (57/147)

ControlBench-C ACC / ACC-s

ValueGPT-4 ACC 64.0% (64/100), ACC-s 78.0% (78/100); Claude ACC 59.0%, ACC-s 83.0%; Gemini ACC 56.0%, ACC-s 75.0%

Bode Analysis (example of visual failures)

ValueClaude ACC 13.3% (2/15); GPT-4 6.66% (1/15); Gemini 6.66% (1/15)

Who Should Care

What To Try In 7 Days

Run Claude 3 Opus on a small set of textbook control tasks and compare to domain experts.

Add a 'check your answer' follow-up prompt to see immediate accuracy gains.

Pipe numeric steps to an external calculator or small Python/Matlab verifier for arithmetic/symbolic checks.

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Dataset focuses on classical undergraduate problems; not exhaustive for advanced/robust/nonlinear control.
  • Models evaluated are closed-source commercial systems; internal prompts and exact versions vary.
  • Visual problems underrepresented by model capabilities; vision-language gaps remain.
  • Human grading introduces subjectivity and slows scale of evaluation.

When Not To Use

  • For unsupervised design of safety-critical controllers.
  • Where accurate interpretation of plots (Bode/Nyquist/root-locus) is required.
  • When exact symbolic or arithmetic correctness is mandatory.

Failure Modes

  • Misreading graphical plots (Bode, Nyquist, root-locus)
  • Arithmetic and symbolic calculation errors
  • Reasoning errors leading to incorrect design choices
  • Inconsistent outputs across prompt variants and samplings
  • High model confidence when answers are wrong

Core Entities

Models

  • GPT-4
  • Claude 3 Opus
  • Gemini 1.0 Ultra

Metrics

  • Accuracy

Datasets

  • ControlBench
  • ControlBench-C