Overview
The paper builds a realistic benchmark and human-graded evaluation, but results are limited to three closed commercial models and rely on expert scoring rather than automated ground truth for all items.
Citations19
Evidence Strength0.60
Confidence0.85
Risk Signals12
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
Text LLMs can help generate control designs and explanations quickly, but they commonly make calculation and plot-reading errors, so use them for drafts and human-in-the-loop workflows, not final safety-critical designs.
Who Should Care
Summary TLDR
The authors introduce ControlBench, a 147-question dataset of undergraduate control problems (26 with plots), and test GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra. Claude 3 Opus performs best (≈58.5% accuracy, rising with a self-check prompt), GPT-4 is middling, and Gemini trails. All models struggle with reading Bode/Nyquist/root-locus plots and make arithmetic/symbolic mistakes. Simple self-check prompts and tool-based arithmetic can improve results, but these LLMs are not yet reliable for unsupervised, high-stakes control work.
Problem Statement
Can state-of-the-art LLMs solve typical undergraduate control-engineering problems? The paper builds ControlBench (147 problems, many with plots), then measures model accuracy, failure modes, and how much self-checking helps.
Main Contribution
Created ControlBench: 147 undergraduate control problems covering stability, time response, design, and plots.
Measured GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra on ControlBench with human expert grading.
Key Findings
Claude 3 Opus outperforms GPT-4 and Gemini on ControlBench.
All models struggle on problems that require reading plots (Bode, Nyquist, root-locus).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ControlBench ACC | Claude 58.5% (86/147); GPT-4 45.6% (67/147); Gemini 34.0% (50/147) | — | — | ControlBench (147 problems) | Table 2 overall totals | Table 2 |
| ControlBench ACC-s (after self-check) | Claude 68.7% (101/147); GPT-4 47.6% (70/147); Gemini 38.8% (57/147) | — | — | ControlBench (self-checked) | Table 2 overall totals | Table 2 |
What To Try In 7 Days
Run Claude 3 Opus on a small set of textbook control tasks and compare to domain experts.
Add a 'check your answer' follow-up prompt to see immediate accuracy gains.
Pipe numeric steps to an external calculator or small Python/Matlab verifier for arithmetic/symbolic checks.
Reproducibility
Risks & Boundaries
Limitations
Dataset focuses on classical undergraduate problems; not exhaustive for advanced/robust/nonlinear control.
Models evaluated are closed-source commercial systems; internal prompts and exact versions vary.
When Not To Use
For unsupervised design of safety-critical controllers.
Where accurate interpretation of plots (Bode/Nyquist/root-locus) is required.
Failure Modes
Misreading graphical plots (Bode, Nyquist, root-locus)
Arithmetic and symbolic calculation errors

