Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
19
Why It Matters For Business
Text LLMs can help generate control designs and explanations quickly, but they commonly make calculation and plot-reading errors, so use them for drafts and human-in-the-loop workflows, not final safety-critical designs.
Summary TLDR
The authors introduce ControlBench, a 147-question dataset of undergraduate control problems (26 with plots), and test GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra. Claude 3 Opus performs best (≈58.5% accuracy, rising with a self-check prompt), GPT-4 is middling, and Gemini trails. All models struggle with reading Bode/Nyquist/root-locus plots and make arithmetic/symbolic mistakes. Simple self-check prompts and tool-based arithmetic can improve results, but these LLMs are not yet reliable for unsupervised, high-stakes control work.
Problem Statement
Can state-of-the-art LLMs solve typical undergraduate control-engineering problems? The paper builds ControlBench (147 problems, many with plots), then measures model accuracy, failure modes, and how much self-checking helps.
Main Contribution
Created ControlBench: 147 undergraduate control problems covering stability, time response, design, and plots.
Measured GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra on ControlBench with human expert grading.
Introduced ControlBench-C: 100 multiple-choice items for fast automatic testing by non-experts.
Analyzed failure modes: calculation errors, reasoning errors, misreading plots, prompt sensitivity, and self-correction effects.
Key Findings
Claude 3 Opus outperforms GPT-4 and Gemini on ControlBench.
All models struggle on problems that require reading plots (Bode, Nyquist, root-locus).
Self-check prompts materially improve performance, especially for Claude 3 Opus.
Raw accuracy across models is far from perfect on ControlBench.
Results
ControlBench ACC
ControlBench ACC-s (after self-check)
ControlBench-C ACC / ACC-s
Bode Analysis (example of visual failures)
Who Should Care
What To Try In 7 Days
Run Claude 3 Opus on a small set of textbook control tasks and compare to domain experts.
Add a 'check your answer' follow-up prompt to see immediate accuracy gains.
Pipe numeric steps to an external calculator or small Python/Matlab verifier for arithmetic/symbolic checks.
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Dataset focuses on classical undergraduate problems; not exhaustive for advanced/robust/nonlinear control.
- Models evaluated are closed-source commercial systems; internal prompts and exact versions vary.
- Visual problems underrepresented by model capabilities; vision-language gaps remain.
- Human grading introduces subjectivity and slows scale of evaluation.
When Not To Use
- For unsupervised design of safety-critical controllers.
- Where accurate interpretation of plots (Bode/Nyquist/root-locus) is required.
- When exact symbolic or arithmetic correctness is mandatory.
Failure Modes
- Misreading graphical plots (Bode, Nyquist, root-locus)
- Arithmetic and symbolic calculation errors
- Reasoning errors leading to incorrect design choices
- Inconsistent outputs across prompt variants and samplings
- High model confidence when answers are wrong
Core Entities
Models
- GPT-4
- Claude 3 Opus
- Gemini 1.0 Ultra
Metrics
- Accuracy
Datasets
- ControlBench
- ControlBench-C

