Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
3
Why It Matters For Business
ElecBench gives a clear, task-focused way to compare LLMs on power-dispatch needs. Use it to pick models, spot weaknesses (math, hallucination, stability), and design safe human-in-the-loop workflows.
Summary TLDR
ElecBench is a new, public benchmark that measures how well large language models (LLMs) handle power-system tasks. It defines six primary metrics (factuality, logicality, stability, security, fairness, expressiveness) broken into 24 submetrics, builds a mixed test set (public sources, private professional text, simulations, and generated hallucination cases), and evaluates several LLMs (GPT-3.5, GPT-4, LLaMA2 family, GAIA family) on three core scenarios (general, dispatch, fault/black-start). Key takeaways: GPT-4 leads in reasoning and security on this test set; domain model GAIA-70B shows better stability in some tasks; smaller LLaMA models struggle with stability and math. The dataset and
Problem Statement
Power-system work needs numeric precision, simulation-aware scenarios, and domain knowledge. Existing general LLM benchmarks miss these needs. The paper builds a focused benchmark and dataset to measure LLMs on real power-dispatch tasks and related safety metrics.
Main Contribution
A domain-specific evaluation framework with six primary metrics and 24 submetrics tailored to power-system operation.
A mixed test set combining public problems, private professional text, programmatic simulations, and intentionally generated hallucination cases; released on GitHub.
An evaluation pipeline that combines dual GPT-4 scoring with human adjudication and simulation-based scoring.
A comparative evaluation of major LLMs (GPT-3.5, GPT-4, LLaMA2 7/13/70B, GAIA 7/13/70B) across three scenarios: general, dispatch, and black-start.
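The dual-judge-plus-adjudication idea in the evaluation pipeline can be illustrated with a minimal sketch. The summary does not say how the two GPT-4 scores are combined or when a human steps in, so the averaging rule, the `DISAGREEMENT_THRESHOLD` value, and all names below are assumptions made for illustration, not the paper's procedure.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: the source does not specify how judge
# disagreement is detected or resolved.
DISAGREEMENT_THRESHOLD = 1.0


@dataclass
class ItemScore:
    item_id: str
    judge_a: float                 # score from the first automatic judge
    judge_b: float                 # score from the second automatic judge
    human: Optional[float] = None  # set only when a human adjudicates


def needs_adjudication(s: ItemScore,
                       threshold: float = DISAGREEMENT_THRESHOLD) -> bool:
    """Flag items where the two automatic judges differ by more than the threshold."""
    return abs(s.judge_a - s.judge_b) > threshold


def final_score(s: ItemScore,
                threshold: float = DISAGREEMENT_THRESHOLD) -> float:
    """Average the judges when they agree; otherwise defer to the human score."""
    if needs_adjudication(s, threshold):
        if s.human is None:
            raise ValueError(f"item {s.item_id} needs human adjudication")
        return s.human
    return (s.judge_a + s.judge_b) / 2


agreed = ItemScore("q1", judge_a=4.0, judge_b=4.5)
disputed = ItemScore("q2", judge_a=2.0, judge_b=5.0, human=3.0)
print(final_score(agreed), final_score(disputed))
```

Raising an error for unadjudicated disagreements, rather than silently averaging, keeps disputed items from leaking into aggregate scores.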
Key Findings
GPT-4 scores highest on reasoning and security in ElecBench's general tests.
GAIA-70B shows stronger domain stability and good factuality versus other models.
LLaMA2-70B achieves high factuality and stability but weaker expressiveness.
Math and hallucination subtests show wide variance between models.
Results
GPT-4 logicality (general)
GAIA-70B stability (general)
LLaMA-70B factuality (general)
Who Should Care
What To Try In 7 Days
Run ElecBench tests on your LLM to see domain gaps (use the public repo).
Add numeric verification steps for any LLM outputs used in dispatch decisions.
Compare a general LLM (GPT-4) vs a domain model (GAIA-70B) on your core workflows to pick a candidate for trial.
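One way to approach the numeric verification step is a deterministic check that runs on every LLM-proposed dispatch before an operator sees it. The sketch below is an illustration under stated assumptions: it ignores network losses and constraints, the `Unit` type, the `verify_dispatch` function, and the 0.5 MW tolerance are hypothetical, and none of it comes from ElecBench itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Unit:
    name: str
    p_min_mw: float  # minimum stable output
    p_max_mw: float  # maximum output


def verify_dispatch(units: List[Unit],
                    proposed_mw: Dict[str, float],
                    demand_mw: float,
                    tol_mw: float = 0.5) -> List[str]:
    """Return a list of problems; an empty list means the checks passed.

    Assumption: losses are ignored, so total generation must match demand.
    """
    problems = []
    known = {u.name for u in units}
    for u in units:
        if u.name not in proposed_mw:
            problems.append(f"{u.name}: no setpoint given")
            continue
        p = proposed_mw[u.name]
        if not (u.p_min_mw <= p <= u.p_max_mw):
            problems.append(
                f"{u.name}: {p} MW outside [{u.p_min_mw}, {u.p_max_mw}]")
    for name in sorted(set(proposed_mw) - known):
        problems.append(f"{name}: unknown unit")
    total = sum(proposed_mw.values())
    if abs(total - demand_mw) > tol_mw:
        problems.append(f"total {total} MW does not match demand {demand_mw} MW")
    return problems


units = [Unit("G1", 50.0, 200.0), Unit("G2", 20.0, 100.0)]
print(verify_dispatch(units, {"G1": 150.0, "G2": 90.0}, demand_mw=240.0))
print(verify_dispatch(units, {"G1": 150.0, "G2": 120.0}, demand_mw=240.0))
```

A check like this catches arithmetic slips and limit violations, but it is no substitute for a power-flow study or operator review.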
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Some test items and scenario design use GPT-4 to generate or pre-classify cases, which can introduce bias.
- Simulation fidelity is not fully specified; real grid dynamics may diverge from simulated cases.
- Automatic scoring relies heavily on GPT-4 comparisons; judge bias is possible when the judge model resembles evaluated models.
When Not To Use
- Do not use LLM outputs from this benchmark as autonomous control commands without human validation.
- Not suitable as a certification test for safety-critical control software.
- Do not replace operator expertise in live dispatch decisions based solely on benchmark scores.
Failure Modes
- Numeric hallucination and wrong calculations in critical formulas.
- Overconfidence or miscalibration of reported certainty.
- Instability under continuous-monitoring input sequences or format changes.
- Sycophancy: aligning to user claims over factual correctness.
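The instability failure mode can be probed by asking the same question in several formats and measuring how often the numeric answers agree. The sketch below is a rough illustration, not ElecBench's stability metric: `extract_number` and `consistency_rate` are hypothetical helpers, and taking the last number in a reply as "the answer" is a simplifying assumption.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_number(text: str) -> Optional[float]:
    """Return the last number found in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def consistency_rate(answers: List[str]) -> float:
    """Fraction of answers that agree with the most common numeric value.

    Replies with no number count against consistency.
    """
    if not answers:
        return 0.0
    values = [extract_number(a) for a in answers]
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return 0.0
    return counts.most_common(1)[0][1] / len(answers)


# Replies a model might give to one question asked in four formats.
replies = [
    "The load is 120 MW",
    "Answer: 120",
    "load = 120.0 MW",
    "About 125 MW",
]
print(consistency_rate(replies))
```

A low rate on a fixed question set suggests the model's answers depend on phrasing or layout rather than on the underlying calculation.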
Core Entities
Models
- gpt-4
- gpt-3.5-turbo
- llama2-7B
- llama2-13B
- llama2-70B
- gaia-7B
- gaia-13B
- gaia-70B
Metrics
- factuality
- logicality
- stability
- security
- fairness
- expressiveness
Datasets
- ElecBench (this paper)
- C-Eval (electrical engineering subset)
- MMLU (electrical engineering subset)
Benchmarks
- HELM
- AlpacaEval
- Xiezhi
- GLUE
- SuperGLUE
- SQuAD

