Overview
The benchmark provides actionable comparative scores for picking and stress-testing LLMs, but scores reflect this test set and need local validation before operational use.
Citations: 3
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
ElecBench gives a clear, task-focused way to compare LLMs on power-dispatch needs. Use it to pick models, spot weaknesses (math, hallucination, stability), and design safe human-in-the-loop workflows.
Who Should Care
Summary TLDR
ElecBench is a new, public benchmark that measures how well large language models (LLMs) handle power-system tasks. It defines six primary metrics (factuality, logicality, stability, security, fairness, expressiveness) broken into 24 submetrics, builds a mixed test set (public sources, private professional text, simulations, and generated hallucination cases), and evaluates several LLMs (GPT-3.5, GPT-4, LLaMA2 family, GAIA family) on three core scenarios (general, dispatch, fault/black-start). Key takeaways: GPT-4 leads in reasoning and security on this test set; domain model GAIA-70B shows better stability in some tasks; smaller LLaMA models struggle with stability and math. The dataset and
Problem Statement
Power-system work needs numeric precision, simulation-aware scenarios, and domain knowledge. Existing general LLM benchmarks miss these needs. The paper builds a focused benchmark and dataset to measure LLMs on real power-dispatch tasks and related safety metrics.
Main Contribution
A domain-specific evaluation framework with six primary metrics and 24 submetrics tailored to power-system operation.
A mixed test set combining public problems, private professional text, programmatic simulations, and intentionally generated hallucination cases; released on GitHub.
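To make the metric hierarchy concrete, here is a minimal sketch of how six primary metrics with nested submetrics might be stored and rolled up. The six primary metric names come from the summary above. The submetric names (`sub_a`, `sub_b`), the scores, and the use of a plain mean as the roll-up rule are all assumptions for illustration, not details confirmed by the paper.

```python
from statistics import mean

# The six primary metrics named in the summary.
PRIMARY_METRICS = (
    "factuality", "logicality", "stability",
    "security", "fairness", "expressiveness",
)

def aggregate_scores(submetric_scores):
    """Average submetric scores into one score per primary metric.

    submetric_scores maps a primary metric name to a dict of
    {submetric name: score}. Using the mean is an assumption;
    the benchmark may weight submetrics differently.
    """
    unknown = set(submetric_scores) - set(PRIMARY_METRICS)
    if unknown:
        raise ValueError(f"unknown primary metrics: {sorted(unknown)}")
    return {
        metric: mean(list(scores.values()))
        for metric, scores in submetric_scores.items()
        if scores  # skip metrics that have no submetric scores yet
    }

# Hypothetical placeholder scores, not results from the paper.
example = {
    "logicality": {"sub_a": 9.0, "sub_b": 8.0},
    "stability": {"sub_a": 7.0},
}
summary = aggregate_scores(example)
```

Keeping the primary metric names in one tuple means a typo in a results file fails loudly instead of silently creating a seventh metric.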
Key Findings
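GPT-4 scores highest on reasoning (logicality 9.71) and security in ElecBench's general scenario.
GAIA-70B, the domain-specific model, shows stronger stability on some tasks (8.64 in the general scenario) and good factuality relative to the other models tested.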
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GPT-4 logicality (general) | 9.71 | — | — | general scenario | Section 4.2, Fig. 10 | Fig. 10 |
| GAIA-70B stability (general) | 8.64 | — | — | general scenario | Section 4.2, Fig. 10 | Fig. 10 |
What To Try In 7 Days
Run ElecBench tests on your LLM to see domain gaps (use the public repo).
Add numeric verification steps for any LLM outputs used in dispatch decisions.
Compare a general LLM (GPT-4) vs a domain model (GAIA-70B) on your core workflows to pick a candidate for trial.
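One way to approach the numeric verification step above is to recompute any quantity the model states and compare it within a tolerance before a human sees the recommendation. The sketch below is illustrative only: the function names, the tolerance, and the example figures are hypothetical, and the power-balance check (generation equals load plus losses) is a generic sanity check rather than a procedure taken from the paper.

```python
import math

def verify_power_balance(generation_mw, load_mw, losses_mw, rel_tol=1e-3):
    """Return True when total generation matches load plus losses within rel_tol."""
    total_generation = sum(generation_mw)
    demand = load_mw + losses_mw
    return math.isclose(total_generation, demand, rel_tol=rel_tol)

def check_llm_value(llm_value, recomputed_value, rel_tol=1e-3):
    """Compare a number quoted by the model with an independently recomputed one."""
    return math.isclose(llm_value, recomputed_value, rel_tol=rel_tol)

# Hypothetical case: the model proposes three unit outputs (MW)
# for a 500 MW load with 10 MW of losses.
proposed = [200.0, 180.0, 130.0]
balanced = verify_power_balance(proposed, load_mw=500.0, losses_mw=10.0)
```

A failed check should route the output to an operator for review rather than trigger an automatic correction, in line with the human-in-the-loop guidance in this summary.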
Risks & Boundaries
Limitations
Some test items and scenario design use GPT-4 to generate or pre-classify cases, which can introduce bias.
Simulation fidelity is not fully specified; real grid dynamics may diverge from simulated cases.
When Not To Use
Do not use LLM outputs from this benchmark as autonomous control commands without human validation.
Not suitable as a certification test for safety-critical control software.
Failure Modes
Numeric hallucination and wrong calculations in critical formulas.
Overconfidence or miscalibration of reported certainty.
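The miscalibration failure mode can be checked on a local validation set. The sketch below computes expected calibration error (ECE), a standard measure that bins answers by stated confidence and compares average confidence with observed accuracy in each bin. It is a swapped-in general technique, not the paper's own scoring method, and it assumes you can obtain a confidence value in [0, 1] and a correct/incorrect label for each answer.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical data: four answers stated at 90% confidence, half of them wrong.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
```

A value near zero suggests stated confidence tracks accuracy. A large value, as in the overconfident example here, is a signal to discount the model's self-reported certainty.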

