ElecBench: a domain benchmark that tests LLMs on power-dispatch scenarios across six practical metrics.

July 7, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

3

Authors

Xiyuan Zhou, Huan Zhao, Yuheng Cheng, Yuji Cao, Gaoqi Liang, Guolong Liu, Wenxuan Liu, Yan Xu, Junhua Zhao


Why It Matters For Business

ElecBench gives a clear, task-focused way to compare LLMs on power-dispatch needs. Use it to pick models, spot weaknesses (math, hallucination, stability), and design safe human-in-the-loop workflows.

Summary TLDR

ElecBench is a new, public benchmark that measures how well large language models (LLMs) handle power-system tasks. It defines six primary metrics (factuality, logicality, stability, security, fairness, expressiveness) broken into 24 submetrics, builds a mixed test set (public sources, private professional text, simulations, and generated hallucination cases), and evaluates several LLMs (GPT-3.5, GPT-4, LLaMA2 family, GAIA family) on three core scenarios (general, dispatch, fault/black-start). Key takeaways: GPT-4 leads in reasoning and security on this test set; domain model GAIA-70B shows better stability in some tasks; smaller LLaMA models struggle with stability and math. The dataset and

Problem Statement

Power-system work needs numeric precision, simulation-aware scenarios, and domain knowledge. Existing general LLM benchmarks miss these needs. The paper builds a focused benchmark and dataset to measure LLMs on real power-dispatch tasks and related safety metrics.

Main Contribution

A domain-specific evaluation framework with six primary metrics and 24 submetrics tailored to power-system operation.

A mixed test set combining public problems, private professional text, programmatic simulations, and intentionally generated hallucination cases; released on GitHub.

An evaluation pipeline that combines dual GPT-4 scoring with human adjudication and simulation-based scoring.

A comparative evaluation of major LLMs (GPT-3.5, GPT-4, LLaMA2 7/13/70B, GAIA 7/13/70B) across three scenarios: general, dispatch, and black-start.
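
The dual-judge plus human-adjudication idea in the pipeline above can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the paper's implementation: the 0-10 score scale, the `DISAGREEMENT_THRESHOLD` value, the averaging rule, and the names `JudgedItem`, `needs_adjudication`, and `final_score` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff: the source does not say when a human steps in.
DISAGREEMENT_THRESHOLD = 2.0


@dataclass
class JudgedItem:
    item_id: str
    judge_a: float  # score from the first GPT-4 judging pass (0-10 scale assumed)
    judge_b: float  # score from the second GPT-4 judging pass
    human: Optional[float] = None  # set only after a human adjudicates


def needs_adjudication(item: JudgedItem,
                       threshold: float = DISAGREEMENT_THRESHOLD) -> bool:
    """Flag items where the two automatic judges disagree by more than threshold."""
    return abs(item.judge_a - item.judge_b) > threshold


def final_score(item: JudgedItem,
                threshold: float = DISAGREEMENT_THRESHOLD) -> Optional[float]:
    """Return the human score if present, the judge mean if the judges agree,
    or None while the item is still waiting for adjudication."""
    if item.human is not None:
        return item.human
    if needs_adjudication(item, threshold):
        return None
    return (item.judge_a + item.judge_b) / 2
```

Returning `None` for unresolved items keeps disputed scores out of any aggregate until a person has looked at them, which is one plausible way to read "human adjudication".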

Key Findings

GPT-4 scores highest on reasoning and security in ElecBench's general tests.

Numbers: logicality 9.71; security 9.28 (general scenario)

GAIA-70B shows stronger domain stability and good factuality versus other models.

Numbers: stability 8.64; factuality 7.79 (general scenario)

LLaMA2-70B achieves high factuality and stability but weaker expressiveness.

Numbers: factuality 8.35; stability 9.03; expressiveness 6.04 (general scenario)

Math and hallucination subtests show wide variance between models.

Numbers: GPT-4 math 9.67; GPT-3.5 misinformation 7.88; LLaMA-70B hallucination 6.9 (appendix A)

Results

GPT-4 logicality (general)

Value: 9.71

GAIA-70B stability (general)

Value: 8.64

LLaMA-70B factuality (general)

Value: 8.35

Who Should Care

What To Try In 7 Days

Run ElecBench tests on your LLM to see domain gaps (use the public repo).

Add numeric verification steps for any LLM outputs used in dispatch decisions.

Compare a general LLM (GPT-4) vs a domain model (GAIA-70B) on your core workflows to pick a candidate for trial.
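
For the numeric verification step suggested above, one possible shape is to recompute the quantity independently and check the LLM's stated number against it. The sketch below is an assumption-laden illustration, not part of ElecBench: the helper names, the 1% tolerance, and the choice of the standard three-phase active-power formula as the example calculation are all hypothetical.

```python
import math
import re


def extract_numbers(text: str) -> list[float]:
    """Return every plain decimal number found in text, in order.
    Limitation: thousands separators such as "1,558.8" are not handled."""
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text)]


def verify_value(llm_answer: str, expected: float, rel_tol: float = 0.01) -> bool:
    """True if any number in the answer is within rel_tol of the expected value.
    Deliberately permissive: it does not check units or which number is 'the' answer."""
    return any(math.isclose(x, expected, rel_tol=rel_tol)
               for x in extract_numbers(llm_answer))


def three_phase_power_kw(line_voltage_kv: float, line_current_a: float,
                         power_factor: float) -> float:
    """Active power P = sqrt(3) * V_LL * I_L * cos(phi); kV times A gives kW."""
    return math.sqrt(3) * line_voltage_kv * line_current_a * power_factor


# Example: 10 kV line voltage, 100 A line current, power factor 0.9.
expected_kw = three_phase_power_kw(10.0, 100.0, 0.9)
answer_ok = verify_value("The active power is about 1558.8 kW.", expected_kw)
answer_bad = verify_value("The active power is about 1500 kW.", expected_kw)
```

A check like this only catches arithmetic drift; an operator still has to confirm that the model picked the right formula and inputs in the first place.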

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Some test items and scenario design use GPT-4 to generate or pre-classify cases, which can introduce bias.
  • Simulation fidelity is not fully specified; real grid dynamics may diverge from simulated cases.
  • Automatic scoring relies heavily on GPT-4 comparisons; judge bias is possible when the judge model resembles evaluated models.

When Not To Use

  • Do not use LLM outputs from this benchmark as autonomous control commands without human validation.
  • Not suitable as a certification test for safety-critical control software.
  • Do not replace operator expertise in live dispatch decisions on the basis of benchmark scores alone.

Failure Modes

  • Numeric hallucination and wrong calculations in critical formulas.
  • Overconfidence or miscalibration of reported certainty.
  • Instability under continuous-monitoring input sequences or format changes.
  • Sycophancy: aligning to user claims over factual correctness.
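
The instability failure mode listed above (answers changing when only the input format changes) can be probed with a simple consistency check. This is a hedged sketch under stated assumptions: `ask` stands in for whatever function calls your model, and `consistency_rate` and `normalize` are hypothetical helpers, not the stability metric ElecBench defines.

```python
from collections import Counter
from typing import Callable, Iterable


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(answer.lower().split())


def consistency_rate(ask: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of prompts whose normalized answer matches the most common answer.
    The prompts are assumed to be reformatted variants of the same question."""
    answers = [normalize(ask(p)) for p in prompts]
    if not answers:
        raise ValueError("need at least one prompt")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)
```

Exact string matching after normalization is a crude proxy. For free-text answers you would likely need to compare extracted decisions or numbers instead, but the structure of the check stays the same.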

Core Entities

Models

  • gpt-4
  • gpt-3.5-turbo
  • llama2-7B
  • llama2-13B
  • llama2-70B
  • gaia-7B
  • gaia-13B
  • gaia-70B

Metrics

  • factuality
  • logicality
  • stability
  • security
  • fairness
  • expressiveness

Datasets

  • ElecBench (this paper)
  • C-Eval (electrical engineering subset)
  • MMLU (electrical engineering subset)

Benchmarks

  • HELM
  • AlpacaEval
  • Xiezhi
  • GLUE
  • SuperGLUE
  • SQuAD