ElecBench — a domain benchmark that tests LLMs on power-dispatch scenarios across six practical metrics.

Overview

Decision Snapshot: Ready For Pilot

The benchmark provides actionable comparative scores for picking and stress-testing LLMs, but scores reflect this test set and need local validation before operational use.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Xiyuan Zhou, Huan Zhao, Yuheng Cheng, Yuji Cao, Gaoqi Liang, Guolong Liu, Wenxuan Liu, Yan Xu, Junhua Zhao

Why It Matters For Business

ElecBench gives a clear, task-focused way to compare LLMs on power-dispatch needs—use it to pick models, spot weaknesses (math, hallucination, stability), and design safe human-in-the-loop workflows.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

ElecBench is a new, public benchmark that measures how well large language models (LLMs) handle power-system tasks. It defines six primary metrics (factuality, logicality, stability, security, fairness, expressiveness) broken into 24 submetrics, builds a mixed test set (public sources, private professional text, simulations, and generated hallucination cases), and evaluates several LLMs (GPT-3.5, GPT-4, LLaMA2 family, GAIA family) on three core scenarios (general, dispatch, fault/black-start). Key takeaways: GPT-4 leads in reasoning and security on this test set; domain model GAIA-70B shows better stability in some tasks; smaller LLaMA models struggle with stability and math. The dataset and

Problem Statement

Power-system work needs numeric precision, simulation-aware scenarios, and domain knowledge. Existing general LLM benchmarks miss these needs. The paper builds a focused benchmark and dataset to measure LLMs on real power-dispatch tasks and related safety metrics.

Main Contribution

A domain-specific evaluation framework with six primary metrics and 24 submetrics tailored to power-system operation.

A mixed test set combining public problems, private professional text, programmatic simulations, and intentionally generated hallucination cases; released on GitHub.

Key Findings

GPT-4 scores highest on reasoning and security in ElecBench's general tests.

Numbers: logicality 9.71; security 9.28 (general scenario)

Practical Use: Use GPT-4 as the top-choice LLM for decision-support tasks where logical reasoning and security compliance matter, but validate numeric outputs before acting.

Evidence Ref: Section 4.2, Fig. 10

GAIA-70B shows stronger domain stability and good factuality versus other models.

Numbers: stability 8.64; factuality 7.79 (general scenario)

Practical Use: Consider domain-specialized models like GAIA-70B for workflows that require stable, domain-aligned answers and simulation-aware outputs.

Evidence Ref: Section 4.2, Fig. 10
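Before relying on a model for stable answers, a rough local probe can help. The sketch below asks the same question several times and reports how often the most common answer appears. It is an illustrative agreement check, not ElecBench's stability metric; `ask_model` is a hypothetical callable that wraps whichever LLM client you use.

```python
from collections import Counter


def agreement_rate(ask_model, prompt, n_runs=5):
    """Fraction of repeated answers that match the most common answer.

    ask_model: hypothetical callable taking a prompt string and returning
               the model's answer as a string (an assumption, not a real API).
    n_runs:    number of repeated queries; must be at least 1.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    # Normalize lightly so trivial whitespace/case differences do not count.
    answers = [ask_model(prompt).strip().lower() for _ in range(n_runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_runs
```

Exact-match agreement only suits short answers such as a number or a switch state; free-text answers would need a looser comparison.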

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPT-4 logicality (general)	9.71	—	—	general scenario	Section 4.2, Fig. 10	Fig. 10
GAIA-70B stability (general)	8.64	—	—	general scenario	Section 4.2, Fig. 10	Fig. 10

What To Try In 7 Days

Run ElecBench tests on your LLM to see domain gaps (use the public repo).

Add numeric verification steps for any LLM outputs used in dispatch decisions.

Compare a general LLM (GPT-4) vs a domain model (GAIA-70B) on your core workflows to pick a candidate for trial.
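The numeric verification step above could look roughly like the following. This is a minimal sketch under stated assumptions: the `Generator` fields, the tolerance, and the idea that the LLM's proposal has already been parsed into a name-to-MW mapping are all hypothetical and not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class Generator:
    name: str
    p_min_mw: float  # minimum stable output (hypothetical field)
    p_max_mw: float  # maximum output (hypothetical field)


def verify_dispatch(proposed_mw, generators, demand_mw, losses_mw=0.0, tol_mw=0.5):
    """Return a list of problems found; an empty list means the checks passed.

    proposed_mw: dict mapping unit name to the MW set-point proposed by the LLM.
    """
    problems = []
    limits = {g.name: g for g in generators}

    # Check each proposed set-point against the unit's operating range.
    for name, p in proposed_mw.items():
        g = limits.get(name)
        if g is None:
            problems.append(f"{name}: unknown unit")
        elif not (g.p_min_mw <= p <= g.p_max_mw):
            problems.append(f"{name}: {p} MW outside [{g.p_min_mw}, {g.p_max_mw}] MW")

    # Check that total generation covers demand plus losses within tolerance.
    imbalance = sum(proposed_mw.values()) - (demand_mw + losses_mw)
    if abs(imbalance) > tol_mw:
        problems.append(f"power balance off by {imbalance:+.1f} MW")
    return problems
```

Any non-empty result would route the proposal to a human operator instead of being acted on. A real check would also need to cover offline units, ramp rates, and network constraints, which this sketch ignores.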

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/xiyuan-zhou/ElecBench-a-PowerDispatch-Evaluation-Benchmark-for-Large-LanguageModels

Data URLs

https://github.com/xiyuan-zhou/ElecBench-a-PowerDispatch-Evaluation-Benchmark-for-Large-LanguageModels

Risks & Boundaries

Limitations

Some test items and scenario design use GPT-4 to generate or pre-classify cases, which can introduce bias.

Simulation fidelity is not fully specified; real grid dynamics may diverge from simulated cases.

When Not To Use

Do not use the outputs of LLMs evaluated on this benchmark as autonomous control commands without human validation.

Not suitable as a certification test for safety-critical control software.

Failure Modes

Numeric hallucination and wrong calculations in critical formulas.

Overconfidence or miscalibration of reported certainty.
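One simple way to look for the overconfidence failure mode is to compare the certainty a model states with how often it is actually right on graded items. The sketch below is an assumption-laden illustration, not a procedure from the paper: it presumes each answer has a stated confidence in [0, 1] and a correctness label assigned by a reviewer.

```python
def overconfidence_gap(records):
    """Mean stated confidence minus observed accuracy.

    records: iterable of (stated_confidence, was_correct) pairs, where
             stated_confidence is in [0, 1] and was_correct is a bool.
    A positive result suggests overconfidence on this sample; a negative
    result suggests underconfidence.
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one graded answer")
    mean_conf = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(1 for _, ok in records if ok) / len(records)
    return mean_conf - accuracy
```

A single aggregate gap hides where the miscalibration sits; grouping records by confidence range before comparing would give a finer picture.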

Core Entities

Models

gpt-4, gpt-3.5-turbo, llama2-7B, llama2-13B, llama2-70B, gaia-7B, gaia-13B, gaia-70B

Metrics

factuality, logicality, stability, security, fairness, expressiveness

Datasets

ElecBench (this paper), C-Eval (electrical engineering subset), MMLU (electrical engineering subset)

Benchmarks

HELM, AlpacaEval, Xiezhi, GLUE, SuperGLUE, SQuAD

