ERI: 57,750 engineering instruction-response items across 9 fields to test LLM reasoning and agent tool-use

Overview

Decision Snapshot: Needs Validation

ERI is ready as a regression and diagnostic tool for engineering LLM evaluation; use it with human review and slice audits before trusting outputs for high-stakes decisions.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

MZ Naser, Ahmad Bani Awwad, Zoie McCreery, Radwa Eissa, Ahmad Naser, Gianluca Cusatis, Andrew Metcalf, Kapil Madathil, Jamal Abdalla, Venkatesh Kodur, Mohammad Reza Saeb

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ERI lets teams measure engineering-specific failure modes (units, constraints, verification) so you can route tasks to cheaper models when safe and reserve frontier models for high-risk work.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

ERI is a large, taxonomy-driven benchmark built to test engineering reasoning and instruction following in LLMs and agent pipelines. It covers 9 engineering fields, 55 subdomains, 7 intent types (definition, explanation, calculation, comparison, design, troubleshooting, code), and 3 difficulty tiers, producing 57,750 structured instruction-response records. The authors release JSONL splits, taxonomy specs, validation scripts, and an evaluation harness. In a 10% stratified test, frontier LLMs (GPT-5, Claude Sonnet 4, DeepSeek V3.1) averaged >4.30/5, smaller 7–8B models averaged ~3.0 with >10% failures, and the team bounds hallucination contamination in references to ~1.7%. ERI is meant for R

Problem Statement

General benchmarks miss engineering-specific checks like units, constraints, and verification steps. ERI addresses that gap by forcing structured coverage across field, intent, and difficulty so teams can detect silent failures (e.g., violating constraints or inventing assumptions) and make reliable routing, fine-tuning, or audit decisions.

Main Contribution

A 57,750-item instruction-response dataset with per-item metadata (field, subdomain, intent, difficulty).

A taxonomy covering 9 engineering fields and 55 subdomains to guarantee slice-level coverage.

Key Findings

ERI contains 57,750 instruction-response records with explicit metadata.

Numbers: 57,750 records; 1,155 metadata cells (50 per cell)

Practical Use: You can evaluate or fine-tune models on controlled slices (field × intent × difficulty) rather than only broad averages.

Evidence Ref: Abstract; Sec 4.1; Sec 5.1
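The record count is consistent with a full cross of the taxonomy. The check below is a minimal sketch, assuming the 1,155 cells are the product of subdomains, intents, and difficulty tiers; the summary states the totals but does not spell out this factorization.

```python
# Sanity check of the ERI cell arithmetic.
# Assumption: cells = subdomains x intents x difficulty tiers.
# The source gives only the totals (1,155 cells, 50 per cell, 57,750 records).
SUBDOMAINS = 55
INTENTS = 7
DIFFICULTY_TIERS = 3
SAMPLES_PER_CELL = 50

cells = SUBDOMAINS * INTENTS * DIFFICULTY_TIERS
records = cells * SAMPLES_PER_CELL

print(cells)    # 1155
print(records)  # 57750
```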

Frontier models deliver near-expert average quality on ERI.

Numbers: GPT-5 mean 4.48/5 (σ=0.49); top-3 >4.30

Practical Use: Reserve high-risk, high-rigor engineering tasks for frontier models to reduce silent constraint violations.

Evidence Ref: Sec 6.2, Fig.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	57,750 records (1,155 cells; 50 samples per cell)	—	—	full	Sec 4.1; Abstract	Sec 4.1
Top model mean score	GPT-5 mean 4.48/5 (σ=0.49)	—	—	10% stratified test	Sec 6.2, Fig.3	Sec 6.2

What To Try In 7 Days

Load the ERI test split from HuggingFace and run your model on a few targeted slices (e.g., CALC, DES).

Use the provided evaluation harness to collect automatic checks and rubric scores for those slices.

Set up a simple router: route undergraduate CALC to a cheaper model and graduate/DES to a stronger model; measure cost vs. failure rate change.
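The routing step above could be prototyped roughly as follows. This is a sketch under stated assumptions, not the authors' harness: the `Outcome` record, the label spellings ("undergraduate", "CALC"), the per-call costs, and the toy scores are all hypothetical, and a failure is taken to be a rubric score of 2 or lower, as in the Metrics list.

```python
from dataclasses import dataclass

# Hypothetical slice labels; the real ERI metadata values may be spelled differently.
CHEAP_SLICES = {("undergraduate", "CALC")}


@dataclass
class Outcome:
    difficulty: str
    intent: str
    cheap_score: int   # rubric score (1-5) the cheaper model earned on this item
    strong_score: int  # rubric score (1-5) the stronger model earned on this item


def route(difficulty: str, intent: str) -> str:
    """Send listed low-rigor slices to the cheap model, everything else to the strong one."""
    return "cheap" if (difficulty, intent) in CHEAP_SLICES else "strong"


def evaluate_router(outcomes, cheap_cost: float, strong_cost: float):
    """Return (total_cost, failure_rate) under the routing rule; failure = score <= 2."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    total_cost = 0.0
    failures = 0
    for o in outcomes:
        if route(o.difficulty, o.intent) == "cheap":
            total_cost += cheap_cost
            failures += o.cheap_score <= 2
        else:
            total_cost += strong_cost
            failures += o.strong_score <= 2
    return total_cost, failures / len(outcomes)


# Toy data, made up for illustration only.
outcomes = [
    Outcome("undergraduate", "CALC", cheap_score=4, strong_score=5),
    Outcome("undergraduate", "CALC", cheap_score=2, strong_score=5),
    Outcome("graduate", "DES", cheap_score=2, strong_score=4),
    Outcome("graduate", "CALC", cheap_score=3, strong_score=4),
]
routed_cost, routed_failure_rate = evaluate_router(outcomes, cheap_cost=1.0, strong_cost=10.0)
print(routed_cost, routed_failure_rate)  # 22.0 0.25
```

Comparing these two numbers against an all-strong baseline (here, cost 40.0 with no failures on the toy data) gives the cost vs. failure-rate trade-off the step asks for.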

Agent Features

Planning

tool planning evaluation (calculator/solver integration)

Tool Use

calculator/solver/checker integration

planned tool-trace extension for agent evaluation

Frameworks

LLM-as-a-judge, Self-Instruct generation

Is Agentic

Yes

Architectures

MoE, model-as-judge panel

Collaboration

multi-judge aggregation across providers
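One plausible way to combine a judge panel is to average the per-item scores and keep the spread as a disagreement signal. The sketch below assumes mean aggregation and uses hypothetical judge names and toy scores; the summary does not say which aggregation rule the authors use.

```python
from statistics import mean, pstdev


def aggregate_judges(scores_by_judge):
    """Combine per-item 1-5 scores from several judges.

    scores_by_judge maps a judge name to a list of scores aligned by item.
    Returns one dict per item with the panel mean and the spread across judges.
    Mean aggregation is an assumption, not the documented ERI rule.
    """
    lengths = {len(scores) for scores in scores_by_judge.values()}
    if len(lengths) != 1:
        raise ValueError("every judge must score the same number of items")
    per_item = zip(*scores_by_judge.values())
    return [{"mean": mean(item), "spread": pstdev(item)} for item in per_item]


# Toy panel, made up for illustration only.
panel = {
    "judge_a": [5, 4, 2],
    "judge_b": [4, 4, 1],
    "judge_c": [5, 3, 3],
}
for row in aggregate_judges(panel):
    print(round(row["mean"], 2), round(row["spread"], 2))
```

Items with a large spread are natural candidates for the human review the Decision Snapshot recommends.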

Optimization Features

Training Optimization

SFT, LoRA

Inference Optimization

routing low-rigor slices to smaller models to save cost

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/mznaser-clemson/ERI-Benchmark https://huggingface.co/datasets/mznaser/ERI-Benchmark

Data URLs

https://huggingface.co/datasets/mznaser/ERI-Benchmark

Risks & Boundaries

Limitations

Synthetic generation inherits generator model biases and temporal anchoring.

Not a substitute for jurisdiction-specific code compliance or stamped engineering review.

When Not To Use

Certifying safety or legal compliance

Directly automating high-stakes control without human oversight

Failure Modes

Silent constraint violations (units, bounds) despite plausible prose

Overfitting to benchmark phrasing (teaching to the test)
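A silent constraint violation is an answer whose prose reads well while the number or unit is wrong. A minimal illustrative check, not part of the ERI harness, might look like the following; the function name, the violation labels, and the example values are hypothetical.

```python
def check_answer(value: float, unit: str, expected_unit: str,
                 lower: float, upper: float) -> list:
    """Return a list of violation labels; an empty list means the answer passes.

    Units are compared case-sensitively on purpose: "MPa" and "mPa" differ
    by nine orders of magnitude.
    """
    violations = []
    if unit.strip() != expected_unit.strip():
        violations.append("unit_mismatch")
    if not (lower <= value <= upper):
        violations.append("out_of_bounds")
    return violations


# Made-up example: a stress that must be reported in MPa and stay within 0-400.
print(check_answer(250.0, "MPa", "MPa", 0.0, 400.0))  # []
print(check_answer(250.0, "kPa", "MPa", 0.0, 400.0))  # ['unit_mismatch']
print(check_answer(450.0, "MPa", "MPa", 0.0, 400.0))  # ['out_of_bounds']
```

A check like this only catches violations that can be stated as explicit units and bounds; it does not replace rubric scoring or human review.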

Core Entities

Models

GPT-5, Claude Sonnet 4, DeepSeek V3.1, Claude Haiku 4.5, GPT-4.1 Mini, Mistral Small 3, Mistral 7B, Llama 3.3 70B, Qwen 2.5 7B, Llama 3.1 8B, GPT-5.1

Metrics

mean score (1-5), failure rate (score ≤2), Spearman inter-judge correlation, σ (per-item standard deviation), perfect-score rate

Datasets

ERI-Benchmark (mznaser/ERI-Benchmark on HuggingFace)

Benchmarks

MMLU, BIG-bench, GSM8K, EngDesign, SoM-1K
