ERI: 57,750 engineering instruction-response items across 9 fields to test LLM reasoning and agent tool-use

February 16, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

MZ Naser, Ahmad Bani Awwad, Zoie McCreery, Radwa Eissa, Ahmad Naser, Gianluca Cusatis, Andrew Metcalf, Kapil Madathil, Jamal Abdalla, Venkatesh Kodur, Mohammad Reza Saeb

Links

Abstract / PDF

Why It Matters For Business

ERI lets teams measure engineering-specific failure modes (units, constraints, verification) so you can route tasks to cheaper models when safe and reserve frontier models for high-risk work.

Summary TLDR

ERI is a large, taxonomy-driven benchmark built to test engineering reasoning and instruction following in LLMs and agent pipelines. It covers 9 engineering fields, 55 subdomains, 7 intent types (definition, explanation, calculation, comparison, design, troubleshooting, code), and 3 difficulty tiers, producing 57,750 structured instruction-response records. The authors release JSONL splits, taxonomy specs, validation scripts, and an evaluation harness. In a 10% stratified test, frontier LLMs (GPT-5, Claude Sonnet 4, DeepSeek V3.1) averaged >4.30/5, smaller 7–8B models averaged ~3.0 with >10% failures, and the team bounds hallucination contamination in references to ~1.7%. ERI is meant for R

Problem Statement

General benchmarks miss engineering-specific checks like units, constraints, and verification steps. ERI addresses that gap by forcing structured coverage across field, intent, and difficulty so teams can detect silent failures (e.g., violating constraints or inventing assumptions) and make reliable routing, fine-tuning, or audit decisions.

Main Contribution

A 57,750-item instruction-response dataset with per-item metadata (field, subdomain, intent, difficulty).

A taxonomy covering 9 engineering fields and 55 subdomains to guarantee slice-level coverage.

An evaluation harness with automatic checks, rubric scoring, a multi-judge model-as-judge protocol, and verification scripts.

A convergent-validation protocol that triangulates generator, judge, and evaluated models to bound hallucination risk.

Baseline benchmarking across seven LLMs and practical guidance for routing and fine-tuning in engineering workflows.
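The size figures reported for the dataset are consistent with a grid in which each cell is one (subdomain, intent, difficulty) combination. That reading is an inference from the arithmetic, not a statement from the authors. The sketch below checks the arithmetic and shows one possible record shape. The class name `EriRecord` and its key names are hypothetical, and the released JSONL may use different ones.

```python
from dataclasses import dataclass

# Counts taken from the summary: 55 subdomains (across 9 fields),
# 7 intent types, 3 difficulty tiers, 50 samples per cell.
N_SUBDOMAINS = 55
N_INTENTS = 7
N_DIFFICULTY_TIERS = 3
SAMPLES_PER_CELL = 50


@dataclass(frozen=True)
class EriRecord:
    """Hypothetical record shape; real key names may differ."""
    field: str
    subdomain: str
    intent: str
    difficulty: str
    instruction: str
    response: str


def cell_count() -> int:
    # Assumption: one cell per (subdomain, intent, difficulty) combination.
    return N_SUBDOMAINS * N_INTENTS * N_DIFFICULTY_TIERS


def total_records() -> int:
    return cell_count() * SAMPLES_PER_CELL


print(cell_count(), total_records())  # 1155 57750
```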

Key Findings

ERI contains 57,750 instruction-response records with explicit metadata.

Numbers: 57,750 records; 1,155 metadata cells (50 per cell)

Frontier models deliver near-expert average quality on ERI.

Numbers: GPT-5 mean 4.48/5 (σ=0.49); top-3 >4.30

Smaller 7–8B models show steeper degradation and higher failure rates.

Numbers: Qwen 2.5 7B mean 3.27; Llama 3.1 8B mean 2.96; failure rates >10%

Convergent validation limits hallucinated reference contamination.

Numbers: Items where frontier ≤3.0: 1.7% (upper bound on hallucination)

Multi-judge model-as-judge protocol produces moderate-to-strong agreement.

Numbers: Spearman ρ between judges 0.70–0.80; >85% within-one-point agreement
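The two judge-agreement figures can be computed for any pair of judges from their per-item scores. The following is a minimal standard-library sketch, not the authors' harness. The function names and the sample scores are illustrative only, and Spearman's ρ is taken as the Pearson correlation of average ranks, which handles tied 1-5 scores.

```python
from statistics import mean


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    # Raises ZeroDivisionError if either list is constant.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    return pearson(average_ranks(xs), average_ranks(ys))


def within_one_point(xs, ys):
    """Share of items where the two judges differ by at most one point."""
    return sum(abs(x - y) <= 1 for x, y in zip(xs, ys)) / len(xs)


# Made-up scores for two judges over eight items.
judge_a = [5, 4, 4, 3, 2, 5, 1, 4]
judge_b = [5, 5, 4, 3, 3, 4, 1, 2]
print(spearman_rho(judge_a, judge_b))
print(within_one_point(judge_a, judge_b))  # 0.875
```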

Results

Dataset size

Value: 57,750 records (1,155 cells; 50 samples per cell)

Top model mean score

Value: GPT-5 mean 4.48/5 (σ=0.49)

Frontier models mean

Value: Top-3 models >4.30 mean

Mid / lower-tier model means

Value: Mistral 7B 3.81; Llama 3.3 70B 3.75; Qwen 2.5 7B 3.27; Llama 3.1 8B 2.96

Failure rate (score ≤2)

Value: >10% for 7–8B models; <1% for frontier models

Hallucination bound in references

Value: ≤1.7% items flagged

Judge agreement

Value: Spearman ρ = 0.70–0.80; >85% within-one-point
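The per-model figures above (mean, σ, failure rate) can be reproduced from a list of 1-5 rubric scores. The sketch below is one way to do it under stated assumptions: the function name `summarize` is hypothetical, σ is computed as a population standard deviation (the summary does not say which variant the authors use), and the sample scores are made up.

```python
from statistics import mean, pstdev


def summarize(scores, fail_at=2, perfect=5):
    """Summarize rubric scores for one model on one slice.

    fail_at=2 follows the "score <= 2" failure definition in the summary.
    pstdev is an assumption; a sample standard deviation may be intended.
    """
    n = len(scores)
    return {
        "mean": mean(scores),
        "sigma": pstdev(scores),
        "failure_rate": sum(s <= fail_at for s in scores) / n,
        "perfect_rate": sum(s == perfect for s in scores) / n,
    }


# Made-up scores for ten items.
stats = summarize([5, 5, 4, 2, 1, 3, 4, 5, 5, 4])
print(stats["failure_rate"], stats["perfect_rate"])  # 0.2 0.4
```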

Who Should Care

What To Try In 7 Days

Load the ERI test split from Hugging Face (mznaser/ERI-Benchmark) and run your model on a few targeted slices (e.g., CALC for calculation, DES for design).

Use the provided evaluation harness to collect automatic checks and rubric scores for those slices.

Set up a simple router: route undergraduate CALC to a cheaper model and graduate/DES to a stronger model, then measure how cost and failure rate change relative to sending everything to the stronger model.
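The routing step can be prototyped offline once both models have rubric scores on the same items. The sketch below is illustrative only: the `Task` class, the model labels, the per-call costs, and the scores are all hypothetical, and sending every slice other than undergraduate CALC to the stronger model is a conservative default chosen here, not a rule from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    intent: str      # e.g. "CALC" or "DES"
    difficulty: str  # e.g. "undergraduate" or "graduate"


CHEAP, STRONG = "cheap-model", "strong-model"  # placeholder labels


def route(task):
    """Send undergraduate CALC to the cheap model, everything else to the strong one."""
    if task.intent == "CALC" and task.difficulty == "undergraduate":
        return CHEAP
    return STRONG


def evaluate_routing(tasks, scores, cost_per_call, fail_at=2):
    """Return (total cost, failure rate) for the routed assignment.

    scores[model][i] is the rubric score that model received on tasks[i].
    """
    total_cost = 0.0
    failures = 0
    for i, task in enumerate(tasks):
        model = route(task)
        total_cost += cost_per_call[model]
        failures += scores[model][i] <= fail_at
    return total_cost, failures / len(tasks)


# Made-up inputs for three tasks.
tasks = [
    Task("CALC", "undergraduate"),
    Task("DES", "graduate"),
    Task("CALC", "graduate"),
]
scores = {CHEAP: [4, 1, 2], STRONG: [5, 4, 4]}
cost_per_call = {CHEAP: 1.0, STRONG: 10.0}
print(evaluate_routing(tasks, scores, cost_per_call))  # (21.0, 0.0)
```

Comparing this pair against the all-strong baseline (cost 30.0 with the same made-up prices) gives the cost saved and shows whether the failure rate moved.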

Agent Features

Planning

  • tool planning evaluation (calculator/solver integration)

Tool Use

  • calculator/solver/checker integration
  • planned tool-trace extension for agent evaluation

Frameworks

  • LLM-as-a-judge
  • Self-Instruct generation

Is Agentic

true

Architectures

  • MoE
  • model-as-judge panel

Collaboration

  • multi-judge aggregation across providers

Optimization Features

Training Optimization

  • SFT
  • LoRA

Inference Optimization

  • routing low-rigor slices to smaller models to save cost

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Synthetic generation inherits generator model biases and temporal anchoring.
  • Not a substitute for jurisdiction-specific code compliance or stamped engineering review.
  • Coverage excludes some engineering disciplines (e.g., biomedical, nuclear) and is fixed to chosen subdomains.
  • Optimization risk: teams can overfit to ERI prompt styles and inflate scores.

When Not To Use

  • Certifying safety or legal compliance
  • Directly automating high-stakes control without human oversight
  • Assuming coverage of engineering topics outside the 55 subdomains

Failure Modes

  • Silent constraint violations (units, bounds) despite plausible prose
  • Overfitting to benchmark phrasing (teaching to the test)
  • Reference errors from synthetic generator on time-sensitive code or standards
  • Judge bias when relying on a single judge model

Core Entities

Models

  • GPT-5
  • Claude Sonnet 4
  • DeepSeek V3.1
  • Claude Haiku 4.5
  • GPT-4.1 Mini
  • Mistral Small 3
  • Mistral 7B
  • Llama 3.3 70B
  • Qwen 2.5 7B
  • Llama 3.1 8B
  • GPT-5.1

Metrics

  • mean score (1-5)
  • failure rate (score ≤2)
  • Spearman inter-judge correlation
  • σ (per-item standard deviation)
  • perfect-score rate

Datasets

  • ERI-Benchmark (mznaser/ERI-Benchmark on HuggingFace)

Benchmarks

  • MMLU
  • BIG-bench
  • GSM8K
  • EngDesign
  • SoM-1K