Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
ERI lets teams measure engineering-specific failure modes (units, constraints, verification) so they can route tasks to cheaper models when safe and reserve frontier models for high-risk work.
Summary TLDR
ERI is a large, taxonomy-driven benchmark built to test engineering reasoning and instruction following in LLMs and agent pipelines. It covers 9 engineering fields, 55 subdomains, 7 intent types (definition, explanation, calculation, comparison, design, troubleshooting, code), and 3 difficulty tiers, producing 57,750 structured instruction-response records. The authors release JSONL splits, taxonomy specs, validation scripts, and an evaluation harness. In a 10% stratified test, frontier LLMs (GPT-5, Claude Sonnet 4, DeepSeek V3.1) averaged >4.30/5, smaller 7–8B models averaged ~3.0 with >10% failures, and the team bounds hallucination contamination in references to ~1.7%. ERI is meant for R
Problem Statement
General benchmarks miss engineering-specific checks like units, constraints, and verification steps. ERI addresses that gap by forcing structured coverage across field, intent, and difficulty so teams can detect silent failures (e.g., violating constraints or inventing assumptions) and make reliable routing, fine-tuning, or audit decisions.
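A minimal sketch of the kind of automatic check this implies, assuming a calculation answer states its final value as a number followed by a unit. The helper name `check_answer`, the regular expression, and the example bounds are illustrative assumptions and are not taken from the ERI evaluation harness.

```python
import re

# Matches a number followed by an optional unit token, e.g. "12.5 kN" or "3e-3 m".
_NUMBER_WITH_UNIT = re.compile(
    r"(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*([A-Za-z][A-Za-z/]*)?"
)


def check_answer(text, expected_unit, lower, upper):
    """Return a list of problems found in the last numeric value of `text`.

    Hypothetical helper: it only inspects the last number in the text, so it
    is a sketch of a unit/bounds check, not a full verifier.
    """
    matches = _NUMBER_WITH_UNIT.findall(text)
    if not matches:
        return ["no numeric value found"]

    value_str, unit = matches[-1]
    problems = []
    if unit != expected_unit:
        problems.append(f"unit {unit or 'missing'!r} != expected {expected_unit!r}")

    value = float(value_str)
    if not lower <= value <= upper:
        problems.append(f"value {value} outside [{lower}, {upper}]")
    return problems


# A plausible-sounding answer with the wrong unit is flagged even though the prose reads fine.
print(check_answer("The maximum bending stress is 182.4 kPa.", "MPa", 0, 250))
```

A check like this catches only the simplest silent failures; constraint and assumption checks would need task-specific logic.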
Main Contribution
A 57,750-item instruction-response dataset with per-item metadata (field, subdomain, intent, difficulty).
A taxonomy covering 9 engineering fields and 55 subdomains to guarantee slice-level coverage.
An evaluation harness with automatic checks, rubric scoring, multi-judge model-as-judge protocol, and verification scripts.
A convergent-validation protocol that triangulates generator, judge, and evaluated models to bound hallucination risk.
Baseline benchmarking across seven LLMs and practical guidance for routing and fine-tuning in engineering workflows.
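The coverage claim can be sanity-checked with simple arithmetic. The sketch below assumes that every (subdomain, intent, difficulty) cell holds the same number of records; the summary does not state this, but the reported totals are consistent with it.

```python
# Counts taken from the summary; the even-split assumption is ours.
SUBDOMAINS = 55  # subdomains are nested inside the 9 fields, so fields do not multiply
INTENTS = [
    "definition", "explanation", "calculation", "comparison",
    "design", "troubleshooting", "code",
]
DIFFICULTY_TIERS = 3
TOTAL_RECORDS = 57_750

cells = SUBDOMAINS * len(INTENTS) * DIFFICULTY_TIERS
records_per_cell, remainder = divmod(TOTAL_RECORDS, cells)

print(cells, records_per_cell, remainder)  # 1155 50 0
```

Under that assumption, each slice would contain 50 records, which is enough to report a per-slice mean but small enough that per-slice failure rates will be noisy.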
Key Findings
ERI contains 57,750 instruction-response records with explicit metadata.
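Frontier models (GPT-5, Claude Sonnet 4, DeepSeek V3.1) deliver near-expert average quality on ERI, averaging above 4.30/5 on the 10% stratified test.
Smaller 7–8B models show steeper degradation and higher failure rates, averaging about 3.0/5 with more than 10% of responses failing.
Convergent validation bounds hallucinated reference contamination at roughly 1.7%.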
Multi-judge model-as-judge protocol produces moderate-to-strong agreement.
Results
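Dataset size
57,750 instruction-response records
Top model mean score
Frontier models mean
>4.30/5
Mid / lower-tier model means
~3.0/5 (7–8B models)
Failure rate (score ≤2)
>10% (7–8B models)
Hallucination bound in references
~1.7%
Judge agreement
Moderate to strong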
Who Should Care
What To Try In 7 Days
Load the ERI test split from HuggingFace and run your model on a few targeted slices (e.g., CALC, DES).
Use the provided evaluation harness to collect automatic checks and rubric scores for those slices.
Set up a simple router: send undergraduate CALC items to a cheaper model and graduate or DES items to a stronger model, then measure how cost and failure rate change.
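The routing experiment above can be sketched as follows. This is a toy illustration under stated assumptions: the record fields (`id`, `intent`, `difficulty`), the model names, the per-call costs, and the scores are all made up, and in practice the records would come from the ERI test split and the scores from the evaluation harness.

```python
CHEAP_MODEL = "cheap-model"    # placeholder names, not real endpoints
STRONG_MODEL = "strong-model"


def route(record):
    """Send undergraduate CALC items to the cheap model, everything else to the strong one."""
    if record["intent"] == "CALC" and record["difficulty"] == "undergraduate":
        return CHEAP_MODEL
    return STRONG_MODEL


def always_strong(record):
    """Baseline policy: every item goes to the strong model."""
    return STRONG_MODEL


def evaluate_policy(records, policy, scores, cost_per_call):
    """Return (total_cost, failure_rate) when each record is sent to policy(record)."""
    total_cost = 0.0
    failures = 0
    for rec in records:
        model = policy(rec)
        total_cost += cost_per_call[model]
        if scores[(rec["id"], model)] <= 2:  # failure threshold used by ERI: score <= 2
            failures += 1
    return total_cost, failures / len(records)


# Made-up records and rubric scores, keyed by (record id, model).
records = [
    {"id": 1, "intent": "CALC", "difficulty": "undergraduate"},
    {"id": 2, "intent": "CALC", "difficulty": "graduate"},
    {"id": 3, "intent": "DES", "difficulty": "graduate"},
    {"id": 4, "intent": "CALC", "difficulty": "undergraduate"},
]
scores = {
    (1, CHEAP_MODEL): 4, (1, STRONG_MODEL): 5,
    (2, CHEAP_MODEL): 2, (2, STRONG_MODEL): 4,
    (3, CHEAP_MODEL): 2, (3, STRONG_MODEL): 5,
    (4, CHEAP_MODEL): 2, (4, STRONG_MODEL): 4,
}
cost_per_call = {CHEAP_MODEL: 1.0, STRONG_MODEL: 10.0}  # arbitrary cost units

routed = evaluate_policy(records, route, scores, cost_per_call)
baseline = evaluate_policy(records, always_strong, scores, cost_per_call)
print("routed:  ", routed)
print("baseline:", baseline)
```

On this toy data the router cuts cost from 40.0 to 22.0 units while the failure rate rises from 0.0 to 0.25, which is the trade-off the experiment is meant to expose on real slices.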
Agent Features
Planning
- tool planning evaluation (calculator/solver integration)
Tool Use
- calculator/solver/checker integration
- planned tool-trace extension for agent evaluation
Frameworks
- LLM-as-a-judge
- Self-Instruct generation
Is Agentic
true
Architectures
- MoE
- model-as-judge panel
Collaboration
- multi-judge aggregation across providers
Optimization Features
Training Optimization
- SFT
- LoRA
Inference Optimization
- routing low-rigor slices to smaller models to save cost
Reproducibility
Code URLs
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Synthetic generation inherits generator model biases and temporal anchoring.
- Not a substitute for jurisdiction-specific code compliance or stamped engineering review.
- Coverage excludes some engineering disciplines (e.g., biomedical, nuclear) and is fixed to chosen subdomains.
- Optimization risk: teams can overfit to ERI prompt styles and inflate scores.
When Not To Use
- Certifying safety or legal compliance
- Directly automating high-stakes control without human oversight
- Assuming coverage of engineering topics outside the 55 subdomains
Failure Modes
- Silent constraint violations (units, bounds) despite plausible prose
- Overfitting to benchmark phrasing (teaching to the test)
- Reference errors from synthetic generator on time-sensitive code or standards
- Judge bias when relying on a single judge model
Core Entities
Models
- GPT-5
- Claude Sonnet 4
- DeepSeek V3.1
- Claude Haiku 4.5
- GPT-4.1 Mini
- Mistral Small 3
- Mistral 7B
- Llama 3.3 70B
- Qwen 2.5 7B
- Llama 3.1 8B
- GPT-5.1
Metrics
- mean score (1-5)
- failure rate (score ≤2)
- Spearman inter-judge correlation
- σ (per-item standard deviation)
- perfect-score rate
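A minimal sketch of how the metrics listed above could be computed from multi-judge rubric scores. Several choices here are assumptions rather than the paper's definitions: items are aggregated by the mean across judges, failure and perfect-score rates are computed on that aggregated score, and σ is the population standard deviation across judges for each item.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def summarize(scores_by_judge):
    """scores_by_judge maps a judge name to a list of 1-5 scores, aligned by item."""
    per_item = list(zip(*scores_by_judge.values()))
    item_means = [mean(s) for s in per_item]
    n = len(item_means)
    return {
        "mean_score": mean(item_means),
        "failure_rate": sum(m <= 2 for m in item_means) / n,
        "perfect_rate": sum(m == 5 for m in item_means) / n,
        "mean_sigma": mean([pstdev(s) for s in per_item]),
    }


# Made-up scores from two hypothetical judges over five items.
judges = {"judge_a": [5, 4, 2, 5, 1], "judge_b": [5, 3, 2, 4, 2]}
print(summarize(judges))
print(spearman(judges["judge_a"], judges["judge_b"]))
```

With more than two judges, one option is to report the Spearman correlation for each judge pair, so that a single outlier judge is visible rather than averaged away.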
Datasets
- ERI-Benchmark (mznaser/ERI-Benchmark on HuggingFace)
Benchmarks
- MMLU
- BIG-bench
- GSM8K
- EngDesign
- SoM-1K

