Overview
ERI is ready as a regression and diagnostic tool for engineering LLM evaluation; use it with human review and slice audits before trusting outputs for high-stakes decisions.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/7
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
ERI lets teams measure engineering-specific failure modes (units, constraints, verification) so they can route tasks to cheaper models where it is safe and reserve frontier models for high-risk work.
Summary TLDR
ERI is a large, taxonomy-driven benchmark built to test engineering reasoning and instruction following in LLMs and agent pipelines. It covers 9 engineering fields, 55 subdomains, 7 intent types (definition, explanation, calculation, comparison, design, troubleshooting, code), and 3 difficulty tiers, producing 57,750 structured instruction-response records. The authors release JSONL splits, taxonomy specs, validation scripts, and an evaluation harness. In a 10% stratified test, frontier LLMs (GPT-5, Claude Sonnet 4, DeepSeek V3.1) averaged >4.30/5, smaller 7–8B models averaged ~3.0 with >10% failures, and the team bounds hallucination contamination in references to ~1.7%. ERI is meant for R
Problem Statement
General benchmarks miss engineering-specific checks like units, constraints, and verification steps. ERI addresses that gap by forcing structured coverage across field, intent, and difficulty so teams can detect silent failures (e.g., violating constraints or inventing assumptions) and make reliable routing, fine-tuning, or audit decisions.
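To make the idea of a silent constraint violation concrete, here is a minimal sketch of an automatic check that inspects the last number-with-unit pair in a response. The function name, the regular expression, and the example bounds are hypothetical and are not taken from the ERI harness; a real checker would need proper unit parsing and conversion.

```python
import re

# Crude pattern for "<number> <unit>"; an illustrative assumption, not ERI's parser.
# It will also match things like "1 gives", so only the LAST match is inspected.
VALUE_WITH_UNIT = re.compile(r"(-?\d+(?:\.\d+)?)\s*([A-Za-z%°/]+)")


def check_final_value(response: str, expected_unit: str, lo: float, hi: float) -> list[str]:
    """Return a list of violations for the last number-with-unit pair in the response."""
    matches = VALUE_WITH_UNIT.findall(response)
    if not matches:
        return ["no numeric value with a unit found"]
    value_text, unit = matches[-1]
    value = float(value_text)
    problems = []
    if unit != expected_unit:
        problems.append(f"unit {unit!r} != expected {expected_unit!r}")
    if not lo <= value <= hi:
        problems.append(f"value {value} outside [{lo}, {hi}]")
    return problems


# Plausible prose, but the stated value breaks a (made-up) 100-600 kPa bound.
answer = "The wall thickness is adequate, so the operating pressure is 850 kPa."
print(check_final_value(answer, expected_unit="kPa", lo=100, hi=600))
```

A check like this only flags the response; deciding whether the violation matters still needs a rubric or a human reviewer.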
Main Contribution
A 57,750-item instruction-response dataset with per-item metadata (field, subdomain, intent, difficulty).
A taxonomy covering 9 engineering fields and 55 subdomains to guarantee slice-level coverage.
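The per-item metadata is what makes slice-level auditing possible. The sketch below parses a few JSONL lines and filters them by metadata. The key names (`field`, `subdomain`, `intent`, `difficulty`) follow the metadata listed above, but the exact keys and label values in the released files are an assumption, and the three records are invented for illustration.

```python
import json
from collections import Counter

# Invented records; real ERI key names and label values may differ.
SAMPLE_JSONL = """\
{"field": "mechanical", "subdomain": "thermodynamics", "intent": "calculation", "difficulty": "undergraduate", "instruction": "q1", "response": "a1"}
{"field": "mechanical", "subdomain": "thermodynamics", "intent": "design", "difficulty": "graduate", "instruction": "q2", "response": "a2"}
{"field": "electrical", "subdomain": "power systems", "intent": "calculation", "difficulty": "graduate", "instruction": "q3", "response": "a3"}
"""


def load_records(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def select_slice(records, **criteria):
    """Keep records whose metadata matches every key=value pair given."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


records = load_records(SAMPLE_JSONL)
calc_items = select_slice(records, intent="calculation")
grad_calc_items = select_slice(records, intent="calculation", difficulty="graduate")

# Count items per (field, intent, difficulty) to spot empty or thin slices.
slice_counts = Counter((r["field"], r["intent"], r["difficulty"]) for r in records)
print(len(calc_items), len(grad_calc_items), slice_counts)
```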
Key Findings
ERI contains 57,750 instruction-response records with explicit metadata.
Frontier models deliver near-expert average quality on ERI.
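Average quality can hide weak slices, so it helps to report mean, spread, and failure rate per model and slice. The following is a minimal sketch with invented scores on a 1-5 rubric. The model names, the rows, and the failure threshold `FAIL_BELOW` are assumptions; this summary does not state how the authors define a failure.

```python
from collections import defaultdict
from statistics import mean, pstdev

FAIL_BELOW = 2.0  # assumption: a rubric score under 2 counts as a failure


def summarize(rows):
    """Group (model, intent, score) rows and compute per-group statistics."""
    grouped = defaultdict(list)
    for model, intent, score in rows:
        grouped[(model, intent)].append(score)
    return {
        key: {
            "n": len(scores),
            "mean": mean(scores),
            "std": pstdev(scores),  # population standard deviation
            "failure_rate": sum(s < FAIL_BELOW for s in scores) / len(scores),
        }
        for key, scores in grouped.items()
    }


# Invented rubric scores for two hypothetical models on one slice.
rows = [
    ("model-a", "CALC", 5), ("model-a", "CALC", 4),
    ("model-a", "CALC", 4), ("model-a", "CALC", 1),
    ("model-b", "CALC", 3), ("model-b", "CALC", 3),
]
summary = summarize(rows)
for key, stats in summary.items():
    print(key, stats)
```

In this toy data, model-a has the higher mean but also the only failure, which is the kind of pattern a single average would hide.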
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size | 57,750 records (1,155 cells; 50 samples per cell) | — | — | full | Sec 4.1; Abstract | Sec 4.1 |
| Top model mean score | GPT-5 mean 4.48/5 (σ=0.49) | — | — | 10% stratified test | Sec 6.2, Fig.3 | Sec 6.2 |
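The dataset-size row is consistent with 55 subdomains × 7 intents × 3 difficulty tiers = 1,155 cells, and 1,155 × 50 = 57,750 records. Reading a cell as one (subdomain, intent, tier) combination is our interpretation of those numbers, not something the summary states. Under that assumption, the sketch below builds synthetic stand-in records and draws a 10% sample stratified by cell; the function name, the seed, and the rounding rule are illustrative choices, not the authors' procedure.

```python
import random
from collections import defaultdict

N_SUBDOMAINS, N_INTENTS, N_TIERS, PER_CELL = 55, 7, 3, 50


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every cell so no slice is dropped."""
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for rec in records:
        by_cell[key(rec)].append(rec)
    sample = []
    for cell in sorted(by_cell):
        items = by_cell[cell]
        k = round(len(items) * fraction)  # assumption: round to nearest integer
        sample.extend(rng.sample(items, k))
    return sample


def cell_key(rec):
    return (rec["subdomain"], rec["intent"], rec["tier"])


# Synthetic stand-ins for ERI records: integer labels only, no real content.
records = [
    {"subdomain": s, "intent": i, "tier": t, "idx": n}
    for s in range(N_SUBDOMAINS)
    for i in range(N_INTENTS)
    for t in range(N_TIERS)
    for n in range(PER_CELL)
]

test_split = stratified_sample(records, cell_key, fraction=0.10)
n_cells = len({cell_key(r) for r in records})
print(n_cells, len(records), len(test_split))
```

With 50 records per cell, a 10% draw takes 5 per cell, so every cell stays represented in the sample.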
What To Try In 7 Days
Load the ERI test split from Hugging Face and run your model on a few targeted slices, e.g., calculation (CALC) and design (DES).
Use the provided evaluation harness to collect automatic checks and rubric scores for those slices.
Set up a simple router: send undergraduate CALC tasks to a cheaper model and graduate/DES tasks to a stronger model, then measure the change in cost and failure rate.
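The router step can be sketched as a lookup on slice metadata plus a cost and failure comparison against an all-strong baseline. Everything numeric here is made up: the model names, relative costs, and per-slice failure rates are placeholders you would replace with your own measurements from the earlier steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    intent: str      # e.g. "CALC" or "DES"
    difficulty: str  # e.g. "undergraduate" or "graduate"


CHEAP, STRONG = "cheap-model", "strong-model"       # hypothetical model names
SAFE_SLICES = {("CALC", "undergraduate")}            # slices trusted to the cheap model
COST = {CHEAP: 1.0, STRONG: 10.0}                    # made-up relative cost per call
FAIL_RATE = {                                        # made-up measured failure rates
    (CHEAP, "CALC", "undergraduate"): 0.04,
    (CHEAP, "DES", "graduate"): 0.20,
    (STRONG, "CALC", "undergraduate"): 0.02,
    (STRONG, "DES", "graduate"): 0.05,
}


def route(task):
    """Send known-safe slices to the cheap model, everything else to the strong one."""
    return CHEAP if (task.intent, task.difficulty) in SAFE_SLICES else STRONG


def always_strong(task):
    return STRONG


def expected(tasks, policy):
    """Return (total cost, expected failure rate) for a routing policy."""
    total_cost = 0.0
    expected_failures = 0.0
    for task in tasks:
        model = policy(task)
        total_cost += COST[model]
        expected_failures += FAIL_RATE[(model, task.intent, task.difficulty)]
    return total_cost, expected_failures / len(tasks)


tasks = [Task("CALC", "undergraduate")] * 3 + [Task("DES", "graduate")]
routed_cost, routed_fail = expected(tasks, route)
base_cost, base_fail = expected(tasks, always_strong)
print(f"routed: cost={routed_cost}, fail={routed_fail:.4f}")
print(f"all-strong: cost={base_cost}, fail={base_fail:.4f}")
```

The comparison makes the trade-off explicit: routing is only worthwhile if the cost saving justifies whatever increase in failure rate you measure on the slices moved to the cheaper model.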
Agent Features
Is Agentic
Yes
Risks & Boundaries
Limitations
Synthetic generation inherits generator model biases and temporal anchoring.
Not a substitute for jurisdiction-specific code compliance or stamped engineering review.
When Not To Use
Certifying safety or legal compliance
Directly automating high-stakes control without human oversight
Failure Modes
Silent constraint violations (units, bounds) despite plausible prose
Overfitting to benchmark phrasing (teaching to the test)

