EFFIBENCH: a 1,000-problem benchmark that measures runtime and memory of LLM-generated Python solutions

Overview

Decision Snapshot: Ready For Pilot

Good evidence: experiments on 1,000 tasks and 42 models show consistent gaps. Some environment dependence and dataset scope limit generality to Python/LeetCode tasks.

Citations: 4

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 80%

Novelty: 65%

Authors

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Model-generated code can be functionally correct but significantly slower and more memory-hungry; this raises real costs in production, cloud bills, latency-sensitive services, and energy footprint.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder, CTO

Summary TLDR

EFFIBENCH is a new benchmark of 1,000 LeetCode problems, each paired with a human canonical solution, plus an automated test generator; together they measure the execution time and memory of LLM-generated Python code. The authors run 42 models (35 open, 7 closed) and show that model-generated code is commonly slower and uses more memory than optimized human solutions (e.g., GPT-4 code is on average 3.12x slower on these tasks; its worst cases reach 13.89x the execution time and 43.92x the total memory usage). The repo and a Hugging Face leaderboard are public.

Problem Statement

Existing code-generation benchmarks measure correctness but not runtime or memory. EFFIBENCH collects algorithmic, efficiency-critical LeetCode tasks, pairs each with an executable human 'canonical' solution, and evaluates LLM outputs on execution time and memory under many test cases to quantify efficiency gaps.

Main Contribution

EFFIBENCH: a benchmark of 1,000 efficiency-critical LeetCode problems with executable human canonical solutions.

A test-case generator and automated pipeline that measures execution time (ET) and multiple memory metrics (MU, TMU) and their normalized forms.
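As a rough illustration of how such metrics can be collected, the following is a minimal sketch and not the authors' pipeline. It assumes wall-clock time from `time.perf_counter` as a stand-in for ET and the peak traced allocation from `tracemalloc` as a stand-in for MU; TMU is not covered. The functions `measure`, `normalized`, `canonical_solution`, and `generated_solution` are hypothetical names introduced for this example.

```python
import time
import tracemalloc


def measure(func, *args):
    """Return (elapsed_seconds, peak_bytes) for a single call of func(*args)."""
    tracemalloc.start()
    start = time.perf_counter()
    func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak


def normalized(generated, canonical):
    """Ratio of a generated-solution metric to the canonical one (canonical = 1x)."""
    return generated / canonical


# Hypothetical pair of solutions to the same task: sum of squares below n.
def canonical_solution(n):
    return sum(i * i for i in range(n))  # streams values, no intermediate list


def generated_solution(n):
    squares = [i * i for i in range(n)]  # materializes the whole list first
    return sum(squares)


if __name__ == "__main__":
    et_c, mu_c = measure(canonical_solution, 200_000)
    et_g, mu_g = measure(generated_solution, 200_000)
    print(f"time ratio ~ {normalized(et_g, et_c):.2f}x")
    print(f"peak memory ratio ~ {normalized(mu_g, mu_c):.2f}x")
```

Tracing allocations slows execution, so a more careful setup would probably time and trace in separate runs and average over many test cases.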

Key Findings

Model-generated code is usually slower than optimized human solutions.

Numbers: GPT-4 average NET = 3.12x (generated time / canonical time)

Practical Use: Expect model-completed Python functions to run multiple times slower than textbook implementations; add profiling and optimization before production use.

Evidence Ref: Abstract; Table 3 (GPT-4 NET=3.12)

Extreme inefficiencies occur on some tasks.

Numbers: Max observed NET = 13.89x and max observed normalized total memory usage (NTMU) = 43.92x for GPT-4

Practical Use: Spot-check worst-case tasks and add resource/time limits; do not trust correctness alone as proof of acceptable cost.

Evidence Ref: Abstract; Table 3 (GPT-4 max NET=13.89, max NTMU=43.92)
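One way to enforce a time limit on generated code is to run it in a child interpreter with a timeout. The sketch below is an assumption about how that could look, not something taken from the benchmark; `run_with_time_limit` is a hypothetical helper name. It covers time only; on Unix, a memory cap could be added separately (for example via `resource.setrlimit`), which is not shown here.

```python
import subprocess
import sys


def run_with_time_limit(source, timeout_s):
    """Run Python source in a child interpreter.

    Returns True only if the child exits with code 0 within timeout_s seconds.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout_s,  # the child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Running the code in a separate process also keeps a runaway solution from stalling the caller.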

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average normalized execution time (NET)	GPT-4: 3.12x (generated / canonical)	canonical solution = 1x	≈ +2.12x	EFFIBENCH (1,000 tasks, 100 tests each by default)	Table 3 reports NET for GPT-4 = 3.12	Table 3
Maximum normalized execution time (max NET)	GPT-4: 13.89x (worst-case among correct solutions)	canonical solution = 1x	≈ +12.89x	EFFIBENCH (max over tasks)	Table 3 shows GPT-4 max NET = 13.89	Table 3

What To Try In 7 Days

Run EFFIBENCH on a representative subset of your code-generation pipeline to measure NET and NTMU for critical functions.

Add a lightweight profiler step after model output: reject or flag implementations whose NET or NTMU exceed a threshold (e.g., 2x).

Use model outputs as drafts: apply automated heuristics or a small edit model to replace obvious inefficient patterns (e.g., avoid naive sorts or full DP matrices).
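The flagging step suggested above can be sketched as a simple ratio check. This is a minimal illustration under stated assumptions: `Measurement` and `flag_inefficient` are hypothetical names, the measurements are assumed to come from your own profiler, and `memory_usage` stands for whichever memory metric you collect (for instance TMU).

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    execution_time_s: float
    memory_usage: float  # any consistent memory metric, same unit for both sides


def flag_inefficient(generated, canonical, threshold=2.0):
    """Return the reasons a generated solution exceeds threshold x the canonical one."""
    reasons = []
    time_ratio = generated.execution_time_s / canonical.execution_time_s
    memory_ratio = generated.memory_usage / canonical.memory_usage
    if time_ratio > threshold:
        reasons.append(f"time ratio {time_ratio:.2f}x exceeds {threshold:.2f}x")
    if memory_ratio > threshold:
        reasons.append(f"memory ratio {memory_ratio:.2f}x exceeds {threshold:.2f}x")
    return reasons


if __name__ == "__main__":
    canonical = Measurement(execution_time_s=1.0, memory_usage=100.0)
    generated = Measurement(execution_time_s=3.12, memory_usage=150.0)
    print(flag_inefficient(generated, canonical))
```

An empty list means the solution passes the gate; a non-empty list can be logged or used to route the function to manual review.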

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/huangd1999/EffiBench

https://huggingface.co/spaces/EffiBench/effibench-leaderboard

Data URLs

LeetCode (problems used) via LeetCode site

Dataset artifacts in GitHub repo (canonical solutions and test generators)

Risks & Boundaries

Limitations

Only Python is supported; other languages are not included (Appendix A.1).

Dataset is LeetCode-focused and favors algorithmic problems, not real-world codebases.

When Not To Use

When evaluating non-algorithmic or application-level code (web services, GUIs).

When you need language coverage beyond Python (C++, Java, JS, Go).

Failure Modes

LLMs output functionally correct code but with much worse time or memory complexity.

Limited test-case coverage can hide inefficiencies; results depend on chosen tests.
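To make the first failure mode concrete, here is a hypothetical pair of functions that return the same answer but differ in memory complexity. The task (length of the longest common subsequence) is chosen only for illustration and is not taken from the benchmark. The first version keeps the full DP matrix; the second keeps two rows.

```python
def lcs_full_matrix(a, b):
    """LCS length using a full (len(a)+1) x (len(b)+1) table: O(n*m) memory."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def lcs_two_rows(a, b):
    """Same recurrence, but only the previous and current rows are kept: O(m) memory."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]
```

Both functions pass the same correctness tests, so a pass@1-style check alone would not distinguish them; the difference only shows up when memory is measured on sufficiently large inputs.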

Core Entities

Models

gpt-4, gpt-4-turbo-preview, gpt-3.5-turbo-0301, gpt-3.5-turbo-0613, gpt-3.5-turbo-1106, claude-3-haiku, claude-3-sonnet, starcoder2-15b, starcoder2-7b, starcoder, CodeLlama-70b, CodeLlama-34b, OpenCodeInterpreter-DS-33B, deepseek-coder-33b

Metrics

Execution Time (ET), Normalized Execution Time (NET), Max Memory Usage (MU), Normalized Max Memory Usage (NMU), Total Memory Usage (TMU), Normalized Total Memory Usage (NTMU), pass@1

Datasets

LeetCode (collected problems), HumanEval, MBPP, HumanEvalPlus, MBPPPlus, DS-1000

Benchmarks

EFFIBENCH, HumanEval, MBPP, APPS, DS-1000

