EFFIBENCH: a 1,000-problem benchmark that measures runtime and memory of LLM-generated Python solutions

February 3, 2024, 7 min read

Overview

Decision Snapshot: Ready For Pilot

Good evidence: experiments on 1,000 tasks and 42 models show consistent efficiency gaps. Results depend partly on the execution environment, and the dataset's scope limits generality to Python solutions of LeetCode-style tasks.

Citations: 4

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 80%

Novelty: 65%

Authors

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Model-generated code can be functionally correct yet significantly slower and more memory-hungry, which raises cloud bills, hurts latency-sensitive services, and increases energy footprint in production.

Who Should Care

Summary TLDR

EFFIBENCH is a new benchmark of 1,000 LeetCode problems, each paired with a human canonical solution, plus an automated test generator; it measures the execution time and memory usage of LLM-generated Python code. The authors run 42 models (35 open, 7 closed) and show that model-generated code is commonly slower and uses more memory than optimized human solutions (e.g., GPT-4 code takes on average 3.12x the execution time of the canonical solutions; its worst cases reach 13.89x the execution time and 43.92x the total memory usage). The repo and a Hugging Face leaderboard are public.

Problem Statement

Existing code-generation benchmarks measure correctness but not runtime or memory. EFFIBENCH collects algorithmic, efficiency-critical LeetCode tasks, pairs each with an executable human 'canonical' solution, and evaluates LLM outputs on execution time and memory under many test cases to quantify efficiency gaps.

Main Contribution

EFFIBENCH: a benchmark of 1,000 efficiency-critical LeetCode problems with executable human canonical solutions.

A test-case generator and automated pipeline that measures execution time (ET) and multiple memory metrics (MU, TMU) and their normalized forms.
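The kind of measurement such a pipeline performs can be sketched with the Python standard library. This is a minimal illustration, not the authors' harness: the `measure` helper and the two toy solutions are hypothetical, and `tracemalloc` only tracks allocations made through Python's allocator, so the peak it reports is a rough stand-in for MU and does not attempt TMU.

```python
import time
import tracemalloc


def measure(func, *args, repeats=5):
    """Return (best wall-clock seconds, peak traced bytes) for func(*args)."""
    best_time = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best_time = min(best_time, time.perf_counter() - start)

    # Measure memory in a separate run so tracing overhead does not skew timing.
    tracemalloc.start()
    func(*args)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best_time, peak_bytes


def canonical_sum(n):
    # Hypothetical "canonical" solution: consumes the range lazily.
    return sum(range(n))


def generated_sum(n):
    # Hypothetical "generated" solution: materializes a full list first.
    return sum(list(range(n)))


if __name__ == "__main__":
    n = 200_000
    c_time, c_mem = measure(canonical_sum, n)
    g_time, g_mem = measure(generated_sum, n)
    print(f"time ratio:   {g_time / c_time:.2f}x")
    print(f"memory ratio: {g_mem / max(c_mem, 1):.2f}x")
```

Both functions return the same answer, so a correctness-only benchmark would not distinguish them; the ratios printed here are the kind of signal the normalized metrics are meant to capture. Absolute numbers will vary by machine and Python version.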

Key Findings

Model-generated code is usually slower than optimized human solutions.

Numbers: GPT-4 average NET = 3.12x (generated time / canonical time)

Practical Use: Expect model-completed Python functions to take several times as long as the canonical implementations; profile and optimize them before production use.

Evidence Ref: Abstract; Table 3 (GPT-4 NET=3.12)

Extreme inefficiencies occur on some tasks.

Numbers: Max observed NET = 13.89x and max observed normalized total memory usage (NTMU) = 43.92x for GPT-4

Practical Use: Spot-check worst-case tasks and add resource/time limits; do not treat correctness alone as proof of acceptable cost.

Evidence Ref: Abstract; Table 3 (GPT-4 max NET=13.89, max NTMU=43.92)
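The two findings above describe an average and a maximum of the same ratio, and a small worked example shows how far apart they can sit. The sketch below assumes NET is computed per task as generated time divided by canonical time and then averaged across tasks; the task names and timings are invented for illustration and are not taken from the paper.

```python
from statistics import mean


def normalized_execution_times(timings):
    """Map task id -> NET, where NET = generated time / canonical time."""
    return {
        task: generated / canonical
        for task, (generated, canonical) in timings.items()
    }


# Hypothetical per-task timings in milliseconds: (generated, canonical).
timings = {
    "task_a": (300, 100),
    "task_b": (120, 100),
    "task_c": (1400, 100),
}

nets = normalized_execution_times(timings)
average_net = mean(nets.values())
worst_task = max(nets, key=nets.get)

print(f"average NET = {average_net:.2f}x")
print(f"max NET = {nets[worst_task]:.2f}x on {worst_task}")
```

With these made-up numbers the average is pulled up by a single slow task, which is why it helps to report the worst case alongside the mean.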

Results

Metric: Average normalized execution time (NET)
Value: 3.12x for GPT-4 (generated / canonical)
Baseline: canonical solution = 1x
Delta: ≈ +2.12x
Split / Dataset: EFFIBENCH (1,000 tasks, 100 tests each by default)
Evidence: Table 3 reports NET for GPT-4 = 3.12
Evidence Ref: Table 3

Metric: Maximum normalized execution time (max NET)
Value: 13.89x for GPT-4 (worst case among correct solutions)
Baseline: canonical solution = 1x
Delta: ≈ +12.89x
Split / Dataset: EFFIBENCH (max over tasks)
Evidence: Table 3 shows GPT-4 max NET = 13.89
Evidence Ref: Table 3

What To Try In 7 Days

Run EFFIBENCH on a representative subset of your code-generation pipeline to measure NET and NTMU for critical functions.

Add a lightweight profiler step after model output: reject or flag implementations whose NET or NTMU exceeds a threshold (e.g., 2x).

Use model outputs as drafts: apply automated heuristics or a small edit model to replace obvious inefficient patterns (e.g., avoid naive sorts or full DP matrices).
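The threshold check suggested in the list above could look roughly like the following. It is a sketch under stated assumptions: `EfficiencyReport` and `flag_inefficient` are hypothetical names, the NET and NTMU values are assumed to come from your own measurement step, and the 2x limits simply mirror the example threshold mentioned above.

```python
from dataclasses import dataclass


@dataclass
class EfficiencyReport:
    # Hypothetical container; the field names mirror the benchmark's metrics.
    net: float   # normalized execution time (generated / canonical)
    ntmu: float  # normalized total memory usage (generated / canonical)


def flag_inefficient(report, net_limit=2.0, ntmu_limit=2.0):
    """Return a description of each metric that exceeds its threshold."""
    violations = []
    if report.net > net_limit:
        violations.append(f"NET {report.net:.2f}x > {net_limit:.2f}x")
    if report.ntmu > ntmu_limit:
        violations.append(f"NTMU {report.ntmu:.2f}x > {ntmu_limit:.2f}x")
    return violations


if __name__ == "__main__":
    candidates = {
        "fast_enough": EfficiencyReport(net=1.3, ntmu=1.1),
        "needs_review": EfficiencyReport(net=3.12, ntmu=1.8),
    }
    for name, report in candidates.items():
        problems = flag_inefficient(report)
        status = "FLAG" if problems else "ok"
        print(name, status, problems)
```

Returning the list of violations rather than a bare boolean keeps the reason for a rejection visible in logs or review comments.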

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

LeetCode (problems used) via LeetCode site

Dataset artifacts in GitHub repo (canonical solutions and test generators)

Risks & Boundaries

Limitations

Only Python is supported; other languages are not included (Appendix A.1).

Dataset is LeetCode-focused and favors algorithmic problems, not real-world codebases.

When Not To Use

When evaluating non-algorithmic or application-level code (web services, GUIs).

When you need language coverage beyond Python (C++, Java, JS, Go).

Failure Modes

LLMs output functionally correct code but with much worse time or memory complexity.

Limited test-case coverage can hide inefficiencies; results depend on chosen tests.
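The second failure mode can be illustrated without any timing noise by counting operations instead of seconds. The two duplicate-detection functions below are invented for this example and are not benchmark tasks; both are correct, and the only difference is how their work grows with input size.

```python
def has_duplicate_quadratic(values):
    """Compare every pair; returns (answer, comparisons performed)."""
    comparisons = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            comparisons += 1
            if values[i] == values[j]:
                return True, comparisons
    return False, comparisons


def has_duplicate_linear(values):
    """Track seen values in a set; returns (answer, membership checks)."""
    seen = set()
    checks = 0
    for value in values:
        checks += 1
        if value in seen:
            return True, checks
        seen.add(value)
    return False, checks


for n in (10, 1_000):
    data = list(range(n))  # no duplicates: the worst case for both versions
    _, slow_ops = has_duplicate_quadratic(data)
    _, fast_ops = has_duplicate_linear(data)
    print(f"n={n}: operation ratio = {slow_ops / fast_ops:.1f}x")
# n=10: operation ratio = 4.5x
# n=1000: operation ratio = 499.5x
```

A test suite that only uses inputs of about ten elements would show a modest gap, while inputs a hundred times larger expose a gap roughly a hundred times wider, so the measured ratio depends heavily on the sizes the test generator chooses.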

Core Entities

Models

gpt-4, gpt-4-turbo-preview, gpt-3.5-turbo-0301, gpt-3.5-turbo-0613, gpt-3.5-turbo-1106, claude-3-haiku, claude-3-sonnet, starcoder2-15b, starcoder2-7b, starcoder, CodeLlama-70b, CodeLlama-34b, OpenCodeInterpreter-DS-33B, deepseek-coder-33b

Metrics

Execution Time (ET), Normalized Execution Time (NET), Max Memory Usage (MU), Normalized Max Memory Usage (NMU), Total Memory Usage (TMU), Normalized Total Memory Usage (NTMU), pass@1

Datasets

LeetCode (collected problems), HumanEval, MBPP, HumanEvalPlus, MBPPPlus, DS-1000

Benchmarks

EFFIBENCH, HumanEval, MBPP, APPS, DS-1000