Overview
Good evidence: experiments on 1,000 tasks and 42 models show consistent gaps. Results depend somewhat on the execution environment, and the dataset's scope limits generality to Python LeetCode tasks.
Citations: 4
Evidence Strength: 0.90
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 75%
Production readiness: 80%
Novelty: 65%
Why It Matters For Business
Model-generated code can be functionally correct yet significantly slower and more memory-hungry than a hand-optimized solution; in production this means higher cloud bills, worse latency in latency-sensitive services, and a larger energy footprint.
Summary TLDR
EFFIBENCH is a new benchmark of 1,000 LeetCode problems, each paired with a human canonical solution, plus an automated test generator; it measures the execution time and memory use of LLM-generated Python code. The authors run 42 models (35 open, 7 closed) and show that model-generated code is commonly slower and uses more memory than optimized human solutions (e.g., GPT-4's code takes on average about 3.12x the execution time of the canonical solutions on these tasks; worst cases reach about 13.9x the time and 43.9x the total memory). The repo and a Hugging Face leaderboard are public.
Problem Statement
Existing code-generation benchmarks measure correctness but not runtime or memory. EFFIBENCH collects algorithmic, efficiency-critical LeetCode tasks, pairs each with an executable human 'canonical' solution, and evaluates LLM outputs on execution time and memory under many test cases to quantify efficiency gaps.
Main Contribution
EFFIBENCH: a benchmark of 1,000 efficiency-critical LeetCode problems with executable human canonical solutions.
A test-case generator and automated pipeline that measures execution time (ET) and multiple memory metrics (MU, TMU) and their normalized forms.
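A minimal sketch of how such a measurement step could look, assuming wall-clock timing via `time.perf_counter` and peak Python-heap tracking via `tracemalloc`. The helper names (`measure_time`, `measure_peak_memory`, `normalized`) and the two toy functions are hypothetical and are not taken from the EFFIBENCH code; TMU is left out because this summary does not define it.

```python
import time
import tracemalloc


def measure_time(fn, *args, repeats=5):
    """Best-of-N wall-clock seconds for fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def measure_peak_memory(fn, *args):
    """Peak bytes allocated by Python objects while fn(*args) runs."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def normalized(generated, canonical):
    """Ratio generated / canonical; 1.0 means parity with the canonical solution."""
    if canonical <= 0:
        raise ValueError("canonical measurement must be positive")
    return generated / canonical


# Hypothetical stand-ins for a canonical and a model-generated solution.
def canonical_sum_squares(n):
    return sum(i * i for i in range(n))    # streams values, small footprint


def generated_sum_squares(n):
    return sum([i * i for i in range(n)])  # materializes the whole list first


n = 100_000
net = normalized(measure_time(generated_sum_squares, n),
                 measure_time(canonical_sum_squares, n))
peak_ratio = normalized(measure_peak_memory(generated_sum_squares, n),
                        measure_peak_memory(canonical_sum_squares, n))
print(f"time ratio: {net:.2f}x, peak-memory ratio: {peak_ratio:.2f}x")
```

Time and memory are measured in separate runs here because `tracemalloc` itself slows execution; timings also vary by machine, so ratios are only comparable within one environment.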
Key Findings
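Model-generated code is usually slower than optimized human solutions: GPT-4's average NET is 3.12x the canonical solution (Table 3).
Extreme inefficiencies occur on some tasks: GPT-4's worst-case NET among correct solutions is 13.89x (Table 3).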
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average normalized execution time (NET) | GPT-4: 3.12x (generated / canonical) | canonical solution = 1x | ≈ +2.12x | EFFIBENCH (1,000 tasks, 100 tests each by default) | Table 3 reports NET for GPT-4 = 3.12 | Table 3 |
| Maximum normalized execution time (max NET) | GPT-4: 13.89x (worst-case among correct solutions) | canonical solution = 1x | ≈ +12.89x | EFFIBENCH (max over tasks) | Table 3 shows GPT-4 max NET = 13.89 | Table 3 |
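
The Delta column appears to be the normalized value minus the canonical baseline of 1x. A short sketch of that arithmetic, using only the two NET values from the table; the function names are hypothetical.

```python
def delta_vs_canonical(net):
    """Extra cost relative to the canonical solution, in multiples of it."""
    return round(net - 1.0, 2)


def overhead_percent(net):
    """The same gap expressed as a percentage of the canonical cost."""
    return round((net - 1.0) * 100)


for label, net in [("average NET", 3.12), ("max NET", 13.89)]:
    print(label, delta_vs_canonical(net), f"{overhead_percent(net)}%")
```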
What To Try In 7 Days
Run EFFIBENCH on a representative subset of your code-generation pipeline to measure NET and NTMU for critical functions.
Add a lightweight profiler step after model output: reject or flag implementations whose NET or NTMU exceed a threshold (e.g., 2x).
Use model outputs as drafts: apply automated heuristics or a small edit model to replace obvious inefficient patterns (e.g., avoid naive sorts or full DP matrices).
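The threshold check suggested above could be sketched as follows. `EfficiencyReport` and `flag_inefficient` are hypothetical names, the 2x default mirrors the example threshold in the list, and the NTMU value in the usage line is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EfficiencyReport:
    net: float   # normalized execution time: generated / canonical
    ntmu: float  # normalized total memory usage: generated / canonical


def flag_inefficient(report, threshold=2.0):
    """Return the names of metrics above the threshold (empty list means pass)."""
    flagged = []
    if report.net > threshold:
        flagged.append("NET")
    if report.ntmu > threshold:
        flagged.append("NTMU")
    return flagged


# NET = 3.12 is GPT-4's average from Table 3; NTMU = 1.4 is illustrative only.
print(flag_inefficient(EfficiencyReport(net=3.12, ntmu=1.4)))
```

Returning the list of offending metrics, rather than a bare boolean, lets a pipeline decide separately whether to reject the output or only flag it for review.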
Risks & Boundaries
Limitations
Only Python is supported; other languages are not included (Appendix A.1).
Dataset is LeetCode-focused and favors algorithmic problems, not real-world codebases.
When Not To Use
When evaluating non-algorithmic or application-level code (web services, GUIs).
When you need language coverage beyond Python (C++, Java, JS, Go).
Failure Modes
LLMs output functionally correct code but with much worse time or memory complexity.
Limited test-case coverage can hide inefficiencies; results depend on chosen tests.
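Both failure modes can be illustrated with a toy example that is not taken from the benchmark: two functionally equivalent pair-sum checks, one quadratic and one linear, compared by counting steps instead of timing so the result is deterministic. All names are hypothetical.

```python
def pair_sum_quadratic(nums, target):
    """Checks every pair: O(n^2) comparisons in the worst case."""
    steps = 0
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            steps += 1
            if nums[i] + nums[j] == target:
                return True, steps
    return False, steps


def pair_sum_hashed(nums, target):
    """Single pass with a set of seen values: O(n) lookups."""
    seen = set()
    steps = 0
    for x in nums:
        steps += 1
        if target - x in seen:
            return True, steps
        seen.add(x)
    return False, steps


def step_ratio(n):
    """Steps taken by the quadratic version divided by the hashed version."""
    nums = list(range(n))
    target = -1  # no two non-negative ints sum to -1, forcing the worst case
    found_q, steps_q = pair_sum_quadratic(nums, target)
    found_h, steps_h = pair_sum_hashed(nums, target)
    assert found_q == found_h  # both are functionally correct
    return steps_q / steps_h


print(step_ratio(4))     # 1.5
print(step_ratio(1000))  # 499.5
```

With a 4-element test the two versions look nearly equivalent; only the larger input exposes the complexity gap, which is why measured efficiency depends on the chosen test cases.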

