Overview
Production Readiness
0.8
Novelty Score
0.65
Cost Impact Score
0.75
Citation Count
4
Why It Matters For Business
Model-generated code can be functionally correct yet significantly slower and more memory-hungry than human-written code. In production this translates into higher cloud bills, worse latency for latency-sensitive services, and a larger energy footprint.
Summary TLDR
EFFIBENCH is a benchmark of 1,000 efficiency-critical LeetCode problems, each paired with a human-written canonical solution, plus an automated test-case generator and a pipeline that measures the execution time and memory usage of LLM-generated Python code. The authors evaluate 42 models (35 open, 7 closed) and show that model-generated code is commonly slower and uses more memory than optimized human solutions (e.g., GPT-4 code takes on average ~3.12x the execution time of the canonical solutions; in the worst cases it reaches about 13.9x the execution time and 43.9x the total memory usage). The repo and a Hugging Face leaderboard are public.
Problem Statement
Existing code-generation benchmarks measure correctness but not runtime or memory. EFFIBENCH collects algorithmic, efficiency-critical LeetCode tasks, pairs each with an executable human 'canonical' solution, and evaluates LLM outputs on execution time and memory under many test cases to quantify efficiency gaps.
Main Contribution
EFFIBENCH: a benchmark of 1,000 efficiency-critical LeetCode problems with executable human canonical solutions.
A test-case generator and automated pipeline that measures execution time (ET) and multiple memory metrics (MU, TMU) and their normalized forms.
A large empirical study of 42 LLMs showing generated code is often much less efficient than canonical solutions, and a public repo plus Hugging Face leaderboard.
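The measurement step described in the second contribution can be illustrated with a minimal sketch. This is not the EFFIBENCH pipeline: it assumes the standard-library `time.perf_counter` and `tracemalloc` as stand-ins for the benchmark's profiler, and the two solution functions and the `profile_call` and `normalized` helpers are hypothetical names introduced only for this example.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def profile_call(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float, int]:
    """Return (result, elapsed_seconds, peak_bytes) for a single call of fn.

    tracemalloc slows execution, so the timings are only comparable with
    other timings taken the same way.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def normalized(candidate: float, canonical: float) -> float:
    """Ratio of a candidate measurement to the canonical one (1.0 is parity)."""
    if canonical <= 0:
        raise ValueError("canonical measurement must be positive")
    return candidate / canonical


# Hypothetical task: sum of squares below n, written two ways.
def canonical_solution(n: int) -> int:
    return sum(i * i for i in range(n))  # streams values, keeps no list


def generated_solution(n: int) -> int:
    squares = [i * i for i in range(n)]  # materializes the whole list first
    return sum(squares)


if __name__ == "__main__":
    _, t_ref, m_ref = profile_call(canonical_solution, 100_000)
    _, t_gen, m_gen = profile_call(generated_solution, 100_000)
    print(f"time ratio:   {normalized(t_gen, t_ref):.2f}")
    print(f"memory ratio: {normalized(m_gen, m_ref):.2f}")
```

Both functions return the same value, so a correctness-only check cannot tell them apart; only the memory ratio exposes the extra list.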
Key Findings
Model-generated code is usually slower than optimized human solutions.
Extreme inefficiencies occur on some tasks.
Top performers still lag the canonical baseline.
Correctness score (pass@1) does not guarantee efficiency.
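The last finding is easy to see on a small case. The sketch below is an illustrative example, not taken from the benchmark: two hypothetical solutions to a two-sum style task that pass the same correctness checks while differing in time complexity.

```python
from typing import Dict, List, Optional, Tuple


def two_sum_quadratic(nums: List[int], target: int) -> Optional[Tuple[int, int]]:
    """Check every pair: correct, but O(n^2) time."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return i, j
    return None


def two_sum_linear(nums: List[int], target: int) -> Optional[Tuple[int, int]]:
    """Single pass with a hash map: correct, O(n) time and O(n) extra memory."""
    seen: Dict[int, int] = {}  # value -> index where it was first seen
    for j, value in enumerate(nums):
        i = seen.get(target - value)
        if i is not None:
            return i, j
        seen.setdefault(value, j)
    return None


if __name__ == "__main__":
    nums, target = [2, 7, 11, 15], 9
    # Both implementations agree on this input, so pass@1 treats them alike.
    print(two_sum_quadratic(nums, target), two_sum_linear(nums, target))
```

On inputs with a single valid pair the two functions return identical answers, so only a timing or memory measurement separates them.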
Results
Average normalized execution time (NET)
Maximum normalized execution time (max NET)
Average normalized total memory usage (NTMU)
Worst-case normalized total memory (max NTMU)
Best open-source model average NET
Correctness (pass@1) vs efficiency
Who Should Care
What To Try In 7 Days
Run EFFIBENCH on a representative subset of your code-generation pipeline to measure NET and NTMU for critical functions.
Add a lightweight profiler step after model output: reject or flag implementations whose NET or NTMU exceeds a threshold (e.g., 2x).
Use model outputs as drafts: apply automated heuristics or a small edit model to replace obviously inefficient patterns (e.g., avoid naive sorts or full DP matrices).
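The threshold check suggested above could look roughly like the following. It is a sketch under stated assumptions: the `Measurement` record and `flag_inefficient` function are hypothetical names, the NET and NTMU values are assumed to have been measured elsewhere against a canonical solution, and the 2x default simply mirrors the example threshold in the list.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Measurement:
    name: str    # identifier of the generated function
    net: float   # normalized execution time (candidate / canonical)
    ntmu: float  # normalized total memory usage (candidate / canonical)


def flag_inefficient(
    measurements: Iterable[Measurement],
    max_net: float = 2.0,
    max_ntmu: float = 2.0,
) -> List[str]:
    """Return names of implementations that exceed either threshold."""
    return [
        m.name
        for m in measurements
        if m.net > max_net or m.ntmu > max_ntmu
    ]


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    batch = [
        Measurement("parse_rows", net=1.1, ntmu=1.0),
        Measurement("dedupe", net=3.4, ntmu=1.2),
        Measurement("rank_items", net=1.5, ntmu=2.6),
    ]
    print(flag_inefficient(batch))
```

Flagging rather than rejecting outright keeps a human or an edit model in the loop for cases where the slower draft is still acceptable.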
Reproducibility
Code Urls
Data Urls
- LeetCode (problems used) via LeetCode site
- Dataset artifacts in GitHub repo (canonical solutions and test generators)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only Python is supported; other languages not included (Appendix A.1).
- Dataset is LeetCode-focused and favors algorithmic problems, not real-world codebases.
- Efficiency numbers depend on the execution environment; absolute values may change.
When Not To Use
- When evaluating non-algorithmic or application-level code (web services, GUIs).
- When you need language coverage beyond Python (C++, Java, JS, Go).
- When you only care about correctness and not runtime/memory cost.
Failure Modes
- LLMs output functionally correct code whose time or memory complexity is much worse than that of the canonical solution.
- Limited test-case coverage can hide inefficiencies; results depend on chosen tests.
- Environment variation may shift absolute rankings; reproduction requires a matched environment.
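One common way to reduce run-to-run noise on a single machine is to repeat each measurement and report the median. The sketch below assumes that approach; it is not described in the source as part of EFFIBENCH, `median_runtime` is a hypothetical helper, and it does not replace running candidate and canonical code in the same environment.

```python
import statistics
import time
from typing import Any, Callable


def median_runtime(fn: Callable[..., Any], *args: Any, repeats: int = 5) -> float:
    """Median wall-clock time in seconds over `repeats` calls of fn(*args)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    data = list(range(50_000, 0, -1))
    print(f"median sort time: {median_runtime(sorted, data):.6f} s")
```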
Core Entities
Models
- gpt-4
- gpt-4-turbo-preview
- gpt-3.5-turbo-0301
- gpt-3.5-turbo-0613
- gpt-3.5-turbo-1106
- claude-3-haiku
- claude-3-sonnet
- starcoder2-15b
- starcoder2-7b
- starcoder
- CodeLlama-70b
- CodeLlama-34b
- OpenCodeInterpreter-DS-33B
- deepseek-coder-33b
Metrics
- Execution Time (ET)
- Normalized Execution Time (NET)
- Max Memory Usage (MU)
- Normalized Max Memory Usage (NMU)
- Total Memory Usage (TMU)
- Normalized Total Memory Usage (NTMU)
- pass@1
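The relation between the raw and normalized metrics can be sketched as follows. Two assumptions are made here and should be checked against the paper's definitions: that TMU is the area under the memory-over-time curve, and that each normalized metric is the per-task ratio of generated to canonical measurement averaged over tasks. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def total_memory_usage(samples: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under (time_seconds, memory_mb) samples, in MB*s."""
    area = 0.0
    for (t0, m0), (t1, m1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be ordered by time")
        area += (t1 - t0) * (m0 + m1) / 2.0
    return area


def mean_ratio(candidate: Sequence[float], canonical: Sequence[float]) -> float:
    """Average of per-task candidate/canonical ratios (e.g. NET from ET values)."""
    if not candidate or len(candidate) != len(canonical):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(c / r for c, r in zip(candidate, canonical)) / len(candidate)


if __name__ == "__main__":
    # Made-up samples: 10 MB for one second, then a ramp from 10 MB to 20 MB.
    trace = [(0.0, 10.0), (1.0, 10.0), (2.0, 20.0)]
    print(total_memory_usage(trace))  # 25.0
    # Made-up execution times for two tasks: generated vs canonical.
    print(mean_ratio([2.0, 6.0], [1.0, 2.0]))  # 2.5
```

Under these assumptions a value of 1.0 means parity with the canonical solution, and larger values mean the generated code costs proportionally more.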
Datasets
- LeetCode (collected problems)
- HumanEval
- MBPP
- HumanEvalPlus
- MBPPPlus
- DS-1000
Benchmarks
- EFFIBENCH
- HumanEval
- MBPP
- APPS
- DS-1000

