EFFIBENCH: a 1,000-problem benchmark that measures runtime and memory of LLM-generated Python solutions

February 3, 2024 | 7 min

Overview

Production Readiness

0.8

Novelty Score

0.65

Cost Impact Score

0.75

Citation Count

4

Authors

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang

Links

Abstract / PDF

Why It Matters For Business

Model-generated code can be functionally correct yet significantly slower and more memory-hungry than a human-optimized solution. In production this means higher cloud bills, added latency in latency-sensitive services, and a larger energy footprint.

Summary TLDR

EFFIBENCH is a benchmark of 1,000 LeetCode problems, each paired with a human-written canonical solution, plus an automated test-case generator. It measures the execution time and memory usage of LLM-generated Python code. The authors evaluate 42 models (35 open-source, 7 closed-source) and find that model-generated code is commonly slower and uses more memory than the canonical solutions. For example, GPT-4's code takes 3.12x the canonical execution time on average, and its worst cases reach 13.89x the execution time and 43.92x the total memory usage. The code repository and a Hugging Face leaderboard are public.

Problem Statement

Existing code-generation benchmarks measure correctness but not runtime or memory. EFFIBENCH collects algorithmic, efficiency-critical LeetCode tasks, pairs each with an executable human 'canonical' solution, and evaluates LLM outputs on execution time and memory under many test cases to quantify efficiency gaps.
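To make the idea of measuring a generated solution against a canonical one concrete, here is a minimal sketch. It is not the EFFIBENCH pipeline: it assumes Python's standard `time.perf_counter` and `tracemalloc` as the measurement tools, and the helper `measure` and both toy solutions are hypothetical names invented for illustration.

```python
import time
import tracemalloc
from typing import Any, Callable, Iterable, Tuple


def measure(func: Callable[..., Any], test_inputs: Iterable[Tuple]) -> Tuple[float, int]:
    """Return (wall-clock seconds, peak traced bytes) over all test inputs."""
    inputs = list(test_inputs)  # materialize first so input setup is not timed

    # Timing pass: tracing is off here because tracemalloc slows execution.
    start = time.perf_counter()
    for args in inputs:
        func(*args)
    elapsed = time.perf_counter() - start

    # Memory pass: rerun with allocation tracing enabled.
    tracemalloc.start()
    try:
        for args in inputs:
            func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return elapsed, peak


def canonical_has_duplicate(nums):
    # Hypothetical "canonical" solution: linear time using a set.
    return len(set(nums)) != len(nums)


def generated_has_duplicate(nums):
    # Hypothetical "generated" solution: correct, but quadratic time.
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] == nums[j]:
                return True
    return False


tests = [(list(range(n)),) for n in (200, 400, 800)]
canon_time, canon_peak = measure(canonical_has_duplicate, tests)
gen_time, gen_peak = measure(generated_has_duplicate, tests)
print(f"time ratio (generated / canonical): {gen_time / canon_time:.1f}x")
print(f"peak memory ratio (generated / canonical): {gen_peak / canon_peak:.2f}x")
```

Both toy solutions return the same answers, so a correctness-only benchmark would treat them as equal; only the time and memory ratios separate them. The ratios printed will vary by machine.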

Main Contribution

EFFIBENCH: a benchmark of 1,000 efficiency-critical LeetCode problems with executable human canonical solutions.

A test-case generator and automated pipeline that measures execution time (ET) and multiple memory metrics (MU, TMU) and their normalized forms.

A large empirical study of 42 LLMs showing generated code is often much less efficient than canonical solutions, and a public repo plus Hugging Face leaderboard.

Key Findings

Model-generated code is usually slower than optimized human solutions.

Numbers: GPT-4 average NET = 3.12x (generated time / canonical time)

Extreme inefficiencies occur on some tasks.

Numbers: for GPT-4, max observed NET = 13.89x and max observed normalized total memory usage (NTMU) = 43.92x

Top performers still lag the canonical baseline.

Numbers: best open-source model (starcoder2-15b) avg NET = 2.59x; best closed-source model (GPT-4) avg NET = 3.12x

Correctness score (pass@1) does not guarantee efficiency.

Numbers: GPT-4-turbo-preview has higher pass@1 (65.4%) than GPT-4 (50.8%) but lower code efficiency on some metrics

Results

Average normalized execution time (NET)

Value: GPT-4 = 3.12x (generated / canonical)

Baseline: canonical solution = 1x

Maximum normalized execution time (max NET)

Value: GPT-4 = 13.89x (worst case among correct solutions)

Baseline: canonical solution = 1x

Average normalized total memory usage (NTMU)

Value: GPT-4 = 6.36x

Baseline: canonical solution = 1x

Worst-case normalized total memory usage (max NTMU)

Value: GPT-4 = 43.92x (single-task extreme)

Baseline: canonical solution = 1x

Best open-source model average NET

Value: starcoder2-15b = 2.59x

Baseline: canonical solution = 1x

Correctness (pass@1) vs efficiency

Value: GPT-4-turbo-preview pass@1 = 65.4%, but its efficiency is worse than GPT-4's on some metrics

Baseline: pass@1 and NET are measured separately; a higher pass@1 does not imply a lower NET

Who Should Care

What To Try In 7 Days

Run EFFIBENCH on a representative subset of your code-generation pipeline to measure NET and NTMU for critical functions.

Add a lightweight profiler step after model output: reject or flag implementations whose NET or NTMU exceed a threshold (e.g., 2x).

Use model outputs as drafts: apply automated heuristics or a small edit model to replace obvious inefficient patterns (e.g., avoid naive sorts or full DP matrices).
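The profiler gate suggested above could look roughly like the following sketch. The function name `efficiency_flags`, its parameters, and the default 2x threshold are illustrative assumptions, not part of EFFIBENCH; the inputs are whatever time and total-memory measurements your own profiling step produces for the generated code and for a trusted reference implementation.

```python
from typing import List


def efficiency_flags(
    gen_time: float,
    ref_time: float,
    gen_total_mem: float,
    ref_total_mem: float,
    threshold: float = 2.0,  # assumed cut-off; tune for your workload
) -> List[str]:
    """Return human-readable flags for ratios above the threshold."""
    if ref_time <= 0 or ref_total_mem <= 0:
        raise ValueError("reference measurements must be positive")

    net = gen_time / ref_time             # normalized execution time
    ntmu = gen_total_mem / ref_total_mem  # normalized total memory usage

    flags = []
    if net > threshold:
        flags.append(f"NET {net:.2f}x exceeds {threshold:.2f}x")
    if ntmu > threshold:
        flags.append(f"NTMU {ntmu:.2f}x exceeds {threshold:.2f}x")
    return flags


# A generated function that is 3x slower but only 1.25x heavier on memory.
print(efficiency_flags(gen_time=3.0, ref_time=1.0,
                       gen_total_mem=50.0, ref_total_mem=40.0))
```

An empty list means the output passes the gate. Given that the best models in the study average above 2x NET, a 2x threshold will likely flag a large share of outputs, so treat it as a starting point for review rather than a hard reject.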

Reproducibility

Data Urls

  • LeetCode (problems used) via LeetCode site
  • Dataset artifacts in GitHub repo (canonical solutions and test generators)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only Python is supported; other languages not included (Appendix A.1).
  • Dataset is LeetCode-focused and favors algorithmic problems, not real-world codebases.
  • Efficiency numbers depend on the execution environment; absolute values may change.

When Not To Use

  • When evaluating non-algorithmic or application-level code (web services, GUIs).
  • When you need language coverage beyond Python (C++, Java, JS, Go).
  • When you only care about correctness and not runtime/memory cost.

Failure Modes

  • LLMs output functionally correct code but with much worse time or memory complexity.
  • Limited test-case coverage can hide inefficiencies; results depend on chosen tests.
  • Environment variation shifts absolute measurements and may change rankings; reproduction requires a matched environment.
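As an illustration of the first failure mode (correct output, worse complexity), the sketch below compares two implementations of longest-common-subsequence length. Both functions and the `peak_bytes` helper are hypothetical examples written for this summary, not taken from the benchmark; peak memory is measured with the standard `tracemalloc` module as an assumed stand-in for the paper's memory profiling.

```python
import tracemalloc


def lcs_full_table(a: str, b: str) -> int:
    """LCS length with a full (len(a)+1) x (len(b)+1) DP table: O(n*m) memory."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def lcs_two_rows(a: str, b: str) -> int:
    """Same recurrence, keeping only the previous and current rows: O(m) memory."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]


def peak_bytes(func, *args) -> int:
    """Peak bytes allocated by Python objects while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


a, b = "ab" * 150, "ba" * 150
assert lcs_full_table(a, b) == lcs_two_rows(a, b)  # identical answers
print("full table peak bytes:", peak_bytes(lcs_full_table, a, b))
print("two rows peak bytes:  ", peak_bytes(lcs_two_rows, a, b))
```

Both versions pass the same correctness tests, so a pass@1-only evaluation cannot tell them apart. Note that the gap only shows on inputs large enough to matter, which is why limited test-case coverage can hide it.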

Core Entities

Models

  • gpt-4
  • gpt-4-turbo-preview
  • gpt-3.5-turbo-0301
  • gpt-3.5-turbo-0613
  • gpt-3.5-turbo-1106
  • claude-3-haiku
  • claude-3-sonnet
  • starcoder2-15b
  • starcoder2-7b
  • starcoder
  • CodeLlama-70b
  • CodeLlama-34b
  • OpenCodeInterpreter-DS-33B
  • deepseek-coder-33b

Metrics

  • Execution Time (ET)
  • Normalized Execution Time (NET)
  • Max Memory Usage (MU)
  • Normalized Max Memory Usage (NMU)
  • Total Memory Usage (TMU)
  • Normalized Total Memory Usage (NTMU)
  • pass@1
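The normalized metrics can be read as ratios of a generated solution's measurement to the canonical solution's measurement. The sketch below assumes that each normalized metric is the per-task ratio averaged over tasks, and it treats ET, MU, and TMU as plain numbers supplied by the caller; the `TaskMeasurement` class and `normalized_metrics` function are hypothetical names, not the benchmark's API.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class TaskMeasurement:
    """Raw measurements for one task (units only need to match per pair)."""
    gen_et: float     # execution time of the generated solution
    canon_et: float   # execution time of the canonical solution
    gen_mu: float     # max memory usage, generated
    canon_mu: float   # max memory usage, canonical
    gen_tmu: float    # total memory usage, generated
    canon_tmu: float  # total memory usage, canonical


def normalized_metrics(tasks: List[TaskMeasurement]) -> Dict[str, float]:
    """Average the generated/canonical ratio over tasks (assumed aggregation)."""
    return {
        "NET": mean([t.gen_et / t.canon_et for t in tasks]),
        "NMU": mean([t.gen_mu / t.canon_mu for t in tasks]),
        "NTMU": mean([t.gen_tmu / t.canon_tmu for t in tasks]),
    }


# Two made-up tasks, purely to show the arithmetic.
tasks = [
    TaskMeasurement(gen_et=2.0, canon_et=1.0, gen_mu=30.0, canon_mu=20.0,
                    gen_tmu=8.0, canon_tmu=2.0),
    TaskMeasurement(gen_et=3.0, canon_et=2.0, gen_mu=10.0, canon_mu=10.0,
                    gen_tmu=6.0, canon_tmu=3.0),
]
print(normalized_metrics(tasks))
```

Under this reading, a value of 1x means parity with the canonical solution, and a value such as NET = 3.12x means the generated code takes 3.12 times the canonical execution time on average.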

Datasets

  • LeetCode (collected problems)
  • HumanEval
  • MBPP
  • HumanEvalPlus
  • MBPPPlus
  • DS-1000

Benchmarks

  • EFFIBENCH
  • HumanEval
  • MBPP
  • APPS
  • DS-1000