BigCodeBench: a 1,140-task Python benchmark testing multi-tool function calls and complex instructions

June 22, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

This benchmark is ready for model evaluation in research and internal product QA; expect to invest effort in running sandboxed executions and curating flaky tests before using it for high-stakes automation.

Citations: 15

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Yes

License: Apache-2.0

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra

Links

Abstract / PDF / Code / Data

Why It Matters For Business

BigCodeBench reveals real gaps in LLMs for practical coding: models commonly misuse APIs, omit setup code such as imports and constants, and perform worse on concise human instructions. Production systems should therefore include execution tests, human review, and domain-specialized models.

Summary TLDR

BigCodeBench is a new execution-based Python benchmark of 1,140 fine-grained tasks that require using many libraries (139) and composing multiple function calls. Each task has rigorous unit tests (on average 5.6 tests and 99% branch coverage). The authors build tasks via a human+LLM pipeline, provide a natural-language variant (Instruct) alongside the structured-docstring variant (Complete), and evaluate 60 models. The top model reaches ~60% Pass@1 on the structured prompts, and Pass@1 drops by an average of 8.5% on the natural-language prompts; human annotators pass 97% in a spot check. The benchmark highlights tool-use gaps, instruction-following failures, and flaky tests as practical evaluation challenges.

Problem Statement

Existing code benchmarks focus on short, self-contained problems or limited APIs. Real-world programming needs compositional tool use (many library function calls) and the ability to follow complex, concise human instructions. BigCodeBench measures how well LLMs generate correct Python code that invokes multiple libraries and satisfies strict runtime tests.

Main Contribution

A high-quality Python benchmark (BigCodeBench) of 1,140 tasks that require multi-library function-call sequences across 7 domains and 139 libraries.

Rigorous execution-based tests: average 5.6 test cases per task and 99% branch coverage for ground-truth solutions.

Key Findings

Top model solves roughly 60% of tasks on structured docstrings.

Numbers: Pass@1 = 0.602 (GPT-4o, Complete)

Practical Use: Expect even the best closed models to fail ~40% of practical, multi-tool Python tasks; do not rely solely on models for end-to-end automation without verification.

Evidence Ref: Table 6; Figure 6
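For readers unfamiliar with the metric, here is a short sketch of the widely used unbiased pass@k estimator, where n samples are drawn per task and c of them pass. It is assumed here, not confirmed by this summary, that the benchmark computes Pass@1 and Pass@5 this way; the function names are illustrative. With a single greedy sample per task (n = 1, k = 1), the score reduces to the fraction of tasks solved.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the task, c: samples that passed, k: budget.
    Uses 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, each given as (n, c)."""
    return sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)
```

For example, three tasks evaluated greedily with outcomes pass, fail, pass would be given as `[(1, 1), (1, 0), (1, 1)]` and score two thirds at k = 1.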

Performance drops when prompts are condensed to natural-language style.

Numbers: Avg Pass@1 drop = 8.5% (Complete → Instruct)

Practical Use: Instruction-tuned models can struggle with concise human instructions; prefer richer structured prompts or add clarification steps when correctness matters.

Evidence Ref: Sec 4.1, Figure 6

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pass@1 (structured prompts) | 0.602 (GPT-4o, greedy original) | (none) | (none) | BigCodeBench - Complete | Table 6; Sec 4.1 | Table 6
Pass@1 (NL-oriented prompts) | 0.499 (GPT-4o, greedy original) | 0.602 (Complete, same model) | -0.103 | BigCodeBench - Instruct | Table 7; Sec 4.1 | Table 7
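The delta in the results above is an absolute difference in Pass@1 for a single model (GPT-4o), which is distinct from the 8.5% average drop reported across models. A small sketch of the arithmetic, using only the two values listed; the helper names are illustrative:

```python
def absolute_delta(value: float, baseline: float) -> float:
    """Difference in score, rounded to three decimals to match the table."""
    return round(value - baseline, 3)


def relative_delta(value: float, baseline: float) -> float:
    """Change as a fraction of the baseline score."""
    return (value - baseline) / baseline


complete_pass1 = 0.602  # GPT-4o on BigCodeBench - Complete
instruct_pass1 = 0.499  # GPT-4o on BigCodeBench - Instruct

print(absolute_delta(instruct_pass1, complete_pass1))  # -0.103
print(f"{relative_delta(instruct_pass1, complete_pass1):.1%}")  # -17.1%
```

In other words, GPT-4o loses 10.3 points of Pass@1, or about 17% of its Complete score, when moving to the natural-language prompts.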

What To Try In 7 Days

Run BigCodeBench-Hard (small subset) on your model to sanity-check tool-use and instruction-following.

Add automated calibration to patch omitted setup (imports/constants) before execution runs.

Prioritize tests that exercise network and multi-library code paths; these are frequent failure points here.
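The calibration step mentioned above can be sketched as follows: collect the imports and top-level constants from the prompt and prepend any that the model's completion left out. This is a simplified illustration under stated assumptions, not the authors' calibration procedure; `extract_setup` and `calibrate` are hypothetical names, and the substring check is deliberately crude.

```python
import ast


def extract_setup(prompt_code: str) -> list[str]:
    """Return source text of top-level imports and assignments in the prompt."""
    tree = ast.parse(prompt_code)
    setup = []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom, ast.Assign)):
            segment = ast.get_source_segment(prompt_code, node)
            if segment is not None:
                setup.append(segment)
    return setup


def calibrate(prompt_code: str, completion: str) -> str:
    """Prepend setup statements from the prompt that the completion omitted.

    Assumption: a statement is "present" if its exact text appears anywhere
    in the completion. A stricter version would compare parsed AST nodes.
    """
    missing = [s for s in extract_setup(prompt_code) if s not in completion]
    if not missing:
        return completion
    return "\n".join(missing) + "\n\n" + completion
```

Running the patched code through the same tests as the raw output gives a rough sense of how many failures come from omitted setup instead of wrong logic.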

Risks & Boundaries

Limitations

Python-only: not directly applicable to other programming languages.

Some unit tests can be flaky (network/timeouts); results may vary slightly across runs.
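One common way to surface flaky tests before trusting a score is to repeat each test and flag any whose outcome changes between runs. The sketch below assumes a hypothetical zero-argument callable `run_test` that returns True on pass; the repeat count of 5 is arbitrary and not taken from the paper.

```python
from typing import Callable


def classify_test(run_test: Callable[[], bool], repeats: int = 5) -> str:
    """Run a test several times and label it by outcome stability."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    outcomes = {bool(run_test()) for _ in range(repeats)}
    if outcomes == {True}:
        return "stable-pass"
    if outcomes == {False}:
        return "stable-fail"
    return "flaky"  # both outcomes were observed
```

Tests labeled "flaky" are candidates for manual curation or for mocking the network call that makes them unstable.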

When Not To Use

If you only need short algorithmic correctness checks (use HumanEval instead).

For ultra-low-budget, quick model sanity checks (run the Hard subset instead).

Failure Modes

Model 'laziness': omitting imports/constants when asked to reproduce long contexts, causing false negatives.

Wrong API choice: using different function calls that are semantically close but fail tests.

Core Entities

Models

GPT-4o, GPT-4-Turbo, GPT-4, GPT-3.5-Turbo, Claude-3, Mistral, CodeLlama, Qwen, StarCoder2, Granite-Code, Mixtral, DeepSeek

Metrics

Pass@1, Pass@5, calibrated Pass@1, branch coverage, solve rate

Datasets

BigCodeBench, BigCodeBench-Instruct, BigCodeBench-Hard, HumanEval, DS-1000, ODEX, MBPP, APPS

Benchmarks

BigCodeBench, BigCodeBench - Complete, BigCodeBench - Instruct, BigCodeBench - Hard, HumanEval, DS-1000, ODEX, SWE-bench