BigCodeBench: a 1,140-task Python benchmark testing multi-tool function calls and complex instructions

June 22, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

15

Authors

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra

Why It Matters For Business

BigCodeBench reveals real gaps in LLMs for practical coding: models commonly misuse APIs, omit setup code, and perform worse on concise human instructions. Production systems should therefore include execution tests, human review, and domain-specialized models.

Summary TLDR

BigCodeBench is a new execution-based Python benchmark of 1,140 fine-grained tasks that require composing multiple function calls across 139 libraries. Each task has rigorous unit tests (5.6 tests on average, 99% branch coverage). The authors build tasks via a human+LLM pipeline, provide an NL-oriented variant (Instruct), and evaluate 60 models. The top model reaches ~60% Pass@1 on the structured prompts, and Pass@1 drops by 8.5% on average when moving to the natural-language prompts; human annotators pass 97% in a spot check. The benchmark highlights tool-use gaps, instruction-following failures, and flaky tests as practical evaluation challenges.

Problem Statement

Existing code benchmarks focus on short, self-contained problems or limited APIs. Real-world programming needs compositional tool use (many library function calls) and the ability to follow complex, concise human instructions. BigCodeBench measures how well LLMs generate correct Python code that invokes multiple libraries and satisfies strict runtime tests.

Main Contribution

A high-quality Python benchmark (BigCodeBench) of 1,140 tasks that require multi-library function-call sequences across 7 domains and 139 libraries.

Rigorous execution-based tests: average 5.6 test cases per task and 99% branch coverage for ground-truth solutions.

An NL-oriented variant (BigCodeBench-Instruct) that converts docstrings into concise natural instructions to test instruction-following.

A human + LLM construction pipeline for task synthesis, iterative refactoring, and test generation with manual curation to reduce ambiguity.

Extensive evaluation of 60 LLMs showing that the top model's performance remains far below human performance, along with specific failure modes (omitted imports, wrong API use, domain gaps).

Key Findings

Top model solves roughly 60% of tasks on structured docstrings.

Numbers: Pass@1 = 0.602 (GPT-4o, Complete)

Performance drops when prompts are condensed to natural-language style.

Numbers: Avg Pass@1 drop = 8.5% (Complete → Instruct)

Human-curated ground-truth solutions pass almost all tests in spot checks.

Numbers: 97% pass rate (32/33 sampled tasks)

Models often call different functions than the ground-truth solutions, so their code diverges from the reference at the function-call level.

Numbers: Function-call overlap: Sol. ⊆ GT = 40.46% (mean)
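One way to picture the "Sol. ⊆ GT" check is to collect the called names from each program and test for set inclusion. The sketch below uses Python's standard `ast` module; how the authors actually extract and normalize calls is not stated here, so the helpers `called_names` and `calls_subset_of` are hypothetical, and the dotted-name matching is an assumption that ignores import aliases.

```python
import ast


def _dotted_name(node: ast.AST) -> str:
    """Build a name like 'np.mean' from a call target; fall back to the
    trailing attribute(s) when the base is not a plain name."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
    return ".".join(reversed(parts))


def called_names(source: str) -> set[str]:
    """Collect the names of all functions called anywhere in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name:
                names.add(name)
    return names


def calls_subset_of(solution: str, ground_truth: str) -> bool:
    """True when every call in the solution also appears in the ground truth."""
    return called_names(solution) <= called_names(ground_truth)
```

Under this reading, a solution that reaches the same result through `statistics.mean` instead of `np.mean` falls outside the subset even if it is functionally correct, so the overlap figure measures API choice rather than correctness.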

Benchmark covers more tools and complexity than existing sets.

Numbers: 723 distinct function calls; solution length avg 426 chars; 1,140 tasks

Results

Pass@1 (structured prompts)

Value: 0.602 (GPT-4o, greedy original)

Pass@1 (NL-oriented prompts)

Value: 0.499 (GPT-4o, greedy original)

Baseline: 0.602 (Complete, same model)

Human sample pass rate

Value: 0.97 (32/33 sampled tasks pass)
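For readers unfamiliar with the metric, Pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: for a task with n generated samples of which c pass, pass@k = 1 − C(n−c, k) / C(n, k), averaged over tasks. With greedy decoding (one sample per task) Pass@1 reduces to the fraction of tasks solved. The sketch below implements that standard formula; it is not taken from the BigCodeBench evaluation code, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them pass."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds (n, c) pairs, one per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this definition, a Pass@1 of 0.602 with one greedy sample per task corresponds to roughly 686 of the 1,140 tasks passing all of their tests.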

Who Should Care

What To Try In 7 Days

Run BigCodeBench-Hard (small subset) on your model to sanity-check tool-use and instruction-following.

Add automated calibration to patch omitted setup (imports/constants) before execution runs.

Prioritize tests that exercise network and multi-library code paths; these are frequent failure points here.
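The calibration idea in the steps above can be sketched as a small pre-processing pass: if a model returns only the function body and drops the imports or constants that appeared in the prompt, prepend them before running the tests. This is an illustrative approximation, not the benchmark's own calibration code. The helper names are hypothetical, and the sketch assumes the setup is everything before the first top-level `def` and that each setup statement fits on one line.

```python
def extract_setup(prompt: str) -> list[str]:
    """Return the non-blank lines that precede the first top-level `def`."""
    setup = []
    for line in prompt.splitlines():
        if line.startswith("def "):
            break
        if line.strip():
            setup.append(line)
    return setup


def calibrate(prompt: str, completion: str) -> str:
    """Prepend setup lines from the prompt that the completion omitted."""
    present = {line.strip() for line in completion.splitlines()}
    missing = [
        line for line in extract_setup(prompt) if line.strip() not in present
    ]
    if not missing:
        # Nothing was dropped, so leave the completion untouched.
        return completion
    return "\n".join(missing) + "\n\n" + completion
```

Reporting scores both with and without such a pass separates failures caused by dropped setup code from failures in the solution logic itself.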

Reproducibility

License

  • Apache-2.0

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Python-only: not directly applicable to other programming languages.
  • Some unit tests can be flaky (network/timeouts); results may vary slightly across runs.
  • Benchmark covers many popular libraries but not all domain-specific or newly emerging ones.
  • Human+LLM construction can still leave subtle ambiguities despite curation.

When Not To Use

  • If you only need short algorithmic correctness checks (use HumanEval instead).
  • For ultra-low-budget, quick model sanity checks (run the Hard subset instead).
  • When you cannot run sandboxed executions or lack reproducible runtime environments.

Failure Modes

  • Model 'laziness': omitting imports/constants when asked to reproduce long contexts, causing false negatives.
  • Wrong API choice: using different function calls that are semantically close but fail tests.
  • Flaky tests due to network or timing issues causing unstable Pass@1 scores.
  • Instruction ambiguity: condensed NL prompts lead to misinterpretation and lower performance.
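To separate flaky tests from genuine failures, one option is to repeat each test run and flag tasks whose outcome changes between runs. The sketch below is a generic illustration rather than part of the benchmark tooling: `classify_stability` and its labels are hypothetical, and `run_test` stands for any zero-argument callable that executes a task's tests and returns whether they passed.

```python
from typing import Callable


def classify_stability(run_test: Callable[[], bool], repeats: int = 5) -> str:
    """Run a test several times and label it stable-pass, stable-fail, or flaky."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    outcomes = set()
    for _ in range(repeats):
        try:
            outcomes.add(bool(run_test()))
        except Exception:
            # Errors such as network failures or timeouts count as a failed run.
            outcomes.add(False)
    if outcomes == {True}:
        return "stable-pass"
    if outcomes == {False}:
        return "stable-fail"
    return "flaky"
```

Tasks labeled flaky can then be rerun, mocked, or excluded before comparing Pass@1 across models, so that network or timing noise is not mistaken for a model difference.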

Core Entities

Models

  • GPT-4o
  • GPT-4-Turbo
  • GPT-4
  • GPT-3.5-Turbo
  • Claude-3
  • Mistral
  • CodeLlama
  • Qwen
  • StarCoder2
  • Granite-Code
  • Mixtral
  • DeepSeek

Metrics

  • Pass@1
  • Pass@5
  • calibrated Pass@1
  • branch coverage
  • solve rate

Datasets

  • BigCodeBench
  • BigCodeBench-Instruct
  • BigCodeBench-Hard
  • HumanEval
  • DS-1000
  • ODEX
  • MBPP
  • APPS

Benchmarks

  • BigCodeBench
  • BigCodeBench - Complete
  • BigCodeBench - Instruct
  • BigCodeBench - Hard
  • HumanEval
  • DS-1000
  • ODEX
  • SWE-bench