CoSQA+: a large multi-choice code-search dataset built by test-driven agent annotations (412k pairs, agent accuracy 93.9%)

June 17, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.75

Citation Count

2

Authors

Jing Gong, Yanghui Wu, Linxi Liang, Yanlin Wang, Jiachi Chen, Mingwei Liu, Zibin Zheng

Why It Matters For Business

CoSQA+ reduces expensive human labeling by using test-driven agents to create a large, functionally verified multi-choice code-search dataset that improves retrieval performance and cuts annotation cost to roughly $0.00062 per sample.

Summary TLDR

This paper builds CoSQA+, a multi-choice code-search dataset of 412,080 query–code pairs labeled by an automated test-driven agent pipeline. The pipeline generates tests, runs them in Docker, fixes dependency and runtime issues, and uses a final arbiter to decide whether each code snippet matches its query. Measured against human-expert labels on a 1,000-sample ground truth, the agents reach 93.9% accuracy, and 83.67% of the generated test programs are executable. Models fine-tuned on CoSQA+ (CodeBERT, UniXcoder, CodeT5+) show consistent MAP@10 and MRR gains on a CSN Python test set compared with fine-tuning on CoSQA.

Problem Statement

Existing code-search benchmarks treat search as a one-to-one task and label matches by human or LLM semantic judgment rather than functional verification. In practice, developers often need several correct snippets per query (a survey found that 63.2% of queries yield multiple valid snippets). This mismatch degrades evaluation and training quality and limits scalability and accuracy.

Main Contribution

CoSQA+ dataset: 412,080 agent-labeled query–code pairs and a 1,000-sample human-verified subset.

A fully automated test-driven annotation pipeline (screener, test generator, executor, bug fixer, arbiter).

Demonstrated agent accuracy of 93.9% and an executable test program rate of 83.67%.

Empirical evidence that training on CoSQA+ improves MAP@10 and MRR over CoSQA.

Key Findings

CoSQA+ provides 412,080 labeled query–code pairs.

Numbers: 412,080 pairs; 132,952 unique codes

Test-driven agents label with high agreement to human experts.

Numbers: Agent accuracy 93.9% ±0.020 vs expert ground truth

Most generated tests run successfully.

Numbers: Executable test rate 83.67%

CoSQA+ improves model training outcomes versus CoSQA.

Numbers: CodeBERT MAP@10 0.900 → 0.938; MRR 0.939 → 0.966 (on CSN Python)

Developers often need multiple examples per query.

Numbers: Survey: 63.2% of queries yield multiple valid snippets; 66.5% need 2–3 examples

Results

Accuracy

Value: 0.939

Baseline: human without tests, 0.891

Executable test program rate

Value: 83.67%

Dataset size

Value: 412,080 pairs

Baseline: CoSQA, 20,604+ pairs (original)

CodeBERT MAP@10 (fine-tuned)

Value: 0.938

Baseline: CoSQA fine-tune, 0.900

CodeBERT MRR (fine-tuned)

Value: 0.966

Baseline: CoSQA fine-tune, 0.939

Who Should Care

What To Try In 7 Days

Download CoSQA+ verified subset and run your model's evaluation to compare MAP@10 vs your current benchmark.

Fine-tune a small retrieval model (e.g., CodeBERT or UniXcoder) on CoSQA+ training split and measure MAP@10 and MRR on a held-out CSN Python test.

Run the open-source test-driven agent on a 100–500 sample subset of your codebase to estimate executable-test rate and annotation cost.
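
For the pilot run suggested above, a small helper can turn per-sample outcomes into an executable-test rate with an uncertainty band and a rough cost figure. This is a minimal sketch, not part of the paper's tooling: the function names are hypothetical, the 200-sample outcome list is made up for illustration, and the per-sample cost of $0.00062 is simply the figure quoted in this summary, which may not match your own LLM pricing.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize_pilot(executed_ok: list[bool], cost_per_sample_usd: float) -> dict:
    """Summarize a pilot: executable-test rate, its interval, and total cost."""
    n = len(executed_ok)
    ok = sum(executed_ok)
    low, high = wilson_interval(ok, n)
    return {
        "samples": n,
        "executable_rate": ok / n,
        "rate_ci95": (low, high),
        "pilot_cost_usd": n * cost_per_sample_usd,
    }


# Hypothetical pilot: 200 samples, 168 of which produced a runnable test.
pilot = [True] * 168 + [False] * 32
report = summarize_pilot(pilot, cost_per_sample_usd=0.00062)
print(report)
```

With only 100–500 samples the interval is fairly wide, so a pilot rate a few points away from the reported 83.67% is not necessarily a meaningful difference.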

Agent Features

Memory

  • short-term execution traces and logs used per sample

Planning

  • multi-stage triage (screener → test generation → execution → bug fixing → arbiter)
  • iterative dependency installation and test rerun
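
The staged flow listed above (screener → test generation → execution → bug fixing → arbiter, with reruns after fixes) can be sketched as a small control loop. This is an illustrative skeleton only: the `Stages` container, the callable signatures, the `max_fix_rounds` limit, and the toy stand-ins are all assumptions made for this example, not the authors' implementation. In a real pipeline each callable would wrap an LLM call or a sandboxed run.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExecResult:
    passed: bool
    log: str


@dataclass
class Stages:
    # All five callables are hypothetical stand-ins for LLM- or sandbox-backed steps.
    screen: Callable[[str, str], bool]                 # cheap relevance filter
    make_test: Callable[[str, str], str]               # returns test program source
    execute: Callable[[str], ExecResult]               # runs the test in isolation
    fix: Callable[[str, str], Optional[str]]           # (test, log) -> repaired test or None
    arbitrate: Callable[[str, str, ExecResult], bool]  # final match decision


def annotate(query: str, code: str, stages: Stages, max_fix_rounds: int = 3) -> bool:
    """Label one query-code pair by walking through the stages in order."""
    if not stages.screen(query, code):
        return False
    test = stages.make_test(query, code)
    result = stages.execute(test)
    rounds = 0
    # Iteratively repair and rerun while the test fails and fixes are available.
    while not result.passed and rounds < max_fix_rounds:
        repaired = stages.fix(test, result.log)
        if repaired is None:
            break
        test = repaired
        result = stages.execute(test)
        rounds += 1
    return stages.arbitrate(query, code, result)


def _toy_execute(test: str) -> ExecResult:
    # Toy executor: fails until the "fix" stage has marked dependencies as installed.
    if "deps_installed" in test:
        return ExecResult(True, "1 passed")
    return ExecResult(False, "ModuleNotFoundError: No module named 'requests'")


toy = Stages(
    screen=lambda q, c: True,
    make_test=lambda q, c: "assert candidate() == expected",
    execute=_toy_execute,
    fix=lambda test, log: "# deps_installed\n" + test if "ModuleNotFoundError" in log else None,
    arbitrate=lambda q, c, r: r.passed,
)
label = annotate("download a url", "def candidate(): ...", toy)
print(label)
```

The toy arbiter just echoes the test outcome. The failure modes listed later in this summary (tests passing while intent differs) suggest a real arbiter would also weigh the query and code, which is why it receives all three inputs here.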

Tool Use

  • Docker for safe execution
  • package managers for dependency fixes
  • LLM calls to synthesize repair/install commands
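
As an illustration of the Docker-based execution mentioned above, the sketch below builds a `docker run` command that executes a generated test script in a throwaway container. The image name, script name, mount point, and resource limits are assumptions chosen for the example, and the summary does not describe the authors' actual container setup. Networking is off by default here and would need to be enabled for dependency-installation steps.

```python
import subprocess
from pathlib import Path


def build_docker_cmd(workdir: Path, script: str = "test_candidate.py",
                     image: str = "python:3.11-slim", network: bool = False) -> list[str]:
    """Build a `docker run` command for a single test script (values are illustrative)."""
    cmd = ["docker", "run", "--rm", "--memory", "512m", "--cpus", "1"]
    if not network:
        cmd += ["--network", "none"]  # no network unless dependencies must be fetched
    # Mount the working directory read-only and run the script from inside it.
    cmd += ["-v", f"{workdir.resolve()}:/work:ro", "-w", "/work", image, "python", script]
    return cmd


def run_in_docker(workdir: Path, network: bool = False, timeout_s: int = 60) -> tuple[bool, str]:
    """Run the test and return (passed, combined log). Requires a local Docker daemon."""
    try:
        proc = subprocess.run(build_docker_cmd(workdir, network=network),
                              capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Note: this stops the docker client; the container may need separate cleanup.
        return False, "timeout"
    return proc.returncode == 0, proc.stdout + proc.stderr


print(build_docker_cmd(Path(".")))
```

Keeping command construction separate from execution makes the sandbox settings easy to inspect and test without a Docker daemon.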

Frameworks

  • DeepSeek-V3
  • custom test generation and arbiter prompts

Is Agentic

true

Architectures

  • LLM-driven modular agent pipeline
  • embedder ensemble for candidate selection
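
This summary does not say how the embedder ensemble combines its models, so the following is only one plausible way to merge several embedders' rankings into a single candidate list. It uses reciprocal rank fusion, a standard rank-aggregation technique swapped in here for illustration. The function name, the constant `k=60`, and the default `top_n=20` (chosen to echo the 20-candidate set mentioned under Limitations) are assumptions.

```python
from collections import defaultdict


def fuse_rankings(rankings: list[list[str]], top_n: int = 20, k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per code id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, code_id in enumerate(ranking, start=1):
            scores[code_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by id for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


# Hypothetical rankings of code ids from three embedders for one query.
per_embedder = [
    ["c3", "c1", "c7"],
    ["c1", "c3", "c9"],
    ["c1", "c7", "c3"],
]
candidates = fuse_rankings(per_embedder, top_n=2)
print(candidates)  # ['c1', 'c3']
```

Rank fusion avoids comparing raw similarity scores across models whose score scales differ, which is one reason it is a common default for ensembles of retrievers.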

Collaboration

  • single automated pipeline with modular stages (not multi-agent negotiation)

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Query ambiguity: vague queries lead to multiple valid interpretations and mislabels.
  • Test misjudgment risk: correct code outside the 20-candidate set may be missed.
  • Dependence on runnable code: requires code to execute in an isolated environment.
  • Partial human verification: only 1,000 pairs human-verified due to cost.

When Not To Use

  • Datasets with closed-source or non-executable code (no runtime available).
  • Queries that require design rationale or non-functional aspects (style, performance trade-offs).
  • Contexts that demand exhaustive human-level semantic nuance.

Failure Modes

  • Ambiguous natural language leads to wrong test generation.
  • Checker and arbiter disagreements when tests pass but intent differs.
  • Dependency installation failures that cannot be fixed automatically.
  • Human labeling bias remains in the gold standard.

Core Entities

Models

  • CodeBERT
  • UniXcoder
  • CodeT5+
  • jina-embeddings-v3
  • multilingual-e5-large
  • all-MiniLM-L12-v2
  • all-mpnet-base-v2
  • DeepSeek-V3

Metrics

  • MAP@10
  • MRR
  • NDCG@10
  • Recall
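
For readers who want to reproduce the two headline metrics in a multi-choice setting, where a query can have several correct codes, here is a minimal sketch of MAP@10 and MRR. Conventions for normalizing average precision at a cutoff vary between papers. This version divides by min(number of relevant codes, k), which is an assumption and may differ from the exact definition used in the CoSQA+ evaluation. The code ids and rankings are toy data.

```python
def average_precision_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """AP@k for one query, normalized by min(len(relevant), k)."""
    hits, total = 0, 0.0
    for i, code_id in enumerate(ranked[:k], start=1):
        if code_id in relevant:
            hits += 1
            total += hits / i  # precision at this hit position
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant code, or 0.0 if none is retrieved."""
    for i, code_id in enumerate(ranked, start=1):
        if code_id in relevant:
            return 1.0 / i
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 10) -> dict[str, float]:
    """Average both metrics over (ranked list, relevant set) pairs."""
    if not runs:
        raise ValueError("runs must not be empty")
    n = len(runs)
    return {
        f"MAP@{k}": sum(average_precision_at_k(r, rel, k) for r, rel in runs) / n,
        "MRR": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Toy data: the first query has two correct codes, the second has one.
runs = [
    (["a", "x", "b", "y"], {"a", "b"}),
    (["x", "c"], {"c"}),
]
metrics = evaluate(runs)
print(metrics)
```

MRR only rewards the first correct hit, while MAP@10 credits every correct code in the top ten, so MAP@10 is the more informative of the two when queries have multiple valid answers.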

Datasets

  • CoSQA
  • CoSQA+_all
  • CoSQA+_verified
  • CodeSearchNet
  • CSN99
  • CSN Python

Benchmarks

  • CoSQA+
  • CSN Python