CoSQA+: a large multi-choice code-search dataset built by test-driven agent annotations (412k pairs, agent accuracy 93.9%)

June 17, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The pipeline is practical: agents reached 93.9% accuracy at a low annotation cost ($0.00062/sample), but query ambiguity and execution constraints prevent full automation.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 75%

Production readiness: 70%

Novelty: 65%

Authors

Jing Gong, Yanghui Wu, Linxi Liang, Yanlin Wang, Jiachi Chen, Mingwei Liu, Zibin Zheng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CoSQA+ reduces expensive human labeling by using test-driven agents to create a large, functionally verified multi-choice code-search dataset that improves retrieval performance and cuts annotation cost to roughly $0.00062 per sample.

Summary TLDR

This paper builds CoSQA+, a multi-choice code-search dataset (412,080 query–code pairs) labeled by an automated test-driven agent pipeline. The pipeline generates tests, runs them in Docker, fixes dependency and runtime issues, and uses a final arbiter to decide matches. The agents reach 93.9% accuracy against human-expert labels on a 1,000-sample ground truth, and 83.67% of the generated tests execute successfully. Models fine-tuned on CoSQA+ (CodeBERT, UniXcoder, CodeT5+) show consistent MAP@10 and MRR gains on a CSN Python test set compared with fine-tuning on CoSQA.

Problem Statement

Existing code-search benchmarks treat search as a one-to-one task and rely on human or LLM semantic judgments rather than functional verification. Real developers often need multiple correct snippets per query (survey: 63.2% of queries). This mismatch degrades both evaluation and training quality, and manual labeling limits how far such benchmarks can scale.

Main Contribution

CoSQA+ dataset: 412,080 agent-labeled query–code pairs and a 1,000-sample human-verified subset.

A fully automated test-driven annotation pipeline (screener, test generator, executor, bug fixer, arbiter).
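The stage order named above can be sketched as a single function that threads one query–code pair through five pluggable callables. This is a minimal illustration, not the authors' implementation: the function name `annotate_pair`, the `Trace` record, the callable signatures, and the `max_fix_rounds` limit are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trace:
    """Per-sample record of which stages ran and the final label."""
    stages: List[str] = field(default_factory=list)
    label: bool = False


def annotate_pair(
    query: str,
    code: str,
    screener: Callable[[str, str], bool],
    test_generator: Callable[[str, str], str],
    executor: Callable[[str, str], Tuple[bool, str]],  # (passed, log)
    bug_fixer: Callable[[str, str], str],              # (test, log) -> repaired test
    arbiter: Callable[[str, str, bool, str], bool],
    max_fix_rounds: int = 2,                           # hypothetical retry budget
) -> Trace:
    """Run screener -> test generator -> executor -> bug fixer -> arbiter."""
    trace = Trace()

    # Stage 1: cheap filter; rejected pairs are labeled as non-matching.
    trace.stages.append("screener")
    if not screener(query, code):
        return trace

    # Stage 2: synthesize a test program for this query and candidate.
    trace.stages.append("test_generator")
    test = test_generator(query, code)

    # Stage 3: execute, then alternate repair and re-execution on failure.
    trace.stages.append("executor")
    passed, log = executor(code, test)
    rounds = 0
    while not passed and rounds < max_fix_rounds:
        trace.stages.append("bug_fixer")
        test = bug_fixer(test, log)
        trace.stages.append("executor")
        passed, log = executor(code, test)
        rounds += 1

    # Stage 4: the arbiter makes the final call from the execution evidence.
    trace.stages.append("arbiter")
    trace.label = arbiter(query, code, passed, log)
    return trace
```

Passing the stages in as callables keeps the control flow testable with stubs before any LLM or container is wired in.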

Key Findings

CoSQA+ provides 412,080 labeled query–code pairs.

Numbers: 412,080 pairs; 132,952 unique codes

Practical Use: You can train retrieval models at scale on multi-choice data instead of single-choice pairs.

Evidence Ref: Table II

Test-driven agents label with high agreement to human experts.

Numbers: Agent accuracy 93.9% ± 0.020 vs expert ground truth

Practical Use: Automated test-based annotation can replace much manual labeling and scale cheaply.

Evidence Ref: Table IV

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.939 | 0.891 (human without tests) | +0.048 | 1,000-sample ground truth | Agent-based annotator accuracy vs expert majority vote | Table IV
Executable test program rate | 83.67% | - | - | CoSQA+ pipeline outputs | Proportion of generated tests that run successfully | Table II
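The accuracy row compares annotator labels with an expert majority vote. A minimal sketch of that comparison follows, assuming boolean match labels and an odd number of experts per sample. The summary does not state the panel size, so the three-expert toy data and the helper names are illustrative only.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Label chosen by most experts; assumes an odd number of votes."""
    counts = Counter(votes)
    return counts[True] > counts[False]


def accuracy(predicted: Sequence[bool], expert_votes: Sequence[Sequence[bool]]) -> float:
    """Fraction of samples where the annotator agrees with the expert majority."""
    truth: List[bool] = [majority_vote(v) for v in expert_votes]
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


# Toy data: four samples, three hypothetical experts each.
votes = [[True, True, False], [False, False, True], [True, True, True], [False, True, False]]
agent = [True, False, False, False]
print(accuracy(agent, votes))   # 0.75

# The Delta column is the plain difference between value and baseline.
print(round(0.939 - 0.891, 3))  # 0.048
```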

What To Try In 7 Days

Download the CoSQA+ verified subset and evaluate your model on it to compare MAP@10 against your current benchmark.

Fine-tune a small retrieval model (e.g., CodeBERT or UniXcoder) on the CoSQA+ training split and measure MAP@10 and MRR on a held-out CSN Python test set.

Run the open-source test-driven agent on a 100–500 sample subset of your codebase to estimate executable-test rate and annotation cost.
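For the pilot run in the last item, the two numbers to track are the share of tests that execute and the projected spend. The sketch below assumes each pilot sample yields a single pass/fail execution flag and that the reported $0.00062 figure applies per labeled pair. Your own per-sample cost will depend on model pricing and retry counts, so treat the default as a placeholder. The pilot data is made up.

```python
from typing import Iterable

SOURCE_COST_PER_SAMPLE = 0.00062  # USD per sample, as reported in this summary


def executable_rate(ran_ok: Iterable[bool]) -> float:
    """Share of generated test programs that executed successfully."""
    results = list(ran_ok)
    if not results:
        raise ValueError("need at least one pilot sample")
    return sum(results) / len(results)


def projected_cost(n_samples: int, cost_per_sample: float = SOURCE_COST_PER_SAMPLE) -> float:
    """Linear extrapolation of annotation cost to n_samples."""
    return n_samples * cost_per_sample


# Hypothetical 100-sample pilot: 84 tests ran, 16 did not.
pilot = [True] * 84 + [False] * 16
print(f"executable rate: {executable_rate(pilot):.2%}")             # 84.00%
print(f"cost for 412,080 samples: ${projected_cost(412_080):.2f}")  # $255.49
```

Comparing the pilot rate with the reported 83.67% gives a quick signal of whether your codebase is harder to execute than the paper's corpus.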

Agent Features

Memory
short-term execution traces and logs used per sample
Planning
multi-stage triage (screener → test generation → execution → bug fixing → arbiter); iterative dependency installation and test rerun
Tool Use
Docker for safe execution; package managers for dependency fixes; LLM calls to synthesize repair/install commands
Frameworks
DeepSeek-V3; custom test generation and arbiter prompts
Is Agentic

Yes

Architectures
LLM-driven modular agent pipeline; embedder ensemble for candidate selection
Collaboration
single automated pipeline with modular stages (not multi-agent negotiation)
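The "iterative dependency installation and test rerun" behavior listed under Planning can be illustrated with a small retry loop. This sketch swaps in a regular expression on the error log where the source describes LLM calls that synthesize install commands, and it takes the runner and installer as injected callables so it works without Docker. All names here (`run_with_dependency_fixes`, `run_test`, `install`, `max_rounds`) are hypothetical.

```python
import re
from typing import Callable, List, Tuple

# Matches the message Python prints when an import cannot be resolved.
MISSING_MODULE = re.compile(r"ModuleNotFoundError: No module named '([\w\.]+)'")


def run_with_dependency_fixes(
    run_test: Callable[[], Tuple[int, str]],  # returns (exit_code, stderr)
    install: Callable[[str], None],           # installs one package by name
    max_rounds: int = 3,                      # hypothetical install budget
) -> Tuple[bool, List[str]]:
    """Rerun a test, installing one missing module per failed round."""
    installed: List[str] = []
    for round_no in range(max_rounds + 1):
        exit_code, stderr = run_test()
        if exit_code == 0:
            return True, installed
        match = MISSING_MODULE.search(stderr)
        # Stop if the failure is not a missing import or the budget is spent.
        if match is None or round_no == max_rounds:
            return False, installed
        package = match.group(1).split(".")[0]  # top-level module name
        if package in installed:
            return False, installed             # installing it did not help
        install(package)
        installed.append(package)
    return False, installed
```

One known gap in the regex shortcut: the import name and the installable package name can differ (for example, `cv2` is provided by `opencv-python`), which is one reason an LLM-generated install command can do better than pattern matching.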

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Query ambiguity: vague queries lead to multiple valid interpretations and mislabels.

Test misjudgment risk: correct code outside the 20-candidate set may be missed.

When Not To Use

Datasets with closed-source or non-executable code (no runtime available).

Queries that require design rationale or non-functional aspects (style, performance trade-offs).

Failure Modes

Ambiguous natural language leads to wrong test generation.

Checker and arbiter disagreements when tests pass but intent differs.

Core Entities

Models

CodeBERT, UniXcoder, CodeT5+, jina-embeddings-v3, multilingual-e5-large, all-MiniLM-L12-v2, all-mpnet-base-v2, DeepSeek-V3

Metrics

MAP@10, MRR, NDCG@10, Recall
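For multi-choice data, MAP@10 and MRR both have to handle several relevant snippets per query. The sketch below is one common formulation, not necessarily the paper's: average precision is normalized by min(number of relevant items, k), which is an assumption. The function names and toy identifiers are illustrative.

```python
from typing import Dict, List, Set


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Mean of precision values at each rank (within top k) holding a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, code_id in enumerate(ranked[:k], start=1):
        if code_id in relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(relevant), k)  # assumed normalization; conventions vary
    return precision_sum / denom if denom else 0.0


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, code_id in enumerate(ranked, start=1):
        if code_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(rankings: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int = 10) -> Dict[str, float]:
    """Average both metrics over every query that has gold labels."""
    queries = list(gold)
    n = len(queries)
    return {
        f"MAP@{k}": sum(average_precision_at_k(rankings[q], gold[q], k) for q in queries) / n,
        "MRR": sum(reciprocal_rank(rankings[q], gold[q]) for q in queries) / n,
    }


# Toy example: q1 has two correct snippets, q2 has one.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
gold = {"q1": {"a", "c"}, "q2": {"y"}}
scores = evaluate(rankings, gold)
print(scores)
```

In the toy example, q1 scores (1/1 + 2/3) / 2 and q2 scores 1/2, so MAP@10 is 2/3 and MRR is 0.75.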

Datasets

CoSQA, CoSQA+_all, CoSQA+_verified, CodeSearchNet, CSN99, CSN Python

Benchmarks

CoSQA+, CSN Python