CoSQA+: a large multi-choice code-search dataset built by test-driven agent annotations (412k pairs, agent accuracy 93.9%)

Overview

Decision Snapshot: Needs Validation

The pipeline is practical: agents reached 93.9% accuracy at a low annotation cost ($0.00062/sample), but query ambiguity and execution constraints prevent fully reliable automation.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 75%

Production readiness: 70%

Novelty: 65%

Authors

Jing Gong, Yanghui Wu, Linxi Liang, Yanlin Wang, Jiachi Chen, Mingwei Liu, Zibin Zheng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CoSQA+ reduces expensive human labeling by using test-driven agents to create a large, functionally verified multi-choice code-search dataset that improves retrieval performance and cuts annotation cost to roughly $0.00062 per sample.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Data Scientist

Summary TLDR

This paper builds CoSQA+, a multi-choice code-search dataset (412,080 query–code pairs) labeled by an automated test-driven agent pipeline. The pipeline generates tests, runs them in Docker, fixes dependency and runtime issues, and uses a final arbiter to decide whether a code snippet matches the query. The agents reach 93.9% accuracy against human expert labels on a 1,000-sample ground truth, and 83.67% of the generated tests execute successfully. Models fine-tuned on CoSQA+ (CodeBERT, UniXcoder, CodeT5+) show consistent MAP@10 and MRR gains on a CSN Python test set compared with the same models fine-tuned on CoSQA.
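Because each query can have several correct snippets, MAP@10 and MRR are computed against a set of relevant codes rather than a single answer. The sketch below is a minimal illustration, not the paper's evaluation script: the function names, the toy rankings, and the choice to normalize AP@k by `min(len(relevant), k)` are all assumptions (some implementations divide by the number of relevant items instead).

```python
from typing import Dict, List, Set


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """AP@k for one query. Normalization by min(|relevant|, k) is an assumed convention."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, code_id in enumerate(ranked[:k], start=1):
        if code_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(relevant), k)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant code, or 0.0 if none is retrieved."""
    for rank, code_id in enumerate(ranked, start=1):
        if code_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(rankings: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int = 10) -> Dict[str, float]:
    """Average both metrics over all labeled queries."""
    queries = list(labels)
    n = len(queries)
    return {
        "map": sum(average_precision_at_k(rankings[q], labels[q], k) for q in queries) / n,
        "mrr": sum(reciprocal_rank(rankings[q], labels[q]) for q in queries) / n,
    }


# Toy data (hypothetical ids): q1 has two correct codes, q2 has one.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
labels = {"q1": {"a", "c"}, "q2": {"y"}}
scores = evaluate(rankings, labels)
print(scores)
```

In the toy data, q1 scores AP = (1/1 + 2/3) / 2 and q2 scores AP = 1/2, so a ranking that surfaces only one of several correct snippets is penalized by MAP@10 even when MRR is perfect.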

Problem Statement

Existing code-search benchmarks treat search as one-to-one and rely on humans or LLM semantics, not functional verification. Real developers often need multiple correct snippets per query (survey: 63.2% of queries). This mismatch harms evaluation and training quality and limits scalability and accuracy.

Main Contribution

CoSQA+ dataset: 412,080 agent-labeled query–code pairs and a 1,000-sample human-verified subset.

A fully automated test-driven annotation pipeline (screener, test generator, executor, bug fixer, arbiter).
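The five stages listed above can be sketched as a single control loop. This is only an illustrative skeleton under stated assumptions: the `Stages` container, the callable signatures, the `max_repairs` limit, and the rule that a screener rejection yields a negative label are hypothetical, and the stand-in stages below replace the LLM and Docker calls the real pipeline would make.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExecutionResult:
    ran: bool      # the test program executed to completion
    passed: bool   # its assertions held
    log: str = ""  # captured output, used by the bug fixer


@dataclass
class Stages:
    # All signatures are assumptions made for this sketch.
    screener: Callable[[str, str], bool]
    test_generator: Callable[[str, str], str]
    executor: Callable[[str, str, List[str]], ExecutionResult]
    bug_fixer: Callable[[str], List[str]]  # log -> extra setup commands
    arbiter: Callable[[str, str, ExecutionResult], bool]


def annotate(query: str, code: str, stages: Stages, max_repairs: int = 2) -> bool:
    """Return True if the code is judged to match the query."""
    if not stages.screener(query, code):
        return False  # assumed: screened-out pairs are labeled as non-matching
    test = stages.test_generator(query, code)
    setup: List[str] = []
    result = stages.executor(code, test, setup)
    for _ in range(max_repairs):
        if result.ran:
            break
        # Ask the fixer for repair/install commands, then rerun the test.
        setup = setup + stages.bug_fixer(result.log)
        result = stages.executor(code, test, setup)
    return stages.arbiter(query, code, result)


# Toy stand-ins: the first run fails on a missing dependency, the rerun succeeds.
executor_calls: List[List[str]] = []


def fake_executor(code: str, test: str, setup: List[str]) -> ExecutionResult:
    executor_calls.append(list(setup))
    if "pip install numpy" not in setup:
        return ExecutionResult(False, False, "ModuleNotFoundError: No module named 'numpy'")
    return ExecutionResult(True, True, "1 passed")


toy = Stages(
    screener=lambda q, c: True,
    test_generator=lambda q, c: "assert mean([1, 2, 3]) == 2",
    executor=fake_executor,
    bug_fixer=lambda log: ["pip install numpy"],
    arbiter=lambda q, c, r: r.ran and r.passed,
)

label = annotate("compute the mean of a list", "def mean(xs): ...", toy)
print(label, executor_calls)
```

Passing the stages in as callables keeps the control flow testable without an LLM or a container runtime; each stand-in can be swapped for a real implementation independently.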

Key Findings

CoSQA+ provides 412,080 labeled query–code pairs.

Numbers: 412,080 pairs; 132,952 unique codes

Practical Use: You can train retrieval models at scale on multi-choice data instead of single-choice pairs.

Evidence Ref: Table II

Test-driven agents label with high agreement to human experts.

Numbers: Agent accuracy 93.9% ±0.020 vs expert ground truth

Practical Use: Automated test-based annotation can replace much manual labeling and scale cheaply.

Evidence Ref: Table IV

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.939	human without tests 0.891	+0.048	1,000-sample ground truth	Agent-based annotator accuracy vs expert majority vote	Table IV
Executable test program rate	83.67%	—	—	CoSQA+ pipeline outputs	Proportion of generated tests that run successfully	Table II

What To Try In 7 Days

Download the CoSQA+ verified subset and evaluate your model on it to compare MAP@10 against your current benchmark.

Fine-tune a small retrieval model (e.g., CodeBERT or UniXcoder) on the CoSQA+ training split and measure MAP@10 and MRR on a held-out CSN Python test set.

Run the open-source test-driven agent on a 100–500 sample subset of your codebase to estimate executable-test rate and annotation cost.
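For the pilot run in the last item, the two numbers to track are simple ratios. The sketch below shows the arithmetic; the pilot figures (300 samples, 240 executable tests, $0.21 spent) are invented for illustration, and the projected total assumes the reported $0.00062 per-sample cost applies uniformly to all 412,080 pairs, which the summary does not state.

```python
from typing import Dict

REPORTED_COST_PER_SAMPLE_USD = 0.00062  # from the summary above
REPORTED_PAIRS = 412_080                # from the summary above


def summarize_pilot(n_samples: int, n_executable: int, total_cost_usd: float) -> Dict[str, float]:
    """Executable-test rate and per-sample cost for a pilot annotation run."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return {
        "executable_rate": n_executable / n_samples,
        "cost_per_sample_usd": total_cost_usd / n_samples,
    }


# Hypothetical pilot numbers, not results from the paper.
pilot = summarize_pilot(n_samples=300, n_executable=240, total_cost_usd=0.21)
print(pilot)

# Back-of-envelope total at the reported rate (assumes a uniform per-sample cost).
projected_total_usd = REPORTED_COST_PER_SAMPLE_USD * REPORTED_PAIRS
print(f"{projected_total_usd:.2f}")  # prints 255.49
```

Comparing the pilot's executable rate with the reported 83.67% gives a quick signal of how much harder your codebase is to execute than the paper's Python snippets.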

Agent Features

Memory

short-term execution traces and logs used per sample

Planning

multi-stage triage (screener → test generation → execution → bug fixing → arbiter)

iterative dependency installation and test rerun

Tool Use

Docker for safe execution

package managers for dependency fixes

LLM calls to synthesize repair/install commands
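As a rough sketch of the Docker step, a generated test file can be run in a throwaway container through the `docker run` CLI. The image name, mount point, memory limit, timeout, and function names here are assumptions chosen for illustration; the paper's actual container setup is not described in this summary.

```python
import subprocess
from pathlib import Path
from typing import List


def build_docker_command(
    workdir: Path,
    test_file: str,
    image: str = "python:3.11-slim",  # assumed base image
    memory: str = "512m",             # assumed resource cap
) -> List[str]:
    """Build a `docker run` invocation that executes one test file."""
    return [
        "docker", "run",
        "--rm",                            # delete the container afterwards
        "--memory", memory,                # cap memory for untrusted code
        "-v", f"{workdir.resolve()}:/work",  # mount the sample directory
        "-w", "/work",                     # run from the mounted directory
        image,
        "python", test_file,
    ]


def run_test_in_docker(workdir: Path, test_file: str, timeout_s: int = 60) -> subprocess.CompletedProcess:
    """Run the test; raises subprocess.TimeoutExpired if it exceeds timeout_s."""
    cmd = build_docker_command(workdir, test_file)
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)


# Building the command needs no Docker daemon, so it is easy to inspect.
cmd = build_docker_command(Path("sample_0001"), "test_sample.py")
print(" ".join(cmd))
```

A non-zero `returncode` together with the captured `stderr` is what a repair step would inspect, for example to detect a missing package before rerunning with an extra install command.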

Frameworks

DeepSeek-V3

custom test generation and arbiter prompts

Is Agentic

Yes

Architectures

LLM-driven modular agent pipeline

embedder ensemble for candidate selection

Collaboration

single automated pipeline with modular stages (not multi-agent negotiation)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/DeepSoftwareAnalytics/CoSQA_Plus

Data URLs

https://github.com/DeepSoftwareAnalytics/CoSQA_Plus

Risks & Boundaries

Limitations

Query ambiguity: vague queries lead to multiple valid interpretations and mislabels.

Test misjudgment risk: correct code outside the 20-candidate set may be missed.

When Not To Use

Datasets with closed-source or non-executable code (no runtime available).

Queries that require design rationale or non-functional aspects (style, performance trade-offs).

Failure Modes

Ambiguous natural language leads to wrong test generation.

Checker and arbiter disagreements when tests pass but intent differs.

Core Entities

Models

CodeBERT, UniXcoder, CodeT5+, jina-embeddings-v3, multilingual-e5-large, all-MiniLM-L12-v2, all-mpnet-base-v2, DeepSeek-V3

Metrics

MAP@10, MRR, NDCG@10, Recall

Datasets

CoSQA, CoSQA+_all, CoSQA+_verified, CodeSearchNet, CSN99, CSN Python

Benchmarks

CoSQA+, CSN Python
