Overview
The pipeline is practical: its agents reach 93.9% labeling accuracy at a low annotation cost ($0.00062 per sample), but query ambiguity and execution constraints still prevent fully reliable automation.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 75%
Production readiness: 70%
Novelty: 65%
Why It Matters For Business
CoSQA+ reduces expensive human labeling by using test-driven agents to create a large, functionally verified multi-choice code-search dataset that improves retrieval performance and cuts annotation cost to roughly $0.00062 per sample.
Summary TLDR
This paper builds CoSQA+, a multi-choice code-search dataset (412,080 query–code pairs) labeled by an automated test-driven agent pipeline. The pipeline generates tests, runs them in Docker, fixes dependency/runtime issues, and uses a final arbiter to decide matches. The agents achieve 93.9% accuracy versus human experts on a 1,000-sample ground truth, and generated tests execute at an 83.67% rate. Models fine-tuned on CoSQA+ (CodeBERT, UniXcoder, CodeT5+) show consistent MAP@10 and MRR gains on a CSN Python test set compared to CoSQA.
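Because each query can have several correct snippets, MAP@10 and MRR have to be computed against a set of relevant items rather than a single gold answer. The following is a minimal sketch of one way to do that, not the paper's evaluation script: the function names and toy data are hypothetical, and normalising AP by `min(len(relevant), k)` is one common convention that may differ from the authors' choice.

```python
from typing import List, Sequence, Set, Tuple


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """AP@k for one query, normalised by min(len(relevant), k) (assumed convention)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, code_id in enumerate(ranked[:k], start=1):
        if code_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(relevant), k)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first correct snippet, or 0.0 if none is retrieved."""
    for rank, code_id in enumerate(ranked, start=1):
        if code_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 10) -> Tuple[float, float]:
    """Return (MAP@k, MRR) averaged over all queries."""
    ap = [average_precision_at_k(ranked, rel, k) for ranked, rel in runs]
    rr = [reciprocal_rank(ranked, rel) for ranked, rel in runs]
    return sum(ap) / len(ap), sum(rr) / len(rr)


# Hypothetical toy data: (ranked code ids, set of correct code ids) per query.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # two correct snippets for this query
    (["x", "b"], {"b"}),                 # one correct snippet, ranked second
]
map_at_10, mrr = evaluate(runs)
```

In the toy data the first query has two correct snippets, which is the multi-choice case that a one-to-one benchmark cannot score.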
Problem Statement
Existing code-search benchmarks treat search as a one-to-one task and rely on human or LLM judgments of semantics rather than functional verification. Real developers often need multiple correct snippets per query (survey: 63.2% of queries). This mismatch lowers evaluation and training quality and limits both scalability and accuracy.
Main Contribution
CoSQA+ dataset: 412,080 agent-labeled query–code pairs and a 1,000-sample human-verified subset.
A fully automated test-driven annotation pipeline (screener, test generator, executor, bug fixer, arbiter).
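The five stages named above can be pictured as a small control loop. This is a sketch under stated assumptions, not the authors' implementation: every name here (`Stages`, `label_pair`, `max_fix_rounds`, the toy stage functions) is hypothetical, the screener is assumed to filter pairs before any test is written, and an in-process `exec` stands in for the Docker executor only so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExecResult:
    ok: bool       # the test program ran to completion
    passed: bool   # its assertions held
    log: str = ""


@dataclass
class Stages:
    screener: Callable[[str, str], bool]
    test_generator: Callable[[str, str], str]
    executor: Callable[[str, str], ExecResult]
    bug_fixer: Callable[[str, str, str], str]  # (code, test, log) -> repaired test
    arbiter: Callable[[str, str, Optional[ExecResult]], bool]


def label_pair(query: str, code: str, stages: Stages, max_fix_rounds: int = 2) -> bool:
    """Return True if the code is judged to answer the query."""
    if not stages.screener(query, code):
        return False
    test = stages.test_generator(query, code)
    result = stages.executor(code, test)
    rounds = 0
    # Retry only when the test could not run (e.g. a missing import).
    while not result.ok and rounds < max_fix_rounds:
        test = stages.bug_fixer(code, test, result.log)
        result = stages.executor(code, test)
        rounds += 1
    return stages.arbiter(query, code, result if result.ok else None)


def toy_executor(code: str, test: str) -> ExecResult:
    """Stand-in for the Docker executor. Never exec untrusted code like this."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(test, namespace)
    except AssertionError as err:
        return ExecResult(ok=True, passed=False, log=repr(err))
    except Exception as err:
        return ExecResult(ok=False, passed=False, log=repr(err))
    return ExecResult(ok=True, passed=True)


toy = Stages(
    screener=lambda query, code: True,
    # The generated test deliberately forgets `import math`.
    test_generator=lambda query, code: "assert math.isclose(add(0.1, 0.2), 0.3)",
    executor=toy_executor,
    bug_fixer=lambda code, test, log: "import math\n" + test,
    arbiter=lambda query, code, result: result is not None and result.passed,
)

good = label_pair("add two numbers", "def add(a, b):\n    return a + b", toy)
bad = label_pair("add two numbers", "def add(a, b):\n    return a - b", toy)
```

The toy run exercises the repair path: the first test fails with a `NameError`, the bug fixer prepends the import, and the arbiter then sees a clean pass for the correct snippet and a failed assertion for the wrong one.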
Key Findings
CoSQA+ provides 412,080 labeled query–code pairs.
Test-driven agents label pairs with high agreement with human experts.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 0.939 | human without tests 0.891 | +0.048 | 1,000-sample ground truth | Agent-based annotator accuracy vs expert majority vote | Table IV |
| Executable test program rate | 83.67% | — | — | CoSQA+ pipeline outputs | Proportion of generated tests that run successfully | Table II |
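The accuracy row compares annotator labels with an expert majority vote, and the delta is the difference between the two accuracies in the table. A minimal sketch of that arithmetic follows; the helper names and toy votes are hypothetical, and three experts per sample is an assumption made only to avoid ties.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Most common label among the expert votes (assumes an odd number of votes)."""
    return Counter(votes).most_common(1)[0][0]


def accuracy(predicted: Sequence[int], expert_votes: List[Sequence[int]]) -> float:
    """Share of predictions that agree with the expert majority."""
    gold = [majority_vote(votes) for votes in expert_votes]
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy data: 1 = code matches the query, 0 = it does not.
expert_votes = [[1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
agent_labels = [1, 0, 0, 0]
toy_accuracy = accuracy(agent_labels, expert_votes)

# Delta reported in the table: agents (0.939) vs. humans without tests (0.891).
delta = round(0.939 - 0.891, 3)
```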
What To Try In 7 Days
Download the CoSQA+ human-verified subset and evaluate your model on it to compare MAP@10 against your current benchmark.
Fine-tune a small retrieval model (e.g., CodeBERT or UniXcoder) on the CoSQA+ training split and measure MAP@10 and MRR on a held-out CSN Python test set.
Run the open-source test-driven agent on a 100–500 sample subset of your codebase to estimate executable-test rate and annotation cost.
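For the pilot run in the last item, the two numbers to record are the share of generated tests that execute and the projected labeling cost. A minimal sketch, assuming you log one boolean per sample for "test ran successfully"; the function names and the toy outcomes are hypothetical, and the default per-sample cost is simply the $0.00062 figure reported above, which may not transfer to your codebase or model pricing.

```python
from typing import Sequence

REPORTED_COST_PER_SAMPLE = 0.00062  # USD, the figure reported for CoSQA+


def executable_test_rate(ran_ok: Sequence[bool]) -> float:
    """Fraction of generated test programs that ran successfully."""
    return sum(ran_ok) / len(ran_ok)


def estimated_cost(n_samples: int, cost_per_sample: float = REPORTED_COST_PER_SAMPLE) -> float:
    """Projected annotation cost in USD for n_samples."""
    return n_samples * cost_per_sample


# Hypothetical pilot: four samples, one test failed to run.
pilot_outcomes = [True, True, False, True]
pilot_rate = executable_test_rate(pilot_outcomes)
pilot_cost = estimated_cost(500)
```

Comparing `pilot_rate` with the 83.67% reported for CoSQA+ gives a rough sense of how much of your code the executor can actually run.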
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Query ambiguity: vague queries lead to multiple valid interpretations and mislabels.
Test misjudgment risk: correct code outside the 20-candidate set may be missed.
When Not To Use
Datasets with closed-source or non-executable code (no runtime available).
Queries that require design rationale or non-functional aspects (style, performance trade-offs).
Failure Modes
Ambiguous natural language leads to wrong test generation.
Checker and arbiter disagreements when tests pass but intent differs.

