Overview
Production Readiness: 0.7
Novelty Score: 0.65
Cost Impact Score: 0.75
Citation Count: 2
Why It Matters For Business
CoSQA+ replaces expensive human labeling with test-driven agents that build a large, functionally verified multi-choice code-search dataset. Training on it improves retrieval performance, and annotation costs roughly $0.00062 per sample.
Summary TLDR
This paper builds CoSQA+, a multi-choice code-search dataset (412,080 query–code pairs) labeled by an automated test-driven agent pipeline. The pipeline generates tests, runs them in Docker, fixes dependency and runtime issues, and uses a final arbiter to decide whether each code snippet matches its query. The agents agree with human experts on 93.9% of a 1,000-sample ground truth, and 83.67% of the generated test programs execute. Models fine-tuned on CoSQA+ (CodeBERT, UniXcoder, CodeT5+) show consistent MAP@10 and MRR gains on a CSN Python test set compared with the same models fine-tuned on CoSQA.
Problem Statement
Existing code-search benchmarks treat search as a one-to-one mapping and rely on human or LLM semantic judgments rather than functional verification. Real developers often need multiple correct snippets per query (63.2% of queries, according to the survey). This mismatch degrades both evaluation and training quality, and manual labeling limits how far such benchmarks can scale.
Main Contribution
CoSQA+ dataset: 412,080 agent-labeled query–code pairs and a 1,000-sample human-verified subset.
A fully automated test-driven annotation pipeline (screener, test generator, executor, bug fixer, arbiter).
Demonstrated 93.9% agent accuracy against human experts and an 83.67% executable test program rate.
Empirical evidence that training on CoSQA+ improves MAP@10 and MRR over CoSQA.
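The pipeline stages named above (screener, test generator, executor, bug fixer, arbiter) can be pictured as a chain of callables. The following is a minimal sketch, not the authors' implementation: the stage names come from the summary, but `label_pair`, `RunResult`, every function signature, and the `max_fixes` limit are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    """Hypothetical record of one test execution."""
    passed: bool
    stderr: str = ""


def label_pair(
    query: str,
    code: str,
    screener: Callable[[str, str], bool],
    gen_test: Callable[[str, str], str],
    execute: Callable[[str, str], RunResult],
    fix: Callable[[str, str, RunResult], Optional[str]],
    arbiter: Callable[[str, str, RunResult], bool],
    max_fixes: int = 3,  # assumed retry budget, not taken from the source
) -> bool:
    """Return True if `code` is judged a functional match for `query`."""
    # Stage 1: cheap filter that drops obviously unrelated pairs.
    if not screener(query, code):
        return False
    # Stage 2: synthesize a test program for the query.
    test = gen_test(query, code)
    # Stage 3: run it (the source uses Docker; here it is an injected callable).
    result = execute(code, test)
    # Stage 4: on failure, ask the fixer for a repaired test program and rerun.
    attempts = 0
    while not result.passed and attempts < max_fixes:
        repaired = fix(code, test, result)
        if repaired is None:  # fixer gave up
            break
        test = repaired
        result = execute(code, test)
        attempts += 1
    # Stage 5: the arbiter makes the final decision from the execution evidence.
    return arbiter(query, code, result)


# Usage with trivial stand-in stages (all stubs are hypothetical).
verdict = label_pair(
    "reverse a string",
    "def rev(s): return s[::-1]",
    screener=lambda q, c: True,
    gen_test=lambda q, c: "assert rev('ab') == 'ba'",
    execute=lambda c, t: RunResult(passed=True),
    fix=lambda c, t, r: None,
    arbiter=lambda q, c, r: r.passed,
)
```

Passing the stages in as callables keeps the control flow testable without an LLM or a container runtime; in a real setup each callable would wrap a model call or a sandboxed execution.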
Key Findings
CoSQA+ provides 412,080 labeled query–code pairs.
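Test-driven agents agree with human experts on 93.9% of a 1,000-sample ground truth.
Most generated test programs run successfully (83.67% executable rate).
Models fine-tuned on CoSQA+ achieve higher MAP@10 and MRR than models fine-tuned on CoSQA.
Developers often need multiple correct snippets per query (63.2% of surveyed queries).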
Results
Accuracy: 93.9% (agents versus human experts, 1,000-sample ground truth)
Executable test program rate: 83.67%
Dataset size: 412,080 query–code pairs
CodeBERT MAP@10 (fine-tuned)
CodeBERT MRR (fine-tuned)
Who Should Care
What To Try In 7 Days
Download the CoSQA+ verified subset and evaluate your model on it to compare MAP@10 against your current benchmark.
Fine-tune a small retrieval model (e.g., CodeBERT or UniXcoder) on the CoSQA+ training split and measure MAP@10 and MRR on a held-out CSN Python test set.
Run the open-source test-driven agent on a 100–500 sample subset of your codebase to estimate the executable-test rate and annotation cost.
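For the pilot run in the last item, the two numbers of interest reduce to simple ratios over per-sample records. The sketch below assumes a hypothetical `PilotRecord` format (whether the generated test executed, and the spend attributed to that sample); the real agent's output format may differ.

```python
from dataclasses import dataclass


@dataclass
class PilotRecord:
    executed: bool   # did the generated test program run to completion?
    cost_usd: float  # spend attributed to this sample (hypothetical field)


def summarize_pilot(records: list[PilotRecord]) -> dict[str, float]:
    """Compute the executable-test rate and mean annotation cost per sample."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    executed = sum(1 for r in records if r.executed)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "executable_rate": executed / n,
        "cost_per_sample_usd": total_cost / n,
    }
```

Comparing `executable_rate` against the reported 83.67% and `cost_per_sample_usd` against the reported $0.00062 gives a quick sense of how far your codebase deviates from the paper's setting.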
Agent Features
Memory
- short-term execution traces and logs used per sample
Planning
- multi-stage triage (screener → test generation → execution → bug fixing → arbiter)
- iterative dependency installation and test rerun
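The install-and-rerun loop can be sketched as a bounded retry. Note the substitution: the source says an LLM synthesizes the repair and install commands, whereas this illustration uses a regular expression on Python's `ModuleNotFoundError` message. The function name, the callables, and the `max_rounds` budget are all assumptions.

```python
import re
from typing import Callable, List, Tuple

# Matches the message Python prints when an import cannot be resolved.
_MISSING = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")


def run_with_dependency_fixes(
    run_test: Callable[[], Tuple[int, str]],  # returns (exit_code, stderr)
    install: Callable[[str], bool],           # returns True if install succeeded
    max_rounds: int = 3,                      # assumed retry budget
) -> Tuple[bool, List[str]]:
    """Rerun a test, installing one missing module per failed round."""
    installed: List[str] = []
    for round_no in range(max_rounds + 1):
        exit_code, stderr = run_test()
        if exit_code == 0:
            return True, installed
        if round_no == max_rounds:
            break  # budget exhausted; do not install without a rerun
        match = _MISSING.search(stderr)
        if match is None:
            return False, installed  # not a missing-dependency failure
        # Use the top-level package; the import name may differ from the
        # distribution name on the package index (e.g. yaml vs. PyYAML).
        module = match.group(1).split(".")[0]
        if module in installed or not install(module):
            return False, installed
        installed.append(module)
    return False, installed
```

The loop stops early when the failure is not a missing import or when the same module is requested twice, which avoids spinning on errors that installation cannot fix.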
Tool Use
- Docker for safe execution
- package managers for dependency fixes
- LLM calls to synthesize repair/install commands
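As an illustration of the Docker-based execution step, the sketch below builds a `docker run` command that mounts a working directory read-only and runs a test script inside a throwaway container. The image tag, script name, mount point, and timeout are assumptions; the source does not specify the container configuration.

```python
import subprocess
from pathlib import Path


def docker_command(
    workdir: Path,
    script: str = "test_program.py",   # hypothetical script name
    image: str = "python:3.11-slim",   # assumed base image
    network: bool = False,
) -> list[str]:
    """Build the argv for running `script` inside an isolated container."""
    cmd = ["docker", "run", "--rm"]  # --rm removes the container on exit
    if not network:
        # Disable networking for untrusted code; dependency installation
        # would need network=True (or a pre-built image).
        cmd += ["--network", "none"]
    cmd += [
        "-v", f"{workdir.resolve()}:/work:ro",  # read-only bind mount
        "-w", "/work",
        image, "python", script,
    ]
    return cmd


def run_in_docker(workdir: Path, timeout_s: int = 60) -> subprocess.CompletedProcess:
    """Execute the test program and capture its output (requires Docker)."""
    return subprocess.run(
        docker_command(workdir),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
```

Separating command construction from execution makes the isolation settings easy to inspect and unit-test on machines where Docker is not installed.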
Frameworks
- DeepSeek-V3
- custom test generation and arbiter prompts
Is Agentic
true
Architectures
- LLM-driven modular agent pipeline
- embedder ensemble for candidate selection
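The summary does not say how the embedder ensemble combines its members. One plausible reading, shown here purely as an assumption, is to average cosine similarity across embedders and keep the top 20 candidates per query (20 being the candidate-set size mentioned under Limitations). The `Embedder` type and `top_k_candidates` function are hypothetical.

```python
import math
from typing import Callable, List, Sequence

# An embedder maps text to a vector; real ones would wrap a model.
Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_candidates(
    query: str,
    codes: List[str],
    embedders: List[Embedder],
    k: int = 20,
) -> List[int]:
    """Return indices of the k codes with the highest mean similarity."""
    if not embedders:
        raise ValueError("need at least one embedder")
    scores = [0.0] * len(codes)
    for embed in embedders:
        query_vec = embed(query)
        for i, code in enumerate(codes):
            scores[i] += cosine(query_vec, embed(code)) / len(embedders)
    ranked = sorted(range(len(codes)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Averaging scores is only one option; rank fusion or a union of per-embedder top lists would be equally reasonable, and the choice affects which correct snippets fall outside the candidate set.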
Collaboration
- single automated pipeline with modular stages (not multi-agent negotiation)
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Query ambiguity: vague queries admit multiple valid interpretations and can cause mislabels.
- Test misjudgment risk: correct code outside the 20-candidate set may be missed.
- Dependence on runnable code: the code must execute in an isolated environment.
- Partial human verification: only 1,000 pairs were human-verified because of cost.
When Not To Use
- Datasets with closed-source or non-executable code (no runtime available).
- Queries that require design rationale or non-functional aspects (style, performance trade-offs).
- Contexts that demand exhaustive human-level semantic nuance.
Failure Modes
- Ambiguous natural language leads to wrong test generation.
- Checker and arbiter disagreements when tests pass but intent differs.
- Dependency installation failures that cannot be fixed automatically.
- Human labeling bias remains in the gold standard.
Core Entities
Models
- CodeBERT
- UniXcoder
- CodeT5+
- jina-embeddings-v3
- multilingual-e5-large
- all-MiniLM-L12-v2
- all-mpnet-base-v2
- DeepSeek-V3
Metrics
- MAP@10
- MRR
- NDCG@10
- Recall
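Because CoSQA+ is multi-choice, each query can have several relevant snippets, which is where MAP@10 and MRR differ. The sketch below shows one common way to compute both. The normalization of AP@k by min(number of relevant items, k) is an assumption; the paper's exact convention is not stated in this summary.

```python
from typing import Iterable, List, Set, Tuple


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """AP@k: mean of precision values at each rank where a relevant item appears."""
    hits = 0
    score = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)  # assumed normalization convention
    return score / denom if denom else 0.0


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Iterable[Tuple[List[str], Set[str]]], k: int = 10) -> Tuple[float, float]:
    """Return (MAP@k, MRR) averaged over (ranked list, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        raise ValueError("no queries to evaluate")
    map_k = sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
    return map_k, mrr
```

MRR only rewards the first correct hit, while MAP@10 rewards ranking all correct snippets near the top, which makes it the more informative metric for multi-choice search.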
Datasets
- CoSQA
- CoSQA+_all
- CoSQA+_verified
- CodeSearchNet
- CSN99
- CSN Python
Benchmarks
- CoSQA+
- CSN Python

