Overview
Agent frameworks show strong practical gains on multi-step, repo-level tasks but rely on custom datasets and early-stage systems; validate with small pilots before rollout.
Citations: 15
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 45%
Why It Matters For Business
LLM-based agents enable higher automation for multi-step engineering tasks (planning, tool use, testing), and they often raise pass rates and reduce human iteration. Single LLMs remain cheaper for isolated code generation or simple analysis.
Who Should Care
Summary TLDR
This survey reviews 139 papers (late 2023–2024) comparing standard large language models (LLMs) and LLM-based agents across six software engineering areas: requirements, code generation, autonomous decision-making, design/evaluation, test generation, and security/maintenance. It maps tasks, benchmarks, metrics, and models; proposes agent criteria (decision core, tool use, planning, evaluation, multi-turn context, learning); and shows agent systems scale better on multi-step workflows (multi-agent role division, tool integration, memory) while single LLMs remain cost-effective for isolated tasks. The paper highlights gaps: no unified agent standard, sparse interactive benchmarks, and the need
Problem Statement
Practitioners and researchers lack a clear, unified view of when a large language model is just a powerful generator and when it qualifies as an LLM-based agent (a system that plans, decides, uses tools, evaluates solutions, keeps context, and learns). This ambiguity blocks standardized benchmarks, fair comparisons, and practical adoption in software engineering workflows.
Main Contribution
Collected and analyzed 139 papers (DBLP + arXiv) on LLM and LLM-based agent use in software engineering (six topic areas).
Defined practical criteria to classify an LLM architecture as an ‘agent’ (brain + planning + autonomous tool use + evaluation + multi-turn + learning).
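The six criteria above can be written down as a small checklist. This is a minimal sketch, not the survey's own tooling: the class name `AgentCriteria`, its field names, and the rule that all six criteria must hold are assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class AgentCriteria:
    """Hypothetical checklist mirroring the six criteria listed in the survey."""
    decision_core: bool        # the LLM acts as the "brain"
    planning: bool
    autonomous_tool_use: bool
    evaluation: bool
    multi_turn_context: bool
    learning: bool

    def missing(self) -> List[str]:
        """Names of the criteria the system does not satisfy."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_agent(self) -> bool:
        # Assumption: every criterion is required; the survey may weigh them differently.
        return not self.missing()


plain_llm = AgentCriteria(True, False, False, False, False, False)
full_agent = AgentCriteria(True, True, True, True, True, True)
print(plain_llm.is_agent(), plain_llm.missing())
print(full_agent.is_agent())
```

A checklist like this makes it easier to record which capability a candidate system lacks before comparing it against agent baselines.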
Key Findings
Survey corpus and venue split: the review covers 139 papers, many of which are preprints.
In the AISD study, an agent-style framework reached a 75.2% use case pass rate versus a 24.1% baseline (+51.1 pp) [79].
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Use case pass rate (AISD) | 75.2% with AISD vs 24.1% baseline | no human involvement | +51.1 pp | CAASD / AISD study | Section IV.B describes AISD experiment increasing pass rates with agent loop | [79] |
| Pass@1 (L2MAC) | 90.2% Pass@1 on HumanEval | GPT-4 / Reflexion comparisons | reported as SOTA in paper | HumanEval | Section V.B L2MAC claims strong HumanEval performance | [104] |
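The two metrics in the table can be reproduced with a few lines of arithmetic. The sketch below computes a percentage-point delta and the widely used unbiased pass@k estimator that was introduced alongside HumanEval. The function names are illustrative, and the cited papers may compute their numbers differently.

```python
from math import comb


def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(value_pct - baseline_pct, 1)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per task, c of them correct, budget k.

    Computes 1 - C(n - c, k) / C(n, k). math.comb returns 0 when k > n - c,
    so the result is 1.0 whenever fewer than k samples are incorrect.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# AISD row from the table: 75.2% vs 24.1% baseline.
print(delta_pp(75.2, 24.1))
# For k = 1 the estimator reduces to the fraction of correct samples, c / n.
print(pass_at_k(n=10, c=3, k=1))
```

Reporting the delta in percentage points rather than as a relative change keeps it comparable with the table's Delta column.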
What To Try In 7 Days
Run a small agent pilot: pick one repo task (e.g., implement feature + tests), compose a two-agent pipeline (planner + coder), and measure pass@1 against a single LLM.
Add RAG to a code-search workflow to serve large codebases while controlling token cost.
Benchmark current workflows on HumanEval/MBPP and one repo-specific test suite to measure gains from iteration or tool execution loops.
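The pilot in the first item can be sketched as a small harness that scores a single generator and a planner + coder pipeline on the same tasks. Everything here is hypothetical scaffolding: the `Generator` interface, the `Task` type, and the stub functions stand in for real model calls, and the stubs return fixed strings, so the printed scores say nothing about real models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Generator = Callable[[str], str]  # prompt -> Python source (assumed interface)


@dataclass
class Task:
    prompt: str
    check: Callable[[Dict[str, object]], bool]  # inspects the namespace after exec


def passes(source: str, task: Task) -> bool:
    """Run the generated source and apply the task's check; any error is a fail."""
    namespace: Dict[str, object] = {}
    try:
        exec(source, namespace)  # for real model output, run this in a sandbox
        return bool(task.check(namespace))
    except Exception:
        return False


def pass_at_1(generate: Generator, tasks: List[Task]) -> float:
    """Fraction of tasks solved with one attempt each."""
    return sum(passes(generate(t.prompt), t) for t in tasks) / len(tasks)


def make_pipeline(planner: Callable[[str], str], coder: Generator) -> Generator:
    """Two-agent pipeline: the planner's output is appended to the coder's prompt."""
    def run(prompt: str) -> str:
        plan = planner(prompt)
        return coder(f"{prompt}\n\nPlan:\n{plan}")
    return run


# Stubs with fixed outputs, standing in for real LLM calls.
def stub_single(prompt: str) -> str:
    return "def add(a, b):\n    return a - b\n"


def stub_planner(prompt: str) -> str:
    return "1. Return the sum of both arguments."


def stub_coder(prompt: str) -> str:
    if "Plan:" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return stub_single(prompt)


tasks = [Task("Write add(a, b).", lambda ns: ns["add"](2, 3) == 5)]
single_score = pass_at_1(stub_single, tasks)
pipeline_score = pass_at_1(make_pipeline(stub_planner, stub_coder), tasks)
print(single_score, pipeline_score)
```

Keeping both systems behind the same `Generator` signature means the same task list and scoring function apply to each, which is what makes the comparison fair.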
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
No unified definition or standard benchmark to qualify an LLM as an agent.
Many agent studies use custom or small datasets, limiting generalizability.
When Not To Use
Do not use a full multi-agent setup for single-shot tasks or simple code snippets; a single LLM is cheaper and simpler.
Avoid deploying autonomous agents on critical systems before rigorous sandboxed validation and human oversight.
Failure Modes
Hallucination leading to incorrect or insecure code even after multi-agent steps.
Tool-call brittleness and unrecoverable side-effects if tool interfaces are not sandboxed.
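One way to limit the tool-call failure mode is to route every tool invocation through a wrapper that bounds runtime and confines file writes to a scratch directory. The sketch below is an assumption-laden illustration, not a method from the survey: `run_tool` and `ToolResult` are hypothetical names, and a temporary directory plus a timeout is containment, not a security sandbox.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from typing import List


@dataclass
class ToolResult:
    ok: bool
    output: str


def run_tool(argv: List[str], timeout_s: float = 5.0) -> ToolResult:
    """Run a tool in a throwaway working directory with a hard time limit."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                argv,
                cwd=workdir,          # relative writes land in the scratch dir
                capture_output=True,
                text=True,
                timeout=timeout_s,    # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return ToolResult(False, f"timed out after {timeout_s}s")
        except OSError as exc:
            return ToolResult(False, f"could not start tool: {exc}")
    return ToolResult(proc.returncode == 0, proc.stdout + proc.stderr)


result = run_tool([sys.executable, "-c", "print('hi')"])
print(result.ok, result.output.strip())
```

Returning a structured result instead of raising lets the agent loop treat a failed or hung tool as ordinary feedback. Real deployments would still need process isolation (containers, restricted users, network limits) and human oversight.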

