Overview
Production Readiness
0.6
Novelty Score
0.45
Cost Impact Score
0.6
Citation Count
15
Why It Matters For Business
LLM-based agents automate more of multi-step engineering tasks (planning, tool use, testing), often raising pass rates and reducing human iteration; single LLMs remain cheaper for isolated code generation or simple analysis.
Summary TLDR
This survey reviews 139 papers (late 2023–2024) comparing standard large language models (LLMs) and LLM-based agents across six software engineering areas: requirements, code generation, autonomous decision-making, design/evaluation, test generation, and security/maintenance. It maps tasks, benchmarks, metrics, and models; proposes agent criteria (decision core, tool use, planning, evaluation, multi-turn context, learning); and shows agent systems scale better on multi-step workflows (multi-agent role division, tool integration, memory) while single LLMs remain cost-effective for isolated tasks. The paper highlights gaps: no unified agent standard, sparse interactive benchmarks, and the need
Problem Statement
Practitioners and researchers lack a clear, unified view of when a large language model is just a powerful generator and when it qualifies as an LLM-based agent (a system that plans, decides, uses tools, evaluates solutions, keeps context, and learns). This ambiguity blocks standardized benchmarks, fair comparisons, and practical adoption in software engineering workflows.
Main Contribution
Collected and analyzed 139 papers (DBLP + arXiv) on LLM and LLM-based agent use in software engineering (six topic areas).
Defined practical criteria to classify an LLM architecture as an ‘agent’ (brain + planning + autonomous tool use + evaluation + multi-turn + learning).
Summarized tasks, datasets, benchmarks, and metrics for each SE domain and contrasted single LLMs vs agentic systems.
Cataloged common models, tool integrations, and agent frameworks (MetaGPT, SWE-agent, CodeAgent, ExpeL, Reflexion).
Outlined gaps: missing standardized agent benchmarks, dataset scarcity for interactive workflows, and evaluation fragmentation.
Key Findings
Survey corpus and venue split: the review covers 139 papers, many of them preprints.
An agent-style framework can dramatically raise task pass rates in user studies.
Agent systems can match or exceed top results on code benchmarks by orchestrating tools and roles.
Retrieval and tool use remain cost-effective complements to longer-context LLMs.
Benchmarks for agent behavior and interactive workflows are scarce and fragmented.
Results
Use case pass rate (AISD)
Pass@1 (L2MAC)
HumanEval-ET pass@1 (AgentCoder)
TICODER user-study correctness
Reflexion first-pass success improvement
Who Should Care
What To Try In 7 Days
Run a small agent pilot: pick one repo task (e.g., implement a feature plus tests), compose a two-agent pipeline (planner + coder), and compare its pass@1 against a single-LLM baseline.
Add RAG to a code-search workflow to serve large codebases while controlling token cost.
Benchmark current workflows on HumanEval/MBPP and one repo-specific test suite to measure gains from iteration or tool execution loops.
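A minimal sketch of the first pilot, assuming each model is exposed as a plain prompt-to-text callable. The names `LLM`, `single_llm`, `planner_coder`, `passes`, and `pass_at_1`, the prompt wording, and the task dictionary layout are all hypothetical and not taken from the survey. The harness runs generated code with `exec`, which is only acceptable for a local experiment with trusted stubs; real model output should run in a sandbox.

```python
from typing import Callable, Dict, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def single_llm(task: str, llm: LLM) -> str:
    """Baseline: one model call, no planning step."""
    return llm(f"Write Python code for this task:\n{task}")


def planner_coder(task: str, planner: LLM, coder: LLM) -> str:
    """Two-agent pipeline: the planner decomposes, the coder implements."""
    plan = planner(f"List the implementation steps for:\n{task}")
    return coder(f"Task:\n{task}\n\nPlan:\n{plan}\n\nWrite Python code.")


def passes(code: str, test: str) -> bool:
    """Run candidate code, then its test, in one fresh namespace.

    WARNING: exec is unsafe for untrusted model output; use a sandbox.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)
        exec(test, namespace)
        return True
    except Exception:
        return False


def pass_at_1(solve: Callable[[str], str], tasks: List[Dict[str, str]]) -> float:
    """Fraction of tasks whose first generated solution passes its test."""
    if not tasks:
        return 0.0
    passed = sum(passes(solve(t["prompt"]), t["test"]) for t in tasks)
    return passed / len(tasks)
```

To compare the two setups, call `pass_at_1` once with `lambda t: single_llm(t, model)` and once with `lambda t: planner_coder(t, planner, coder)` over the same task list, where `model`, `planner`, and `coder` wrap whatever API client is in use.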
Agent Features
Memory
- experience pools (ExpeL)
- shared databases / retrieval memories (RAG/Graph-RAG)
- dynamic code graphs (DCGG)
Planning
- ReAct (reason+act) style planning
- explicit planning agent (step decomposition)
- sprint/Agile-driven planning modules
Tool Use
- autonomous tool selection and API calls
- specialized agent-tool interfaces (SWE-agent ACI)
- executable code actions (CodeAct)
Frameworks
- MetaGPT
- AgentCoder
- CodeAgent
- ExpeL
- Reflexion
- SWE-agent
- AGILECoder
Is Agentic
true
Architectures
- single-agent (LLM core)
- multi-agent role-based pipelines
- hierarchical agents (manager + workers)
Collaboration
- role division (retrieval/planning/coding/debugging)
- voting and multi-agent discussion
- debate/verifier mechanisms (MAD)
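The voting mechanism listed above can be illustrated with a simple majority rule. This is a sketch under stated assumptions: the function name `majority_vote`, the whitespace normalization, and the `min_share` threshold are illustrative choices, not a mechanism described by any specific framework in the survey.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[str], min_share: float = 0.5) -> Optional[str]:
    """Return the most common answer if its vote share exceeds min_share.

    Returns None when there are no answers or no answer is strong enough,
    so the caller can escalate to a discussion round or a human reviewer.
    """
    counts = Counter(a.strip() for a in answers)
    if not counts:
        return None
    answer, votes = counts.most_common(1)[0]
    return answer if votes / len(answers) > min_share else None
```

Returning `None` on a tie or weak plurality keeps the decision explicit instead of silently picking one agent's output.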
Optimization Features
Token Efficiency
- RAG to reduce long-context token use
- selective retrieval + context compression
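Selective retrieval under a token budget might look like the sketch below. It assumes a crude lexical-overlap score and counts tokens by whitespace splitting; a real system would typically use embeddings and the model's own tokenizer. The names `score` and `select_context` are hypothetical.

```python
from typing import List


def score(query: str, chunk: str) -> float:
    """Share of query words that also appear in the chunk (lexical overlap)."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


def select_context(query: str, chunks: List[str], token_budget: int) -> List[str]:
    """Greedily keep the best-scoring chunks that fit in the token budget."""
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    selected: List[str] = []
    used = 0
    for ch in ranked:
        if score(query, ch) == 0.0:
            break  # remaining chunks share no words with the query
        cost = len(ch.split())  # assumption: one whitespace word ~ one token
        if used + cost > token_budget:
            continue  # too large; a smaller relevant chunk may still fit
        selected.append(ch)
        used += cost
    return selected
```

Only the selected chunks are placed in the prompt, so prompt size is bounded by `token_budget` regardless of codebase size.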
Infra Optimization
- containerized safe runtimes (GoEx)
- agent-friendly ACI shells to reduce parsing overhead
Model Optimization
- LoRA
- instruction fine-tuning for task alignment
System Optimization
- role specialization to limit per-agent context
- dynamic agent scaling (SoA self-organized agents)
Training Optimization
- Noisy embedding fine-tuning (NEFTune) to reduce overfitting
- experience replay via language feedback (Reflexion/ExpeL)
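The language-feedback idea can be sketched as a retry loop in which each failed attempt contributes a textual note that is passed to the next attempt. This is a simplified illustration in the spirit of that approach, not the Reflexion or ExpeL implementation; `reflexive_solve`, `generate`, and `evaluate` are hypothetical names, and the caller supplies both callables.

```python
from typing import Callable, List, Tuple


def reflexive_solve(
    task: str,
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    max_trials: int = 3,
) -> Tuple[str, List[str]]:
    """Retry a task, feeding accumulated feedback into each new attempt.

    generate(task, reflections) produces an attempt from the task and all
    feedback so far; evaluate(attempt) returns (success, feedback_text).
    Returns the last attempt and the list of feedback notes collected.
    """
    reflections: List[str] = []
    attempt = ""
    for _ in range(max_trials):
        attempt = generate(task, reflections)
        ok, feedback = evaluate(attempt)
        if ok:
            break
        reflections.append(feedback)  # memory that persists across trials
    return attempt, reflections
```

In a coding setting, `evaluate` would usually run the test suite and return the failure message as feedback.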
Inference Optimization
- batch prompting and batched API calls
- tool invocation to limit token footprint (Toolformer)
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- No unified definition or standard benchmark to qualify an LLM as an agent.
- Many agent studies use custom or small datasets, limiting generalizability.
- Agent pipelines add orchestration complexity and can cascade errors without guardrails.
- Open-source models usually lag behind proprietary models in raw capability; agents help but add engineering cost.
When Not To Use
- Do not build a full multi-agent pipeline for single-shot tasks or simple code snippets; a single LLM is cheaper and simpler.
- Avoid deploying autonomous agents on critical systems before rigorous sandboxed validation and human oversight.
- Skip agent pipelines when token cost or latency constraints dominate (e.g., low-latency UI helpers).
Failure Modes
- Hallucination leading to incorrect or insecure code even after multi-agent steps.
- Tool-call brittleness and unrecoverable side-effects if tool interfaces are not sandboxed.
- Error propagation across agents due to inconsistent shared context.
- Benchmarks and evaluation blind spots: passing unit tests despite semantic or security defects.
Core Entities
Models
- GPT-4
- GPT-3.5 (ChatGPT)
- Codex
- CodeLlama
- LLaMA
- CodeGen
- CodeT5+
- WizardCoder
- StarCoder
Metrics
- Pass@k
- Accuracy
- F1 Score
- Precision/Recall
- Pass@1
- Execution Rate
- Use Case Pass Rate
- Human Likert scores (RUST)
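For reference, Pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass, and estimate the probability that at least one of k randomly drawn samples passes as 1 - C(n-c, k) / C(n, k). The sketch below assumes that definition; the survey itself does not specify which estimator each cited paper uses.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: draw size (1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this value over all problems; with k = 1 it reduces to c / n.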
Datasets
- HumanEval
- MBPP
- HumanEval-ET
- MBPP-ET
- Defects4J
- CAASD
- EvalGPTFix
- SWE-bench
Benchmarks
- HumanEval
- MBPP
- Defects4J
- ToolBench
- APIBench
- HotpotQA
- FEVER
Context Entities
Models
- Gemini
- PaLM
- Claude
- LLaMA2
- CodeGeeX
Metrics
- Win Rate
- Success Rate
- Cost / Token Consumption
- Execution Effectiveness
- Human revision cost
Datasets
- BigCloneBench
- PRIMEVUL
- VulnHub
- ChatPHP-DB
- EvalPlus
Benchmarks
- SWE-bench Lite
- ProjectDev
- API-Bank
- ToolBench
- LLMARENA

