Survey: how LLMs and LLM-based agents reshape software engineering workflows

August 5, 2024 · 9 min

Overview

Decision Snapshot: Needs Validation

Agent frameworks show strong practical gains on multi-step, repo-level tasks but rely on custom datasets and early-stage systems; validate with small pilots before rollout.

Citations: 15

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 45%

Authors

Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, Huaming Chen

Why It Matters For Business

LLM-based agents enable higher automation for multi-step engineering tasks (planning, tool use, testing) and often raise real-world pass rates and reduce human iteration; single LLMs remain cheaper for isolated code generation or simple analysis.

Who Should Care

Summary TLDR

This survey reviews 139 papers (late 2023–2024) comparing standard large language models (LLMs) and LLM-based agents across six software engineering areas: requirements, code generation, autonomous decision-making, design/evaluation, test generation, and security/maintenance. It maps tasks, benchmarks, metrics, and models; proposes agent criteria (decision core, tool use, planning, evaluation, multi-turn context, learning); and shows agent systems scale better on multi-step workflows (multi-agent role division, tool integration, memory) while single LLMs remain cost-effective for isolated tasks. The paper highlights gaps: no unified agent standard, sparse interactive benchmarks, and the need

Problem Statement

Practitioners and researchers lack a clear, unified view of when a large language model is just a powerful generator and when it qualifies as an LLM-based agent (a system that plans, decides, uses tools, evaluates solutions, keeps context, and learns). This ambiguity blocks standardized benchmarks, fair comparisons, and practical adoption in software engineering workflows.

Main Contribution

Collected and analyzed 139 papers (DBLP + arXiv) on LLM and LLM-based agent use in software engineering (six topic areas).

Defined practical criteria to classify an LLM architecture as an ‘agent’ (brain + planning + autonomous tool use + evaluation + multi-turn + learning).

Key Findings

Survey corpus and venue split: the review covers 139 papers, many of them preprints.

Numbers: 139 papers; arXiv accounts for 40.3% of papers

Practical Use: Expect rapidly changing results and many early-stage systems; validate claims on stable peer-reviewed datasets before production use.

Evidence Ref: Introduction, Fig.1 and venue distribution paragraphs

An agent-style framework can dramatically raise task pass rates in user studies.

Numbers: AISD increased use-case pass rate to 75.2% vs 24.1% without human involvement

Practical Use: For end-to-end tasks, pilot agent workflows (task decomposition + feedback loop) to get higher real-world pass rates.

Evidence Ref: Section IV.B (AISD result)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Use case pass rate (AISD) | 75.2% with AISD vs 24.1% baseline | no human involvement | +51.1 pp | CAASD / AISD study | Section IV.B describes AISD experiment increasing pass rates with agent loop | [79]
Pass@1 (L2MAC) | 90.2% Pass@1 on HumanEval | GPT-4 / Reflexion comparisons | reported as SOTA in paper | HumanEval | Section V.B L2MAC claims strong HumanEval performance | [104]
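
The Pass@1 figure in the table is an instance of the Pass@k metric. The sketch below shows the unbiased estimator commonly used with HumanEval, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The survey does not say which estimator each cited paper used, so treat this as one plausible way to reproduce such numbers; the function names are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

With a single sample per problem (n = 1, k = 1), this reduces to the plain fraction of problems solved on the first attempt.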

What To Try In 7 Days

Run a small agent pilot: pick one repo task (e.g., implement feature + tests) and compose a two-agent pipeline (planner + coder) to measure pass@1 vs single LLM.

Add RAG to a code-search workflow to serve large codebases while controlling token cost.

Benchmark current workflows on HumanEval/MBPP and one repo-specific test suite to measure gains from iteration or tool execution loops.
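
The pilot described above (planner + coder vs a single LLM, scored by pass@1) can be sketched as a small harness. Everything here is an assumption for illustration: the `Task` shape, the `two_agent` composition, and especially the three `stub_*` functions, which are canned stand-ins for real model calls. The numbers they produce say nothing about real models; they only show that the harness can tell two pipelines apart.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Generator = Callable[[str], str]  # prompt -> candidate source code


@dataclass
class Task:
    prompt: str
    # Receives the namespace of the executed candidate and returns pass/fail.
    check: Callable[[Dict[str, object]], bool]


def passes(source: str, task: Task) -> bool:
    """Run one candidate and apply the task's check; any exception is a fail."""
    namespace: Dict[str, object] = {}
    try:
        exec(source, namespace)  # NOTE: unsandboxed; isolate this in a real pilot
        return bool(task.check(namespace))
    except Exception:
        return False


def pass_at_1(generate: Generator, tasks: List[Task]) -> float:
    """Fraction of tasks solved by the first (and only) candidate."""
    return sum(passes(generate(t.prompt), t) for t in tasks) / len(tasks)


def two_agent(planner: Generator, coder: Generator) -> Generator:
    """Compose a planner and a coder into one generator (hypothetical design)."""
    def pipeline(prompt: str) -> str:
        plan = planner(prompt)
        return coder(f"{prompt}\n\nPlan:\n{plan}")
    return pipeline


# Canned stand-ins for model calls (hypothetical, not real model behavior).
def stub_single(prompt: str) -> str:
    return "def add(a, b):\n    return a - b\n"  # deliberately wrong


def stub_planner(prompt: str) -> str:
    return "1. Return the sum of both arguments."


def stub_coder(prompt: str) -> str:
    return "def add(a, b):\n    return a + b\n"


tasks = [Task("Write add(a, b).", lambda ns: ns["add"](2, 3) == 5)]
baseline = pass_at_1(stub_single, tasks)
agentic = pass_at_1(two_agent(stub_planner, stub_coder), tasks)
```

In a real pilot, the stubs would be replaced by calls to the models under test, and the candidate code would run in an isolated runtime, in line with the sandboxing warning under Risks & Boundaries.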

Agent Features

Memory: experience pools (ExpeL); shared databases / retrieval memories (RAG/Graph-RAG); dynamic code graphs (DCGG)
Planning: ReAct (reason+act) style planning; explicit planning agent (step decomposition); sprint/Agile-driven planning modules
Tool Use: autonomous tool selection and API calls; specialized agent-tool interfaces (SWE-agent ACI); executable code actions (CodeAct)
Frameworks: MetaGPT, AgentCoder, CodeAgent, ExpeL, Reflexion, SWE-agent, AGILECoder
Is Agentic: Yes
Architectures: single-agent (LLM core); multi-agent role-based pipelines; hierarchical agents (manager + workers)
Collaboration: role division (retrieval/planning/coding/debugging); voting and multi-agent discussion; debate/verifier mechanisms (MAD)

Optimization Features

Token Efficiency: RAG to reduce long-context token use; selective retrieval + context compression
Infra Optimization: containerized safe runtimes (GoEx); agent-friendly ACI shells to reduce parsing overhead
Model Optimization: LoRA; instruction fine-tuning for task alignment
System Optimization: role specialization to limit per-agent context; dynamic agent scaling (SoA self-organized agents)
Training Optimization: Noisy embedding fine-tuning (NEFTune) to reduce overfitting; experience replay via language feedback (Reflexion/ExpeL)
Inference Optimization: batch prompting and batched API calls; tool-invocation to limit token footprint (Toolformer)
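
As a rough illustration of the Token Efficiency entry (selective retrieval to keep context small), the sketch below picks the code chunks most relevant to a query while staying under a token budget. It is a simplification under stated assumptions: plain lexical overlap stands in for an embedding retriever, token cost is approximated by counting alphanumeric words rather than using a model tokenizer, and `select_context` is a hypothetical helper, not an API from any framework named in the survey.

```python
import re
from typing import List, Tuple


def tokens(text: str) -> List[str]:
    """Crude tokenizer: lowercase alphanumeric runs (splits snake_case names)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_context(query: str, chunks: List[str], budget: int) -> List[str]:
    """Greedily keep the highest-overlap chunks whose total cost fits the budget."""
    query_terms = set(tokens(query))
    scored: List[Tuple[int, int, str]] = []
    for idx, chunk in enumerate(chunks):
        overlap = len(query_terms & set(tokens(chunk)))
        if overlap:  # drop chunks that share no term with the query
            scored.append((overlap, idx, chunk))
    # Highest overlap first; ties keep original order.
    scored.sort(key=lambda item: (-item[0], item[1]))

    picked: List[str] = []
    used = 0
    for _, _, chunk in scored:
        cost = len(tokens(chunk))  # approximate token cost of this chunk
        if used + cost <= budget:
            picked.append(chunk)
            used += cost
    return picked


chunks = [
    "def parse_config(path): return load(path)",
    "def render_page(html): return html",
    "def load(path): return open(path).read()",
]
context = select_context("parse config path", chunks, budget=10)
```

The budget is the lever for token cost: raising it admits lower-ranked chunks, lowering it forces the prompt to carry only the best matches.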

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

No unified definition or standard benchmark to qualify an LLM as an agent.

Many agent studies use custom or small datasets, limiting generalizability.

When Not To Use

Do not build a full multi-agent pipeline for single-shot tasks or simple code snippets; a single LLM is cheaper and simpler.

Avoid deploying autonomous agents on critical systems before rigorous sandboxed validation and human oversight.

Failure Modes

Hallucination leading to incorrect or insecure code even after multi-agent steps.

Tool-call brittleness and unrecoverable side-effects if tool interfaces are not sandboxed.

Core Entities

Models

GPT-4, GPT-3.5 (ChatGPT), Codex, CodeLlama, LLaMA, CodeGen, CodeT5+, WizardCoder, StarCoder

Metrics

Pass@k, Accuracy, F1 Score, Precision/Recall, Pass@1, Execution Rate, Use Case Pass Rate, Human Likert scores (RUST)

Datasets

HumanEval, MBPP, HumanEval-ET, MBPP-ET, Defects4J, CAASD, EvalGPTFix, SWE-bench

Benchmarks

HumanEval, MBPP, Defects4J, ToolBench, APIBench, HotpotQA, FEVER

Context Entities

Models

Gemini, PaLM, Claude, LLaMA2, CodeGeeX

Metrics

Win Rate, Success Rate, Cost / Token Consumption, Execution Effectiveness, Human revision cost

Datasets

BigCloneBench, PRIMEVUL, VulnHub, ChatPHP-DB, EvalPlus

Benchmarks

SWE-bench Lite, ProjectDev, API-Bank, ToolBench, LLMARENA