Survey: how LLMs and LLM-based agents reshape software engineering workflows

Overview

Decision Snapshot: Needs Validation

Agent frameworks show strong practical gains on multi-step, repo-level tasks but rely on custom datasets and early-stage systems; validate with small pilots before rollout.

Citations: 15

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 45%

Authors

Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, Huaming Chen

Links

Abstract / PDF

Why It Matters For Business

LLM-based agents enable higher automation for multi-step engineering tasks (planning, tool use, testing) and often raise real pass rates and reduce human iteration; single LLMs remain cheaper for isolated code generation or simple analysis.

Who Should Care

CTO, Product Manager, Engineering Lead, ML Engineer, Founder

Summary TLDR

This survey reviews 139 papers (late 2023–2024) comparing standard large language models (LLMs) and LLM-based agents across six software engineering areas: requirements, code generation, autonomous decision-making, design/evaluation, test generation, and security/maintenance. It maps tasks, benchmarks, metrics, and models; proposes agent criteria (decision core, tool use, planning, evaluation, multi-turn context, learning); and shows agent systems scale better on multi-step workflows (multi-agent role division, tool integration, memory) while single LLMs remain cost-effective for isolated tasks. The paper highlights gaps: no unified agent standard, sparse interactive benchmarks, and the need

Problem Statement

Practitioners and researchers lack a clear, unified view of when a large language model is just a powerful generator and when it qualifies as an LLM-based agent (a system that plans, decides, uses tools, evaluates solutions, keeps context, and learns). This ambiguity blocks standardized benchmarks, fair comparisons, and practical adoption in software engineering workflows.

Main Contribution

Collected and analyzed 139 papers (DBLP + arXiv) on LLM and LLM-based agent use in software engineering (six topic areas).

Defined practical criteria to classify an LLM architecture as an ‘agent’ (brain + planning + autonomous tool use + evaluation + multi-turn + learning).

Key Findings

Survey corpus and venue split — the review covers 139 papers and many are preprints.

Numbers: 139 papers; arXiv accounts for 40.3% of papers

Practical Use: Expect rapidly changing results and many early-stage systems; validate claims on stable peer-reviewed datasets before production use.

Evidence Ref: Introduction, Fig.1 and venue distribution paragraphs

An agent-style framework can dramatically raise task pass rates in user studies.

Numbers: AISD raised the use-case pass rate to 75.2%, versus 24.1% without human involvement

Practical Use: For end-to-end tasks, pilot agent workflows (task decomposition + feedback loop) to get higher real-world pass rates.

Evidence Ref: Section IV.B (AISD result)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Use case pass rate (AISD)	75.2% with AISD vs 24.1% baseline	no human involvement	+51.1 pp	CAASD / AISD study	Section IV.B describes AISD experiment increasing pass rates with agent loop	[79]
Pass@1 (L2MAC)	90.2% Pass@1 on HumanEval	GPT-4 / Reflexion comparisons	reported as SOTA in paper	HumanEval	Section V.B L2MAC claims strong HumanEval performance	[104]
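The Pass@1 figure above is one instance of the pass@k family of metrics. As a minimal sketch, the widely used unbiased estimator introduced alongside HumanEval can be computed as below. The function names are illustrative, and this summary does not state which estimator each surveyed paper actually used, so treat this as an assumption rather than the papers' exact procedure.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # contains no passing sample. math.comb returns 0 when k > n - c,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n, the fraction of samples that pass.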

What To Try In 7 Days

Run a small agent pilot: pick one repo task (e.g., implement feature + tests) and compose a two-agent pipeline (planner + coder) to measure pass@1 vs single LLM.

Add RAG to a code-search workflow to serve large codebases while controlling token cost.

Benchmark current workflows on HumanEval/MBPP and one repo-specific test suite to measure gains from iteration or tool execution loops.
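The first pilot above (planner + coder versus a single LLM) could be wired up roughly as follows. This is a sketch under stated assumptions: `LLM` is a hypothetical prompt-in, text-out callable, the prompts are placeholders, and the two stub models are deliberately artificial so the harness runs offline. They say nothing about real model behavior.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def single_llm(task: str, llm: LLM) -> str:
    return llm(f"Write Python code for: {task}")


def planner_coder(task: str, planner: LLM, coder: LLM) -> str:
    # Step 1: the planner decomposes the task; step 2: the coder sees the plan.
    plan = planner(f"List the implementation steps for: {task}")
    return coder(f"Task: {task}\nPlan:\n{plan}\nWrite Python code.")


def passes(code: str, test: str) -> bool:
    """Run code, then its test, in a fresh namespace; any exception is a fail.

    exec() is NOT a sandbox. Use an isolated runtime for real model output.
    """
    scope: dict = {}
    try:
        exec(code, scope)
        exec(test, scope)
        return True
    except Exception:
        return False


def pass_at_1(generate: Callable[[str], str], suite: list[tuple[str, str]]) -> float:
    """Fraction of (task, test) pairs solved with one attempt each."""
    return sum(passes(generate(task), test) for task, test in suite) / len(suite)


# Offline stubs (assumptions, for illustration only).
def stub_planner(prompt: str) -> str:
    return "1. handle negative input\n2. return the absolute value"


def stub_coder(prompt: str) -> str:
    if "Plan:" in prompt:
        return "def f(x):\n    return abs(x)"
    return "def f(x):\n    return x"


suite = [("absolute value as f(x)", "assert f(-2) == 2")]
baseline = pass_at_1(lambda t: single_llm(t, stub_coder), suite)
pipeline = pass_at_1(lambda t: planner_coder(t, stub_planner, stub_coder), suite)
```

Replacing the stubs with real model calls and the suite with a repo-specific test suite gives the side-by-side comparison the pilot asks for.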

Agent Features

Memory

experience pools (ExpeL); shared databases / retrieval memories (RAG/Graph-RAG); dynamic code graphs (DCGG)

Planning

ReAct (reason+act) style planning; explicit planning agent (step decomposition); sprint/Agile-driven planning modules

Tool Use

autonomous tool selection and API calls; specialized agent-tool interfaces (SWE-agent ACI); executable code actions (CodeAct)
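The planning and tool-use features listed here share a common control loop: the model picks an action, a tool runs it, and the observation is fed back. A minimal sketch of such a loop is shown below. The `policy` callable stands in for the LLM, and the tool registry and message format are hypothetical. It does not reproduce the interface of SWE-agent, CodeAct, or any other named framework.

```python
from typing import Callable

Tool = Callable[[str], str]
Policy = Callable[[list[str]], tuple[str, str]]  # history -> (action, payload)


def run_agent(policy: Policy, tools: dict[str, Tool], task: str,
              max_steps: int = 5) -> str:
    """Alternate between choosing an action and observing a tool result."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        action, payload = policy(history)
        if action == "finish":
            return payload
        tool = tools.get(action)
        # An unknown tool becomes an observation instead of an exception,
        # so the policy gets a chance to recover on the next step.
        observation = tool(payload) if tool else f"error: unknown tool {action!r}"
        history.append(f"{action}({payload}) -> {observation}")
    return "stopped: step limit reached"


# Scripted stand-in for an LLM policy (assumption, for illustration only).
def scripted_policy(history: list[str]) -> tuple[str, str]:
    if len(history) == 1:
        return ("upper", "ok")
    return ("finish", history[-1].split(" -> ")[-1])


tools: dict[str, Tool] = {"upper": lambda text: text.upper()}
answer = run_agent(scripted_policy, tools, "shout ok")
```

The step limit bounds cost and guards against a policy that never emits `finish`.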

Frameworks

MetaGPT, AgentCoder, CodeAgent, ExpeL, Reflexion, SWE-agent, AGILECoder

Is Agentic

Yes

Architectures

single-agent (LLM core); multi-agent role-based pipelines; hierarchical agents (manager + workers)

Collaboration

role division (retrieval/planning/coding/debugging); voting and multi-agent discussion; debate/verifier mechanisms (MAD)
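Of the collaboration mechanisms listed, voting is the simplest to illustrate. The sketch below assumes each agent returns a plain-text answer and that agreement is an exact string match after trimming whitespace. Real systems may compare answers by test outcome or semantic similarity instead.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent answer and the share of agents that gave it."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answer.strip() for answer in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)
```

The returned share can serve as a crude agreement signal, for example to escalate to a human or a verifier agent when it falls below a chosen threshold.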

Optimization Features

Token Efficiency

RAG to reduce long-context token use; selective retrieval + context compression
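Selective retrieval under a token budget can be sketched as follows. This is an assumption-heavy toy: relevance is plain keyword overlap rather than embeddings, and the token count is approximated by the number of identifier-like words. A real deployment would swap in its own retriever and tokenizer.

```python
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"[A-Za-z_]\w*", text.lower())


def retrieve(query: str, chunks: list[str], budget: int, k: int = 3) -> list[str]:
    """Pick up to k relevant chunks whose combined size fits the budget."""
    query_terms = set(tokenize(query))
    scored = [(len(query_terms & set(tokenize(c))), c) for c in chunks]
    # sorted() is stable, so equally scored chunks keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    picked, used = [], 0
    for score, chunk in scored[:k]:
        if score == 0:
            break  # remaining chunks share no terms with the query
        cost = len(tokenize(chunk))  # crude proxy for a real token count
        if used + cost <= budget:
            picked.append(chunk)
            used += cost
    return picked


chunks = [
    "def parse_config(path): load yaml config",
    "def send_email(to, body): smtp send",
    "class ConfigError(Exception): bad config",
]
context = retrieve("where is config parsed", chunks, budget=6)
```

Only the selected chunks are placed in the prompt, which is how retrieval keeps long-context token use in check for large codebases.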

Infra Optimization

containerized safe runtimes (GoEx); agent-friendly ACI shells to reduce parsing overhead

Model Optimization

LoRA; instruction fine-tuning for task alignment

System Optimization

role specialization to limit per-agent context; dynamic agent scaling (SoA self-organized agents)

Training Optimization

noisy embedding fine-tuning (NEFTune) to reduce overfitting; experience replay via language feedback (Reflexion/ExpeL)

Inference Optimization

batch prompting and batched API calls; tool invocation to limit token footprint (Toolformer)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

No unified definition or standard benchmark to qualify an LLM as an agent.

Many agent studies use custom or small datasets, limiting generalizability.

When Not To Use

Do not use a full multi-agent setup for single-shot tasks or simple code snippets; a single LLM is cheaper and simpler.

Avoid deploying autonomous agents on critical systems before rigorous sandboxed validation and human oversight.

Failure Modes

Hallucination leading to incorrect or insecure code even after multi-agent steps.

Tool-call brittleness and unrecoverable side-effects if tool interfaces are not sandboxed.
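To make the sandboxing concern concrete, the sketch below runs generated code in a separate interpreter process with a timeout and a throwaway working directory. It is a starting point only and an assumption of this example, not a technique taken from the survey: process isolation does not block network or file-system access, so a container or VM is still needed for untrusted code.

```python
import subprocess
import sys
import tempfile


def run_isolated(code: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Run code in a child interpreter; return (succeeded, output or reason)."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                # -I is Python's isolated mode: it ignores PYTHON* environment
                # variables and the user site-packages directory.
                [sys.executable, "-I", "-c", code],
                cwd=workdir,          # relative writes land in a temp dir
                capture_output=True,
                text=True,
                timeout=timeout_s,    # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
        return proc.returncode == 0, (proc.stdout or proc.stderr).strip()
```

Returning a failure tuple instead of raising lets an agent loop treat crashes and timeouts as ordinary observations.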

Core Entities

Models

GPT-4, GPT-3.5 (ChatGPT), Codex, CodeLlama, LLaMA, CodeGen, CodeT5+, WizardCoder, StarCoder

Metrics

Pass@k, Accuracy, F1 Score, Precision/Recall, Pass@1, Execution Rate, Use Case Pass Rate, Human Likert scores (RUST)

Datasets

HumanEval, MBPP, HumanEval-ET, MBPP-ET, Defects4J, CAASD, EvalGPTFix, SWE-bench

Benchmarks

HumanEval, MBPP, Defects4J, ToolBench, APIBench, HotpotQA, FEVER

Context Entities

Models

Gemini, PaLM, Claude, LLaMA2, CodeGeeX

Metrics

Win Rate, Success Rate, Cost / Token Consumption, Execution Effectiveness, Human revision cost

Datasets

BigCloneBench, PRIMEVUL, VulnHub, ChatPHP-DB, EvalPlus

Benchmarks

SWE-bench Lite, ProjectDev, API-Bank, ToolBench, LLMARENA
