A practical review of how LLM-based autonomous agents are built, extended, and tested

Overview

Decision Snapshot: Needs Validation

The paper aggregates recent results and benchmarks but is a review; evidence varies by cited study and benchmarks show substantial gaps in real-world task performance.

Citations: 9

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Saikat Barua

Links

Abstract / PDF

Why It Matters For Business

LLM agents can automate complex multi-step digital tasks but are currently brittle; invest in tool integration, retrieval, and realistic evaluation before production to avoid failures and user trust loss.

Who Should Care

CTO, Product Manager, ML Engineer, Founder, Engineering Lead

Summary TLDR

This is a concise, practice-focused survey of how large language models (LLMs) are used to build autonomous agents. It covers core building blocks (memory, planning, action/tool use, prompting), recent reasoning methods (CoT, Tree-of-Thoughts, Graph-of-Thoughts, ReAct, Reflexion), evaluation toolkits (AgentBench, WebArena, ToolLLM/ToolBench), and persistent gaps: multimodality, human alignment, hallucinations, and realistic evaluation. The paper highlights that tools and retrieval are key levers for grounding agents, while current LLMs still fail on long-horizon, web-style tasks.

Problem Statement

LLM-powered agents promise broad automation but fail in practice on long, multi-step, multimodal tasks because models lack reliable long-term reasoning, grounded knowledge access, tool competence, and standard evaluation benchmarks that reflect real-world complexity.

Main Contribution

Survey of building blocks for LLM agents: memory, planning, and action (tool use).

Review of reasoning and prompting advances used in agents (CoT, self-consistency, Tree/Graph of Thoughts, ReAct, Reflexion).

Key Findings

Agents built for realistic web tasks still perform far below humans.

Numbers: GPT-4 agent task success 14.41% vs human 78.24%

Practical Use: Don't expect out-of-the-box LLMs to solve long web workflows; add tool wrappers, retrieval, and staged testing before production use.

Evidence Ref: WebArena (sec 4.3.1)

Benchmarks reveal a wide gap between top commercial LLMs and open-source models when used as agents.

Numbers: Evaluation covered 27 API-based and OSS LLMs

Practical Use: Choose higher-capability models or invest in fine-tuning and tool/data augmentation for OSS models before deploying agentic behavior.

Evidence Ref: AgentBench (sec 4.2.1)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
WebArena end-to-end task success (GPT-4 agent)	14.41%	human 78.24%	-63.83pp	WebArena (sec 4.3.1)	Best GPT-4-based agent achieves 14.41% vs human 78.24%	WebArena sec 4.3.1
LLMs evaluated in AgentBench	27 LLMs tested	—	—	AgentBench (sec 4.2.1)	Extensive tests over 27 API-based and OSS LLMs	AgentBench sec 4.2.1
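The Delta column in the WebArena row is a percentage-point difference (agent minus human). A minimal sketch of that arithmetic, using only the two figures reported in the table; the function name and the ratio are illustrative and not taken from the paper:

```python
def percentage_point_delta(value_pct: float, baseline_pct: float) -> float:
    """Return value minus baseline in percentage points, rounded to 2 places."""
    return round(value_pct - baseline_pct, 2)


agent_success = 14.41  # best GPT-4-based agent on WebArena (from the table)
human_success = 78.24  # human baseline on WebArena (from the table)

delta_pp = percentage_point_delta(agent_success, human_success)
ratio = round(agent_success / human_success, 3)  # illustrative extra, not in the table

print(f"delta: {delta_pp:+.2f}pp")      # delta: -63.83pp
print(f"agent/human ratio: {ratio}")
```

Reporting the gap in percentage points rather than as a relative change keeps it directly comparable with the two success rates.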

What To Try In 7 Days

Run a small WebArena scenario to measure your chosen LLM's real task success.

Add a RAG layer (vector DB + retriever) to an existing chatbot to reduce factual errors.

Prototype one API-call workflow with LangChain or a lightweight API retriever to validate tool-use.
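To make the RAG suggestion concrete, here is a minimal sketch of the retrieve-then-prompt step. It is an assumption-heavy stand-in: word counts replace a real embedding model, an in-memory list replaces a vector DB, and the documents, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Put retrieved passages in front of the question to ground the answer."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base for illustration only.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Support is available by email on weekdays.",
    "Shipping to Canada takes 7 to 10 days.",
]

prompt = build_prompt("How long do refunds take?", DOCS, k=1)
print(prompt)
```

In a real prototype the `embed` and `retrieve` functions would be replaced by an embedding model and a vector store; the prompt-assembly step stays essentially the same.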

Agent Features

Memory

short-term context window

hierarchical memory (cache, vector DB, summaries)

key-value cache / KV caching
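One way to picture a memory hierarchy is a small recent-turn window backed by a searchable archive. The sketch below is illustrative only: the class name is hypothetical, a plain list stands in for a vector DB, and keyword overlap stands in for embedding search.

```python
from collections import deque


class HierarchicalMemory:
    """Two tiers: a short recent window plus an archive of evicted turns."""

    def __init__(self, window: int = 3):
        self.window = window
        self.recent = deque()  # short-term tier: the last `window` turns
        self.archive = []      # long-term tier (stand-in for a vector DB)

    def add(self, turn: str) -> None:
        """Append a turn; move the oldest turns to the archive on overflow."""
        self.recent.append(turn)
        while len(self.recent) > self.window:
            self.archive.append(self.recent.popleft())

    def recall(self, query: str, k: int = 1) -> list[str]:
        """Return up to k archived turns sharing the most words with the query."""
        q = set(query.lower().split())
        scored = [(len(q & set(t.lower().split())), t) for t in self.archive]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [t for _, t in scored[:k]]

    def context(self, query: str) -> list[str]:
        """Recalled long-term items first, then the recent window."""
        return self.recall(query) + list(self.recent)


mem = HierarchicalMemory(window=2)
for turn in ["user name is Ada", "user likes tea", "order 17 shipped", "weather is sunny"]:
    mem.add(turn)

print(mem.context("what is the user name"))
# ['user name is Ada', 'order 17 shipped', 'weather is sunny']
```

The point of the design is that the prompt only ever carries the recent window plus a few recalled items, so context length stays bounded as the conversation grows.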

Planning

task decomposition

chain-of-thought / self-consistency

Tree-of-Thoughts / Graph-of-Thoughts

environment-feedback loops (ReAct, Reflexion)
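Self-consistency can be sketched as sampling several reasoning paths and keeping the most common final answer. The snippet below assumes a hypothetical `sample_answer` callable that returns the final answer of one sampled chain of thought; a canned sampler replaces a real model call so the example runs on its own.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistency(
    sample_answer: Callable[[str], str],
    question: str,
    n_samples: int = 5,
) -> Tuple[str, float]:
    """Sample n answers and return the majority answer with its vote share."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples


# Canned outputs standing in for five sampled reasoning paths (hypothetical).
_canned = iter(["18", "18", "20", "18", "17"])


def fake_sampler(question: str) -> str:
    """Stand-in for an LLM call made with a nonzero sampling temperature."""
    return next(_canned)


answer, agreement = self_consistency(fake_sampler, "What is 6 * 3?", n_samples=5)
print(answer, agreement)  # 18 0.6
```

The vote share doubles as a rough confidence signal: a low share suggests the question may need another strategy or human review.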

Tool Use

API calling (REST)

code execution

web search

database (SQL) queries
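Tool use usually comes down to mapping a structured action emitted by the model onto a vetted function. A minimal sketch, assuming the model emits JSON of the form `{"tool": ..., "args": {...}}`; that format, the tool names, and the throwaway `orders` table are illustrative assumptions, not something the paper specifies.

```python
import json
import sqlite3


def run_sql(query: str) -> list:
    """Run a SELECT against a throwaway in-memory database (demo data only)."""
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
        return conn.execute(query).fetchall()
    finally:
        conn.close()


def add(a: float, b: float) -> float:
    """Trivial arithmetic tool."""
    return a + b


# Registry of tools the agent is allowed to call.
TOOLS = {"sql": run_sql, "add": add}


def dispatch(action_json: str) -> dict:
    """Parse a model-emitted action, run the tool, and return a structured result."""
    try:
        action = json.loads(action_json)
        tool = TOOLS[action["tool"]]
        return {"ok": True, "result": tool(**action["args"])}
    except Exception as exc:  # errors go back to the agent as observations
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


print(dispatch('{"tool": "sql", "args": {"query": "SELECT SUM(total) FROM orders"}}'))
print(dispatch('{"tool": "shell", "args": {}}'))
```

Returning errors as data instead of raising lets the agent see the failure and retry with corrected arguments, and the explicit registry keeps unlisted tools unreachable.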

Frameworks

LangChain, Auto-GPT, LiteLLM, ToolLLM, MemGPT, LlamaIndex

Is Agentic

Yes

Architectures

LLM + tools (planner-executor)

planner-executor with memory hierarchy

single-agent and multi-agent compositions

Collaboration

multi-agent orchestration (AutoGen, multi-agent chat)

model-to-model orchestration (HuggingGPT)

Optimization Features

Token Efficiency

prompt and prefix tuning

context summarization

System Optimization

use of vector DBs to reduce context length

Training Optimization

instruction tuning on code and multi-turn data

Inference Optimization

paged attention / memory management (PagedAttention)

streaming LLM for long contexts (StreamingLLM)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Survey paper — no new experiments or code released here.

Coverage depends on cited literature and may lag newest preprints.

When Not To Use

High-stakes decisions that need verifiable facts without human oversight.

Robotics requiring low-latency closed-loop visual control without a tailored vision stack.

Failure Modes

Long reasoning chains produce incorrect steps or 'logic loops'.

Hallucinations: fluent but unverifiable claims.
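A cheap guard against the 'logic loops' failure mode is to cap the number of steps and abort when the same action keeps recurring. The sketch below is one possible heuristic, not a method from the paper; the thresholds, the `"finish"` sentinel, and the function names are assumptions.

```python
from collections import Counter
from typing import Callable, List, Tuple


def detect_loop(actions: List[str], window: int = 6, max_repeats: int = 3) -> bool:
    """True if any single action occurs max_repeats times within the last `window` steps."""
    recent = actions[-window:]
    return any(count >= max_repeats for count in Counter(recent).values())


def run_with_guards(
    next_action: Callable[[List[str]], str],
    max_steps: int = 20,
) -> Tuple[List[str], str]:
    """Drive an agent policy until it finishes, loops, or exhausts its step budget."""
    history: List[str] = []
    for _ in range(max_steps):
        action = next_action(history)
        if action == "finish":
            return history, "finished"
        history.append(action)
        if detect_loop(history):
            return history, "aborted: repeated action"
    return history, "aborted: step budget exhausted"


# A stuck policy that always proposes the same click (hypothetical).
history, status = run_with_guards(lambda h: "click(next)")
print(len(history), status)  # 3 aborted: repeated action
```

Exact-match repetition misses loops that cycle through paraphrased actions, so in practice this would sit alongside a step budget and human review rather than replace them.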

Core Entities

Models

GPT-4, GPT-3.5, LLaMA, LLaMA-2, ToolLLaMA, USM, AlphaCode

Metrics

end-to-end task success rate

functional correctness

human vs agent success comparison

Datasets

ToolBench, APIBench, AgentBench, WebArena, HouseHolding, Web Shopping, Web Browsing, BigBench, MMLU

Benchmarks

AgentBench, WebArena, ToolBench, APIBench

Context Entities

Models

BERT, T5, BART, RoBERTa

Metrics

human annotation

task success rate

Datasets

RapidAPI Hub (collected APIs)

Benchmarks

BigBench, MMLU

You May Also Want to Read

Systematizes reusable 'agentic skills' for LLM agents, their lifecycle, design patterns, risks, and evaluation

ETAPP: an 800-case sandbox benchmark and key-point LLM evaluator for personalized tool use

TOOLMAKER: agents that turn scientific GitHub repos into executable LLM tools

ToolBH: a multi-level benchmark that finds tool-use hallucinations in LLMs

Let two agents use different retrieval tools and iteratively query the web to cut hallucinations in fact-checking
