Overview
The paper is a review that aggregates recent results and benchmarks rather than reporting new experiments. Evidence strength varies by cited study, and the benchmarks show substantial gaps in real-world task performance.
Citations: 9
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/3
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
LLM agents can automate complex multi-step digital tasks but are currently brittle. Invest in tool integration, retrieval, and realistic evaluation before production to avoid failures and loss of user trust.
Who Should Care
Summary TLDR
This is a concise, practice-focused survey of how large language models (LLMs) are used to build autonomous agents. It covers core building blocks (memory, planning, action/tool use, prompting), recent reasoning methods (CoT, Tree/Graph of Thoughts, ReAct, Reflexion), evaluation toolkits (AgentBench, WebArena, ToolLLM/ToolBench), and persistent gaps: multimodality, human alignment, hallucinations, and realistic evaluation. The paper highlights that tools and retrieval are key levers to ground agents, while current LLMs still fail on long-horizon, web-style tasks.
Problem Statement
LLM-powered agents promise broad automation but fail in practice on long, multi-step, multimodal tasks. Models lack reliable long-term reasoning, grounded knowledge access, and tool competence, and the field lacks standard evaluation benchmarks that reflect real-world complexity.
Main Contribution
Survey of building blocks for LLM agents: memory, planning, and action (tool use).
Review of reasoning and prompting advances used in agents (CoT, self-consistency, Tree/Graph of Thoughts, ReAct, Reflexion).
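To make the memory, planning, and action building blocks concrete, here is a minimal sketch of a ReAct-style loop in Python. It is an illustration under stated assumptions, not the survey's implementation: `scripted_policy` is a hypothetical stand-in for an LLM call, `calculator` is a hypothetical tool, and the memory is just a list of strings.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool: evaluates "a op b" for a few operators (illustration only).
def calculator(expression: str) -> str:
    left, op, right = expression.split()
    a, b = float(left), float(right)
    results = {"+": a + b, "-": a - b, "*": a * b}
    return str(results[op])

# Action space: tool name -> callable taking and returning a string.
TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator}

def scripted_policy(memory: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM planner: returns (action, argument).

    A real agent would prompt a model with the memory contents; this stub
    hard-codes a two-step plan so the sketch stays runnable offline.
    """
    observations = [e for e in memory if e.startswith("observation:")]
    if not observations:
        return "calculator", "78.24 - 14.41"
    return "finish", observations[-1].split(": ", 1)[1]

def run_agent(task: str,
              policy: Callable[[List[str]], Tuple[str, str]] = scripted_policy,
              max_steps: int = 5) -> str:
    memory: List[str] = [f"task: {task}"]        # memory: running transcript
    for _ in range(max_steps):                   # bounded horizon
        action, argument = policy(memory)        # planning: choose next step
        if action == "finish":
            return argument
        observation = TOOLS[action](argument)    # action: call the tool
        memory.append(f"action: {action}({argument})")
        memory.append(f"observation: {observation}")
    return "gave up: step budget exhausted"
```

Calling `run_agent("human vs. agent gap")` runs one tool step and then finishes with the tool's output. The `max_steps` budget is one simple guard against runs that never terminate.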
Key Findings
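Agents built for realistic web tasks still perform far below humans: the best GPT-4-based agent reaches 14.41% end-to-end task success on WebArena versus 78.24% for humans.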
Benchmarks reveal a wide gap between top commercial LLMs and open-source models when used as agents.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| WebArena end-to-end task success (GPT-4 agent) | 14.41% | human 78.24% | -63.83pp | WebArena (sec 4.3.1) | Best GPT-4-based agent achieves 14.41% vs human 78.24% | WebArena sec 4.3.1 |
| LLMs evaluated in AgentBench | 27 LLMs tested | — | — | AgentBench (sec 4.2.1) | Extensive tests over 27 API-based and OSS LLMs | AgentBench sec 4.2.1 |
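The Delta column in the WebArena row is a difference in percentage points (pp), not a relative change. A small sketch of the arithmetic, using only the two figures from the table; the function names are illustrative:

```python
from decimal import Decimal

def delta_pp(value_pct: str, baseline_pct: str) -> Decimal:
    """Absolute gap in percentage points: value minus baseline."""
    return Decimal(value_pct) - Decimal(baseline_pct)

def share_of_baseline_pct(value_pct: str, baseline_pct: str) -> Decimal:
    """Value expressed as a percentage of the baseline (a relative measure)."""
    return Decimal(value_pct) / Decimal(baseline_pct) * 100

gap = delta_pp("14.41", "78.24")                 # Decimal("-63.83")
share = share_of_baseline_pct("14.41", "78.24")  # roughly 18% of the human rate
```

Reporting the gap in percentage points avoids confusing the 63.83 pp shortfall with the agent's success rate relative to the human baseline.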
What To Try In 7 Days
Run a small WebArena scenario to measure your chosen LLM's real task success.
Add a RAG layer (vector DB + retriever) to an existing chatbot to reduce factual errors.
Prototype one API-call workflow with LangChain or a lightweight API retriever to validate tool-use.
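As a starting point for the RAG item above, the retrieval step can be sketched without any external service. The example below is a simplified stand-in, assuming bag-of-words cosine similarity in place of a real embedding model and vector DB; the documents, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

def tokenize(text: str) -> Counter:
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[Tuple[float, str]]:
    """Return the top_k (score, document) pairs, highest score first."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def build_prompt(query: str, documents: List[str], top_k: int = 2) -> str:
    """Ground the question in retrieved text before it is sent to a model."""
    context = "\n".join(f"- {doc}" for _, doc in retrieve(query, documents, top_k))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

# Hypothetical knowledge base for illustration.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping is free for orders over 50 dollars.",
]
```

With these documents, `retrieve("How long do refunds take?", DOCS, top_k=1)` ranks the refunds sentence first because it is the only one sharing a term with the query. In a real prototype the `cosine` and `tokenize` pair would be replaced by an embedding model and a vector store.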
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Survey paper — no new experiments or code released here.
Coverage depends on cited literature and may lag newest preprints.
When Not To Use
High-stakes decisions that need verifiable facts without human oversight.
Robotics requiring low-latency closed-loop visual control without a tailored vision stack.
Failure Modes
Long reasoning chains produce incorrect steps or 'logic loops'.
Hallucinations: fluent but unverifiable claims.
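One inexpensive guard against the 'logic loops' failure mode is to watch the agent's trace for repeated identical steps. The sketch below is a simple heuristic, not a method from the survey; the `Step` shape and the `max_repeats` threshold are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Step = Tuple[str, str]  # (action name, argument); hypothetical trace format

def first_repeated_step(steps: Iterable[Step], max_repeats: int = 2) -> Optional[int]:
    """Return the index where some step occurs more than max_repeats times.

    Returns None if no step exceeds the threshold.
    """
    seen: Counter = Counter()
    for index, step in enumerate(steps):
        seen[step] += 1
        if seen[step] > max_repeats:
            return index
    return None

# Hypothetical trace: the agent keeps re-issuing the same search.
trace = [
    ("search", "price"),
    ("click", "item"),
    ("search", "price"),
    ("search", "price"),
    ("click", "buy"),
]
loop_at = first_repeated_step(trace)  # index of the third identical search
```

An agent runner could abort or re-plan when the function returns an index. Exact-match counting misses loops whose arguments differ slightly, so it is a floor on loop detection rather than a complete check.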

