Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
9
Why It Matters For Business
LLM agents can automate complex multi-step digital tasks but are currently brittle. Invest in tool integration, retrieval, and realistic evaluation before production to avoid failures and loss of user trust.
Summary TLDR
This is a concise, practice-focused survey of how large language models (LLMs) are used to build autonomous agents. It covers core building blocks (memory, planning, action/tool use, prompting), recent reasoning methods (CoT, Tree-of-Thoughts, Graph-of-Thoughts, ReAct, Reflexion), evaluation toolkits (AgentBench, WebArena, ToolLLM/ToolBench), and persistent gaps: multimodality, human alignment, hallucinations, and realistic evaluation. The paper highlights that tools and retrieval are key levers for grounding agents, while current LLMs still fail on long-horizon, web-style tasks.
Problem Statement
LLM-powered agents promise broad automation but fail in practice on long, multi-step, multimodal tasks. The models lack reliable long-term reasoning, grounded knowledge access, and tool competence, and the field lacks standard evaluation benchmarks that reflect real-world complexity.
Main Contribution
Survey of building blocks for LLM agents: memory, planning, and action (tool use).
Review of reasoning and prompting advances used in agents (CoT, self-consistency, Tree/Graph of Thoughts, ReAct, Reflexion).
Summary of modern evaluation platforms and datasets (AgentBench, WebArena, ToolLLM/ToolBench) and their findings.
Identification of core constraints: multimodality, human alignment, hallucinations, and agent-ecosystem complexity.
Practical recommendations: use tools, retrieval, code training, and multi-turn alignment data to improve agent behavior.
Key Findings
Agents built for realistic web tasks still perform far below humans.
Benchmarks reveal a wide gap between top commercial LLMs and open-source models when used as agents.
Tool-oriented instruction data enables large-scale real-world API use by LLMs.
Retrieval-augmented generation (RAG) is the standard practical method for grounding agents and reducing hallucinations.
Multimodal and speech-capable agents require massive pretraining and special pipelines.
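The retrieval step behind RAG can be sketched in a few lines. This is a minimal illustration, not the survey's method: it scores documents with bag-of-words cosine similarity, where a real system would use an embedding model and a vector DB. The names `retrieve`, `build_prompt`, and `DOCS` are hypothetical, and the prompt wording is an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase word/number tokens; a stand-in for a real embedding model.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    # Place the retrieved passages in the prompt so the model can ground its answer.
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"


# Hypothetical mini-corpus, used only for illustration.
DOCS = [
    "WebArena is a benchmark of realistic web tasks for agents.",
    "ToolBench collects instruction data for calling real-world APIs.",
    "PagedAttention manages key-value cache memory during inference.",
]
```

Calling `build_prompt("Which benchmark covers web tasks?", DOCS, k=1)` yields a prompt whose context contains only the WebArena passage; the string would then be sent to whichever LLM is in use.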
Results
WebArena end-to-end task success (GPT-4 agent)
LLMs evaluated in AgentBench
APIs collected for tool instruction tuning
Who Should Care
What To Try In 7 Days
Run a small WebArena scenario to measure your chosen LLM's real task success.
Add a RAG layer (vector DB + retriever) to an existing chatbot to reduce factual errors.
Prototype one API-call workflow with LangChain or a lightweight API retriever to validate tool-use.
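For the API-call prototype, a library-free sketch is often enough to validate the idea before adopting a framework. The example below assumes the model emits a JSON object of the form `{"tool": ..., "args": {...}}`; that format, the `get_weather` tool, and the `call_tool` helper are all hypothetical. The tool is stubbed so the sketch runs offline.

```python
import inspect
import json


def get_weather(city: str, unit: str = "celsius") -> str:
    # Stand-in for a real REST call; returns canned text so the sketch runs offline.
    return f"Weather for {city} in {unit}: (stubbed)"


# Registry mapping tool names the model may use to Python callables.
TOOLS = {"get_weather": get_weather}


def call_tool(model_output: str) -> dict:
    # Parse the model's JSON request and reject anything malformed before calling.
    try:
        request = json.loads(model_output)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(request, dict):
        return {"ok": False, "error": "request must be a JSON object"}

    name = request.get("tool")
    if not isinstance(name, str) or name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name!r}"}

    args = request.get("args", {})
    if not isinstance(args, dict):
        return {"ok": False, "error": "args must be a JSON object"}

    func = TOOLS[name]
    try:
        # bind() checks argument names and required parameters, not value types.
        inspect.signature(func).bind(**args)
    except TypeError as exc:
        return {"ok": False, "error": f"bad arguments: {exc}"}
    return {"ok": True, "result": func(**args)}
```

Returning a structured error instead of raising lets the agent loop feed the message back to the model for a retry, which is one way to surface wrong tool names and malformed parameters early.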
Agent Features
Memory
- short-term context window
- hierarchical memory (cache, vector DB, summaries)
- key-value (KV) caching
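The memory tiers listed above can be sketched as one small class. This is an illustrative assumption about how the tiers might fit together, not a design taken from the survey: a bounded window holds recent turns verbatim, evicted turns move to an archive (a stand-in for a vector DB), and a running summary is built by naive truncation where a real system would call an LLM summarizer. `HierarchicalMemory` and its methods are hypothetical names.

```python
from collections import deque


class HierarchicalMemory:
    def __init__(self, window: int = 3):
        self.window = window   # max turns kept verbatim in short-term memory
        self.recent = deque()  # short-term tier: most recent turns
        self.archive = []      # long-term tier: evicted turns (vector DB stand-in)
        self.summary = ""      # compressed tier: running summary of evicted turns

    def add(self, turn: str) -> None:
        # Append the new turn, then evict the oldest turns beyond the window.
        self.recent.append(turn)
        while len(self.recent) > self.window:
            old = self.recent.popleft()
            self.archive.append(old)
            # Placeholder summarizer: keep the first 40 characters of each evicted turn.
            self.summary = (self.summary + " " + old[:40]).strip()

    def recall(self, keyword: str) -> list[str]:
        # Keyword lookup over the archive; a real system would use similarity search.
        return [t for t in self.archive if keyword.lower() in t.lower()]

    def context(self) -> str:
        # Assemble the prompt context: summary first, then verbatim recent turns.
        parts = [f"Summary: {self.summary}"] if self.summary else []
        parts.extend(self.recent)
        return "\n".join(parts)
```

The point of the split is that prompt length stays bounded by `window` plus the summary, while older detail remains retrievable on demand through `recall`.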
Planning
- task decomposition
- chain-of-thought / self-consistency
- Tree-of-Thoughts / Graph-of-Thoughts
- environment-feedback loops (ReAct, Reflexion)
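An environment-feedback loop in the ReAct style alternates model output with tool observations until the model commits to an answer. The sketch below is a simplified illustration under stated assumptions: the `Action: name[arg]` and `Final Answer:` text formats are a convention chosen here, and `scripted_llm` is a hard-coded stand-in for a real model so the loop can run without network access.

```python
import re
from typing import Callable, Optional


def run_agent(llm: Callable[[str], str], tools: dict, question: str,
              max_steps: int = 5) -> Optional[str]:
    # Alternate between model output and tool observations until a final answer.
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        final = re.search(r"Final Answer:\s*(.+)", reply)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(\w+)\[(.*?)\]", reply)
        if action:
            name, arg = action.groups()
            tool = tools.get(name)
            observation = tool(arg) if tool else f"unknown tool {name}"
            # Feed the result back so the next model call can condition on it.
            transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer


def scripted_llm(transcript: str) -> str:
    # Hypothetical stand-in for a model: act first, then answer from the observation.
    if "Observation:" not in transcript:
        return "Thought: I need to compute this.\nAction: add[2,3]"
    observed = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: I have the result.\nFinal Answer: {observed}"


# Hypothetical tool: adds a comma-separated list of integers.
TOOLS = {"add": lambda s: str(sum(int(x) for x in s.split(",")))}
```

With these stubs, `run_agent(scripted_llm, TOOLS, "What is 2 + 3?")` takes one action step and one answer step. The `max_steps` cap is a simple guard against runs that never terminate.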
Tool Use
- API calling (REST)
- code execution
- web search
- database (SQL) queries
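A database tool can be prototyped with the standard-library `sqlite3` module. The sketch below is an assumption-laden illustration: the `orders` table, its rows, and the `run_sql_tool` helper are invented for the example, and the SELECT-only check is a deliberately naive guard (it also rejects harmless queries that contain a semicolon inside a string literal). A production setup would more likely rely on a read-only connection or database-level permissions.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    # In-memory database with a hypothetical table, used only for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders (customer, total) VALUES (?, ?)",
        [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
    )
    conn.commit()
    return conn


def run_sql_tool(conn: sqlite3.Connection, query: str, max_rows: int = 20) -> dict:
    # Allow a single SELECT statement only, and cap the number of returned rows.
    stripped = query.strip().rstrip(";")
    if not stripped.lower().startswith("select") or ";" in stripped:
        return {"ok": False, "error": "only single SELECT statements are allowed"}
    try:
        cursor = conn.execute(stripped)
        rows = cursor.fetchmany(max_rows)
    except sqlite3.Error as exc:
        # Return the database error as text so the agent can correct its query.
        return {"ok": False, "error": str(exc)}
    columns = [col[0] for col in cursor.description]
    return {"ok": True, "columns": columns, "rows": rows}
```

Capping rows with `max_rows` keeps a broad query from flooding the model's context window with results.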
Frameworks
- LangChain
- Auto-GPT
- LiteLLM
- ToolLLM
- MemGPT
- LlamaIndex
Is Agentic
true
Architectures
- LLM + tools (planner-executor)
- planner-executor with memory hierarchy
- single-agent and multi-agent compositions
Collaboration
- multi-agent orchestration (AutoGen, multi-agent chat)
- model-to-model orchestration (HuggingGPT)
Optimization Features
Token Efficiency
- prompt and prefix tuning
- context summarization
System Optimization
- use of vector DBs to reduce context length
Training Optimization
- instruction tuning on code and multi-turn data
Inference Optimization
- paged attention / memory management (PagedAttention)
- streaming LLM for long contexts (StreamingLLM)
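StreamingLLM is generally described as keeping a few initial "attention sink" tokens plus a sliding window of recent tokens in the KV cache, and evicting everything in between. The sketch below shows only that index-selection rule, under stated assumptions: the function name and the default sizes are illustrative, and the handling of positions inside the cache is omitted.

```python
def streaming_keep_indices(seq_len: int, n_sink: int = 4, window: int = 8) -> list[int]:
    # Return the token positions whose key-value entries stay in the cache.
    if seq_len <= n_sink + window:
        # Everything still fits, so nothing is evicted yet.
        return list(range(seq_len))
    # Keep the first n_sink tokens and the most recent `window` tokens.
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))
```

For example, with `n_sink=2` and `window=3`, a 20-token sequence keeps positions 0, 1, 17, 18, and 19, so cache size stays constant as the sequence grows.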
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey paper: no new experiments or code released here.
- Coverage depends on cited literature and may lag newest preprints.
- High-level recommendations; lacks step-by-step engineering recipes.
When Not To Use
- High-stakes decisions that require verifiable facts and are made without human oversight.
- Robotics requiring low-latency closed-loop visual control without a tailored vision stack.
- Applications that demand strict regulatory audit trails when the agent cannot record evidence provenance.
Failure Modes
- Long reasoning chains produce incorrect steps or 'logic loops'.
- Hallucinations: fluent but unverifiable claims.
- Tool misuse: wrong API calls or malformed parameters.
- Alignment drift when prompts vary subtly or user preferences change.
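One inexpensive mitigation for logic loops is to stop an agent from repeating the same tool call indefinitely. The guard below is a hypothetical sketch rather than a technique from the survey: `LoopGuard` counts identical (tool, arguments) pairs and refuses a call once a repeat limit is exceeded.

```python
import json
from collections import Counter


class LoopGuard:
    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats  # identical calls allowed before blocking
        self.seen = Counter()

    def allow(self, tool: str, args: dict) -> bool:
        # Serialize arguments with sorted keys so equivalent calls map to one key.
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] <= self.max_repeats
```

An agent loop would call `allow` before each tool invocation and, on `False`, either abort or tell the model that the action was already tried. The guard catches exact repeats only; near-duplicate calls with slightly different arguments pass through.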
Core Entities
Models
- GPT-4
- GPT-3.5
- LLaMA
- LLaMA-2
- ToolLLaMA
- USM
- AlphaCode
Metrics
- end-to-end task success rate
- functional correctness
- human vs agent success comparison
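End-to-end task success rate and the human-versus-agent comparison reduce to simple arithmetic over per-task outcomes. In the sketch below the function names are hypothetical, and any outcome lists passed in are the caller's own data, not results from the survey.

```python
def success_rate(outcomes: list[bool]) -> float:
    # Fraction of tasks completed end to end; True marks a fully successful task.
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return sum(outcomes) / len(outcomes)


def human_agent_gap(agent: list[bool], human: list[bool]) -> float:
    # Positive values mean humans succeed more often than the agent on these tasks.
    return success_rate(human) - success_rate(agent)
```

As a made-up example, an agent that succeeds on 1 of 4 tasks while humans succeed on 3 of 4 has success rates of 0.25 and 0.75, a gap of 0.5. Scoring each task as all-or-nothing is what makes the metric end to end: partial progress counts as failure.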
Datasets
- ToolBench
- APIBench
- AgentBench
- WebArena
- HouseHolding
- Web Shopping
- Web Browsing
- BigBench
- MMLU
Benchmarks
- AgentBench
- WebArena
- ToolBench
- APIBench
Context Entities
Models
- BERT
- T5
- BART
- RoBERTa
Metrics
- human annotation
- task success rate
Datasets
- RapidAPI Hub (collected APIs)
Benchmarks
- BigBench
- MMLU

