Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Agentic AI can automate multi-step workflows, connect tools, and keep context, but it also raises real risks: wrong actions, privacy leaks, and higher compute bills. Companies should pilot it with tight guardrails, audit logs, and cost controls.
Summary TLDR
This survey explains how large language models (LLMs) are being wrapped into autonomous agents that plan, use tools, and keep memory. It lays out a simple architecture (perception, LLM brain, memory, action), gives examples (single- and multi-agent flows), and highlights the main technical and governance gaps: verifiable planning, robust long-term memory, multi-agent coordination, safety guardrails, and sustainable inference.
Problem Statement
LLMs are powerful text engines but not full agents. Building safe, reliable systems that can plan, act in the world, remember across sessions, and coordinate multiple roles requires new architectures, evaluation methods, and governance.
Main Contribution
Synthesis of how LLM capabilities extend toward agent-like behavior via reason-act-reflect loops.
An integrative architecture that lists core modules: perception, LLM reasoning/planning, memory, and action execution.
A critical assessment of applications, plus a research agenda covering safety, memory, multi-agent coordination, and sustainable inference.
Key Findings
Agentic behavior arises when LLMs are combined with perception, external memory, and tool execution into a closed-loop reason-act-reflect cycle.
Existing language-model benchmarks can miss cultural and linguistic gaps; one cited Arabic benchmark found that leading models score about 30% on culturally grounded reasoning tasks.
Long multi-step action chains amplify small errors and reduce reliability; non-deterministic model behavior and variable API outputs make runs hard to reproduce.
Agentic systems increase compute and environmental cost because of repeated inference, frequent tool calls, and context growth.
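The finding that long action chains amplify small errors can be made concrete with a little arithmetic. The sketch below is illustrative and not from the survey: it assumes every step succeeds independently with the same probability, which real agents rarely satisfy, and the function name `chain_success_probability` is hypothetical.

```python
def chain_success_probability(step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumption (not from the survey): steps fail independently and share
    the same per-step success probability.
    """
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


if __name__ == "__main__":
    # Show how a fixed per-step success rate decays as the chain grows.
    for n in (1, 5, 10, 20):
        print(n, round(chain_success_probability(0.95, n), 3))
```

Under these assumptions, a step that is right 95% of the time yields a 20-step chain that completes without error only about 36% of the time, which is why step-level validation and checkpoints matter more as workflows get longer.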
Results
Accuracy
Who Should Care
What To Try In 7 Days
Build a simple ReAct-style agent that calls a calculator and a search API; log every tool call.
Add a vector DB as retrieval memory and test answer consistency across 5–10 interactions.
Introduce action-level checkpoints with human approval for any irreversible operation.
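The first step above can be sketched as a small reason-act loop. This is a minimal sketch, not the survey's implementation: `scripted_policy` is a hypothetical stand-in for an LLM call, `search` is a stub with canned results rather than a real search API, and the dictionary format for decisions is an assumption made for illustration.

```python
import ast
import logging
import operator
from typing import Callable, Dict, List, Optional, Tuple

logger = logging.getLogger("agent")

# Arithmetic operators the calculator tool is allowed to evaluate.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / on numeric literals without using eval()."""
    def ev(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))


def search(query: str) -> str:
    """Stub search tool (hypothetical); swap in a real API client."""
    canned = {"hours in a day": "24"}
    return canned.get(query, "no result")


TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator, "search": search}


def scripted_policy(history: List[Tuple[str, str]]) -> Dict[str, str]:
    """Hypothetical stand-in for the LLM: picks the next action from history."""
    observations = [text for kind, text in history if kind == "observation"]
    if len(observations) == 0:
        return {"tool": "search", "input": "hours in a day"}
    if len(observations) == 1:
        return {"tool": "calculator", "input": f"{observations[0]} * 3"}
    return {"final": f"3 days is {observations[1]} hours"}


def run_agent(question: str, policy, tools, max_steps: int = 5):
    """Loop: ask the policy for an action, run the tool, record the call."""
    history: List[Tuple[str, str]] = [("question", question)]
    tool_log: List[Dict[str, object]] = []
    for step in range(1, max_steps + 1):
        decision = policy(history)
        if "final" in decision:
            return decision["final"], tool_log
        name, arg = decision["tool"], decision["input"]
        observation = tools[name](arg)
        entry = {"step": step, "tool": name, "input": arg, "observation": observation}
        tool_log.append(entry)  # every tool call is logged
        logger.info("tool call: %s", entry)
        history.append(("observation", observation))
    return None, tool_log  # step budget exhausted without a final answer
```

Calling `run_agent("How many hours are in 3 days?", scripted_policy, TOOLS)` returns the final answer together with the full tool-call log. The `max_steps` budget is a simple guard against runaway loops; replacing `scripted_policy` with a real model call is where prompt design and output parsing would come in.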
Agent Features
Memory
- short-term (scratchpad)
- retrieval memory (vector DB)
- long-term episodic memory
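As a rough illustration of the retrieval memory tier listed above, the sketch below stores notes and returns the most similar ones for a query. It is a toy under stated assumptions: bag-of-words counts stand in for learned embeddings, an in-process list stands in for a vector DB, and the names `RetrievalMemory`, `embed`, and `cosine` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class RetrievalMemory:
    """Stores text snippets and retrieves the k most similar to a query."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

For example, after adding "the user prefers metric units" and "the deployment region is eu-west", a search for "which units does the user prefer" ranks the first note highest because it shares more words with the query.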
Planning
- reason-act-reflect loop
- chain-of-thought reasoning
- tool-enabled planning
Tool Use
- API calls
- search and retrieval
- calculator and code execution
- robotic actuation
Frameworks
- LangChain
- AutoGen
- ReAct
- Toolformer
Is Agentic
true
Architectures
- single-agent
- multi-agent
- hierarchical
Collaboration
- multi-agent coordination
- agent communication
- role assignment
Optimization Features
Token Efficiency
- context chunking
- retrieval-based context narrowing
Infra Optimization
- lightweight rerankers
- vector DB tuning
Model Optimization
- dynamic model selection
- MoE
System Optimization
- call batching
- step-level validation to avoid loops
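One simple form of step-level validation is to refuse an action the agent has already repeated too often. The sketch below is an assumption about how such a check might look, not a method described in the survey; `LoopGuard` and its `max_repeats` threshold are hypothetical.

```python
from collections import Counter
from typing import Tuple


class LoopGuard:
    """Rejects a (tool, input) pair once it has been attempted too many times."""

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def allow(self, tool: str, tool_input: str) -> bool:
        """Record the attempt and report whether it is still within budget."""
        key: Tuple[str, str] = (tool, tool_input)
        self._seen[key] += 1
        return self._seen[key] <= self.max_repeats
```

An agent loop would call `guard.allow(tool, tool_input)` before each tool call and stop, escalate, or re-plan when it returns `False`. Exact-match detection misses loops where the input changes slightly each time, so this is a floor rather than a complete safeguard.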
Training Optimization
- instruction tuning
- RLHF (for safer, goal-directed behavior)
Inference Optimization
- caching tool outputs
- context compression
- energy-aware inference
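Caching tool outputs, the first item above, can be sketched as a thin wrapper around any tool function. This is an illustrative sketch with hypothetical names (`CachedTool`, `misses`); it assumes the tool is deterministic and its results do not go stale, which does not hold for live data such as prices or weather.

```python
from typing import Callable, Dict


class CachedTool:
    """Wraps a tool so repeated calls with the same input reuse the first result."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self.fn = fn
        self.cache: Dict[str, str] = {}
        self.misses = 0  # number of times the underlying tool actually ran

    def __call__(self, arg: str) -> str:
        if arg not in self.cache:
            self.misses += 1
            self.cache[arg] = self.fn(arg)
        return self.cache[arg]
```

Wrapping a tool as `cached = CachedTool(expensive_tool)` lets the agent call `cached(arg)` freely while the `misses` counter shows how many real calls were made. For tools with time-sensitive outputs, an expiry policy would be needed on top of this.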
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Survey-style chapter: conceptual and synthetic, not an empirical method paper.
- Few new quantitative experiments or benchmarks provided.
- Recommendations are broad and require follow-up technical work for implementation details.
When Not To Use
- Do not deploy agentic systems for irreversible, high-stakes actions without strict human approval.
- Avoid relying on current persistent memory for identity-critical tasks due to drift and privacy risk.
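The human-approval requirement for irreversible actions can be enforced in code rather than left to convention. The sketch below is one possible shape, with hypothetical names throughout: `approve` is whatever callback asks a person (a CLI prompt, a ticket, a chat message), and how an action gets flagged as irreversible is left to the caller.

```python
from typing import Callable


def execute_with_approval(
    action: str,
    irreversible: bool,
    run: Callable[[str], str],
    approve: Callable[[str], bool],
) -> str:
    """Run an action, but require explicit approval first if it is irreversible."""
    if irreversible and not approve(action):
        # The action is never executed without a positive approval.
        return f"blocked: {action}"
    return run(action)
```

Reversible actions pass straight through, so the approval cost is paid only where it matters. Logging both the approval decision and the blocked action would give the audit trail that the guardrail recommendations call for.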
Failure Modes
- Error amplification across long multi-step workflows.
- Non-deterministic outputs causing inconsistent behavior.
- Hallucinated or stale memories leading to wrong actions.
- Coordination breakdowns in multi-agent teams (deadlocks, cascading failures).
Core Entities
Models
- GPT-3
- PaLM
- LLaMA
- GPT-4
- BERT
- GPT-2
Metrics
- Accuracy
- throughput
- reliability
Datasets
- culturally grounded Arabic reasoning benchmark (ref [33])
Benchmarks
- Arabic cultural reasoning benchmark (ref [33])
Context Entities
Models
- MoE
Metrics
- energy / compute cost

