Survey of how LLMs become autonomous agents, the core architecture, and the research gaps to make them safe and practical.

January 6, 2026 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Nadia Sibai, Yara Ahmed, Serry Sibaee, Sawsan AlHalawani, Adel Ammar, Wadii Boulila

Why It Matters For Business

Agentic AI can automate multi-step workflows, connect tools, and keep context. But it raises real risks (wrong actions, privacy leaks, higher compute bills). Companies must pilot with tight guardrails, audit logs, and cost controls.

Summary TLDR

This survey explains how large language models (LLMs) are being wrapped into autonomous agents that plan, use tools, and keep memory. It lays out a simple architecture (perception, LLM brain, memory, action), gives examples (single- and multi-agent flows), and highlights the main technical and governance gaps: verifiable planning, robust long-term memory, multi-agent coordination, safety guardrails, and sustainable inference.

Problem Statement

LLMs are powerful text engines but not full agents. Building safe, reliable systems that can plan, act in the world, remember across sessions, and coordinate multiple roles requires new architectures, evaluation methods, and governance.

Main Contribution

Synthesis of how LLM capabilities extend toward agent-like behavior via reason-act-reflect loops.

An integrative architecture that lists core modules: perception, LLM reasoning/planning, memory, and action execution.

A critical assessment of applications, plus a research agenda covering safety, memory, multi-agent coordination, and sustainable inference.

Key Findings

Agentic behavior arises when LLMs are combined with perception, external memory, and tool execution into a closed-loop reason-act-reflect cycle.

Existing language-model benchmarks can miss cultural and linguistic gaps; one cited Arabic benchmark found leading models score about 30% on culturally grounded reasoning tasks.

Numbers: ≈30% accuracy on Arabic cultural reasoning (ref [33])

Long multi-step action chains amplify small errors and reduce reliability; non-deterministic behaviors and variable API outputs make repeatability hard.

Agentic systems increase compute and environmental cost because of repeated inference, frequent tool calls, and context growth.
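
The error-amplification finding above can be made concrete with a back-of-envelope calculation. This is only a sketch under a simplifying assumption that is not from the survey: every step in an n-step chain succeeds independently with the same probability p. The values 0.95, 0.99, and n = 20 are illustrative, not measured.

```latex
P(\text{chain succeeds}) = \prod_{i=1}^{n} p = p^{n},
\qquad
0.95^{20} \approx 0.36,
\qquad
0.99^{20} \approx 0.82
```

Under this assumption, a step that is right 95% of the time yields a 20-step workflow that finishes correctly only about a third of the time, which is why per-step validation matters more as chains get longer.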

Results

Accuracy

Value: 30%

Who Should Care

What To Try In 7 Days

Build a simple ReAct-style agent that calls a calculator and a search API; log every tool call.

Add a vector DB for retrieval memory and test consistency across 5–10 interactions.

Introduce action-level checkpoints with human approval for any irreversible operation.
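
The logging and approval steps above could be prototyped with a thin wrapper around tool calls. The following is a minimal sketch, not a method from the survey: `ToolRunner`, `ApprovalDenied`, and the tool names `add` and `delete_record` are hypothetical, and the `approve` callback stands in for whatever human review channel a team actually uses.

```python
import json
import time
from typing import Any, Callable, Dict, List, Set


class ApprovalDenied(Exception):
    """Raised when the reviewer rejects an irreversible action."""


class ToolRunner:
    """Hypothetical wrapper: logs every tool call and gates irreversible ones."""

    def __init__(self, approve: Callable[[str, Dict[str, Any]], bool]):
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._irreversible: Set[str] = set()
        self._approve = approve  # human-in-the-loop callback (assumption)
        self.audit_log: List[Dict[str, Any]] = []

    def register(self, name: str, fn: Callable[..., Any], irreversible: bool = False) -> None:
        self._tools[name] = fn
        if irreversible:
            self._irreversible.add(name)

    def call(self, name: str, **kwargs: Any) -> Any:
        # Log before executing so that denied and failed calls are recorded too.
        entry: Dict[str, Any] = {"ts": time.time(), "tool": name, "args": kwargs}
        self.audit_log.append(entry)
        if name in self._irreversible and not self._approve(name, kwargs):
            entry["status"] = "denied"
            raise ApprovalDenied(name)
        try:
            entry["result"] = self._tools[name](**kwargs)
            entry["status"] = "ok"
        except Exception as exc:
            entry["status"] = "error"
            entry["error"] = repr(exc)
            raise
        return entry["result"]


if __name__ == "__main__":
    # Demo reviewer that denies everything; a real one would ask a person.
    runner = ToolRunner(approve=lambda name, args: False)
    runner.register("add", lambda a, b: a + b)
    runner.register("delete_record", lambda record_id: f"deleted {record_id}", irreversible=True)
    runner.call("add", a=2, b=3)
    try:
        runner.call("delete_record", record_id=7)
    except ApprovalDenied:
        pass
    print(json.dumps(runner.audit_log, indent=2))
```

Writing the log entry before the approval check is a deliberate choice here: the audit trail then shows what the agent attempted, not only what it was allowed to do.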

Agent Features

Memory

  • short-term (scratchpad)
  • retrieval memory (vector DB)
  • long-term episodic memory
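
To illustrate the retrieval tier listed above, here is a toy in-memory store. It is a sketch only: word counts stand in for learned embeddings, a Python list stands in for a vector DB, and `RetrievalMemory`, `embed`, and `cosine` are hypothetical names that do not come from the survey or any library.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class RetrievalMemory:
    """Stores text snippets and returns the ones most similar to a query."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> List[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = RetrievalMemory()
memory.add("The user prefers invoices in PDF format")
memory.add("The staging server is in Frankfurt")
top = memory.search("which format for invoices?", k=1)
```

Swapping `embed` for a real embedding model and the list for a vector index would change the quality of retrieval but not the shape of the interface.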

Planning

  • reason-act-reflect loop
  • chain-of-thought reasoning
  • tool-enabled planning
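
The reason-act-reflect loop named above can be sketched in a few lines. This is an illustration under stated assumptions, not the survey's implementation: `scripted_policy` is a hard-coded stand-in for the LLM, `calculator` is a deliberately tiny tool, and `run_agent` is a hypothetical name.

```python
import operator
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (action name, action input)

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(expression: str) -> str:
    """Tiny arithmetic tool: accepts 'a op b' with op in + - * /."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))


def scripted_policy(goal: str, history: List[Dict[str, str]]) -> Step:
    """Stand-in for the LLM (assumption): chooses the next action from the history."""
    if not history:
        return ("calculator", "12 * 7")
    last = history[-1]["observation"]
    if last.startswith("error"):
        return ("finish", "could not compute")
    return ("finish", last)


def run_agent(
    goal: str,
    policy: Callable[[str, List[Dict[str, str]]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[str, List[Dict[str, str]]]:
    history: List[Dict[str, str]] = []
    for _ in range(max_steps):
        action, arg = policy(goal, history)       # reason: decide the next step
        if action == "finish":
            return arg, history
        try:
            observation = tools[action](arg)      # act: execute the chosen tool
        except Exception as exc:
            observation = f"error: {exc}"
        # reflect: the observation is fed back into the next policy call
        history.append({"action": action, "input": arg, "observation": observation})
    return "stopped: step budget exhausted", history


answer, trace = run_agent("What is 12 * 7?", scripted_policy, {"calculator": calculator})
```

The `max_steps` budget is the simplest guard against a policy that never emits `finish`; tool errors are returned as observations so the policy can react to them instead of crashing the loop.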

Tool Use

  • API calls
  • search and retrieval
  • calculator and code execution
  • robotic actuation

Frameworks

  • LangChain
  • AutoGen
  • ReAct
  • Toolformer

Is Agentic

true

Architectures

  • single-agent
  • multi-agent
  • hierarchical

Collaboration

  • multi-agent coordination
  • agent communication
  • role assignment

Optimization Features

Token Efficiency

  • context chunking
  • retrieval-based context narrowing
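
One way to picture the two token-efficiency items above is to split a long document into overlapping chunks and pass only the query-relevant ones to the model. The sketch below assumes word counts as a rough proxy for tokens and uses plain term overlap for relevance; `chunk_words` and `narrow_context` are hypothetical helper names.

```python
from typing import List


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def narrow_context(query: str, chunks: List[str], budget_words: int) -> List[str]:
    """Keep the chunks that share the most terms with the query, within a word budget."""
    terms = set(query.lower().split())

    def score(chunk: str) -> int:
        return sum(word in terms for word in chunk.lower().split())

    picked: List[str] = []
    used = 0
    for chunk in sorted(chunks, key=score, reverse=True):
        n = len(chunk.split())
        if score(chunk) > 0 and used + n <= budget_words:
            picked.append(chunk)
            used += n
    return picked
```

A production setup would more likely count real tokens and rank with embeddings or a reranker; the point here is only that the prompt receives a bounded, relevance-filtered slice of the source text.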

Infra Optimization

  • lightweight rerankers
  • vector DB tuning

Model Optimization

  • dynamic model selection
  • MoE
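
Dynamic model selection, as listed above, amounts to routing each request to the cheapest model expected to handle it. The sketch below is purely illustrative: the tier names, relative costs, and the difficulty heuristic are made-up assumptions, not values from the survey or any provider.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical model identifier
    relative_cost: float  # made-up cost ratio, for illustration only
    max_difficulty: int   # highest difficulty score this tier should handle


# Ordered from cheapest to most expensive.
TIERS: List[ModelTier] = [
    ModelTier("small-model", 1.0, 1),
    ModelTier("medium-model", 5.0, 2),
    ModelTier("large-model", 20.0, 3),
]


def estimate_difficulty(task: str, needs_tools: bool) -> int:
    """Crude heuristic (assumption): long prompts and tool use raise difficulty."""
    score = 1
    if len(task.split()) > 40:
        score += 1
    if needs_tools:
        score += 1
    return score


def select_model(task: str, needs_tools: bool = False) -> ModelTier:
    """Return the cheapest tier whose ceiling covers the estimated difficulty."""
    difficulty = estimate_difficulty(task, needs_tools)
    for tier in TIERS:
        if tier.max_difficulty >= difficulty:
            return tier
    return TIERS[-1]
```

In practice the difficulty estimate is the hard part; teams often start with rules like these and later replace them with a learned router or with escalation on low-confidence answers.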

System Optimization

  • call batching
  • step-level validation to avoid loops
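
Step-level validation to avoid loops can be as simple as refusing a step when the agent keeps repeating the same call or exceeds a step budget. The following is a minimal sketch; `LoopGuard`, `LoopDetected`, and the default limits are hypothetical choices, not recommendations from the survey.

```python
import json
from collections import Counter
from typing import Any, Dict


class LoopDetected(RuntimeError):
    """Raised when the agent exceeds its step budget or repeats an identical call too often."""


class LoopGuard:
    def __init__(self, max_steps: int = 20, max_repeats: int = 2) -> None:
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self._seen: Counter = Counter()

    def check(self, tool: str, args: Dict[str, Any]) -> None:
        """Call once before each tool invocation; raises LoopDetected to halt the run."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise LoopDetected(f"step budget of {self.max_steps} exceeded")
        # Canonical key so that argument order does not hide a repeat.
        key = (tool, json.dumps(args, sort_keys=True))
        self._seen[key] += 1
        if self._seen[key] > self.max_repeats:
            raise LoopDetected(f"{tool} called {self._seen[key]} times with identical arguments")
```

Calling `guard.check(tool, args)` at the top of each agent step turns a silent runaway loop into an explicit, loggable failure.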

Training Optimization

  • instruction tuning
  • RLHF (for safer, goal-directed behavior)

Inference Optimization

  • caching tool outputs
  • context compression
  • energy-aware inference
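
Caching tool outputs, the first item above, avoids paying for the same external call twice. Here is a small sketch with a time-to-live so stale results expire; `ToolCache`, its parameters, and the 300-second default are assumptions chosen for illustration.

```python
import json
import time
from typing import Any, Callable, Dict, Tuple


class ToolCache:
    """Caches tool results keyed by tool name and arguments, with a time-to-live."""

    def __init__(self, ttl_seconds: float = 300.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, tool_name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        key = (tool_name, json.dumps(kwargs, sort_keys=True))
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            self.hits += 1
            return cached[1]
        self.misses += 1
        result = fn(**kwargs)  # only reached on a miss or an expired entry
        self._store[key] = (now, result)
        return result
```

This only suits tools whose output is safe to reuse for the TTL window (lookups, searches); calls with side effects should bypass the cache.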

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Survey-style chapter: conceptual and synthetic, not an empirical method paper.
  • Few new quantitative experiments or benchmarks provided.
  • Recommendations are broad and require follow-up technical work for implementation details.

When Not To Use

  • Do not deploy agentic systems for irreversible, high-stakes actions without strict human approval.
  • Avoid relying on current persistent memory for identity-critical tasks due to drift and privacy risk.

Failure Modes

  • Error amplification across long multi-step workflows.
  • Non-deterministic outputs causing inconsistent behavior.
  • Hallucinated or stale memories leading to wrong actions.
  • Coordination breakdowns in multi-agent teams (deadlocks, cascading failures).

Core Entities

Models

  • GPT-3
  • PaLM
  • LLaMA
  • GPT-4
  • BERT
  • GPT-2

Metrics

  • accuracy
  • throughput
  • reliability

Datasets

  • culturally grounded Arabic reasoning benchmark (ref [33])

Benchmarks

  • Arabic cultural reasoning benchmark (ref [33])

Context Entities

Models

  • MoE

Metrics

  • energy / compute cost