LLMs are powerful text engines but lack the grounded action and world models needed for true AGI

July 7, 2023 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.45

Cost Impact Score

0.4

Citation Count

10

Authors

Yuxi Ma, Chi Zhang, Song-Chun Zhu

Why It Matters For Business

LLMs are strong language tools but not reliable autonomous reasoners; businesses should treat them as assistants, validate critical outputs, and invest in grounded data and robust evaluation before automating decisions.

Summary TLDR

This perspective argues that current large language models (LLMs) are powerful pattern predictors but not artificial general intelligence (AGI). The authors review evaluations showing LLMs excel at language-style exams yet fail on many reasoning, causal, and abstract tasks. They attribute this gap to missing grounding: LLMs lack active interaction, a value-driven task generator, and a world model that links symbols to real-world effects. The paper recommends transparent evaluation, richer embodied environments for learning, and architectures that unify knowing and acting.

Problem Statement

LLM benchmarks and scores overstate model understanding because LLMs learn statistical patterns from text without grounding in action. Evaluation leakage, shortcut learning, and scale-based myths hide failures in reasoning, causal inference, and concept learning, so current systems cannot form the action-linked world models AGI requires.

Main Contribution

Survey and synthesize failure modes across standardized tests and ability-oriented benchmarks for LLMs.

Argue that AGI requires four agent traits: the ability to perform an unlimited range of tasks, autonomous task generation, a value system, and a grounded world model.

Propose a 'unity of knowing and acting': active interaction and trial-and-error are needed to form robust concepts and knowledge.

Recommend research directions: transparent evaluation, affordance-rich interactive environments, and cognitive architectures unifying action and knowledge.

Key Findings

LLMs score highly on many language-style standardized exams but fall behind on reasoning-heavy subjects.

Numbers: GPT-4 scores GRE Verbal ~169/170 (~99th percentile) and SAT Math ~700/800 (~89th percentile); poor on Gaokao/JEE (see AGIEval/JEEBench)

When semantics are removed, LLMs' symbolic reasoning collapses to near-random performance.

Numbers: Symbolic reasoning drops to ~50% or lower (often near random) in Tang et al. setups

LLMs fail many reasoning and abstract tasks even with large models and prompting.

Numbers: On MATH-level problems and abstract benchmarks, GPT-4 often reaches ~40–60% accuracy or lower, depending on the task; many bas

Evaluation contamination and metric choices can create a misleading impression of emergent abilities.

Numbers: GPT-4's AGIEval average is ~61.3% under few-shot CoT settings; dataset overlap risk is discussed only qualitatively

Shortcut learning and inverse scaling show larger models sometimes exploit spurious cues and perform worse on adversarial tests.

Numbers: Inverse scaling observed across 11 datasets; larger models show a bigger performance drop on anti-shortcut tests (Tang et

Results

GPT-4 exam percentiles

Value: GRE Verbal 169/170 (~99th); SAT Math 700/800 (~89th)

Baseline: Human percentiles

AGIEval average (few-shot CoT)

Value: GPT-4 ~61.3% average

Baseline: Human avg ~67%

JEEBench aggregated score (GPT-4)

Value: Total ~0.316

Baseline: Random ~0.102

Accuracy (hard math problems)

Value: Around 40% on hard math problems

Baseline: Human/top performers higher

Corr2Cause (causal inference)

Value: Best F1 ~29.08 (GPT-4)

Baseline: Random baselines F1 ~13–20
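
One way to compare the results above on a common footing is to rescale each score so that the random baseline maps to 0 and a perfect score maps to 1. The sketch below does this for the JEEBench figures, assuming scores lie on a 0–1 scale (an assumption, not stated in this summary). The helper name `chance_corrected` is illustrative.

```python
def chance_corrected(score, chance, ceiling=1.0):
    """Rescale a score so that chance maps to 0 and the ceiling maps to 1."""
    if ceiling <= chance:
        raise ValueError("ceiling must exceed chance")
    return (score - chance) / (ceiling - chance)


# JEEBench figures quoted above; the 0-1 scale (ceiling=1.0) is an assumption.
jeebench_gpt4 = chance_corrected(score=0.316, chance=0.102)
print(f"JEEBench, chance-corrected: {jeebench_gpt4:.3f}")
```

Under that assumption GPT-4 covers roughly a quarter of the distance between random guessing and a perfect score on JEEBench.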

Who Should Care

What To Try In 7 Days

Audit your evaluation sets for overlap with public web data and remove leaked examples.
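
A minimal sketch of such an audit, assuming you have a sample of public web text to compare against: it flags evaluation examples that share a large fraction of their word n-grams with the reference corpus. The function names, the n-gram length, and the threshold are illustrative assumptions, not values from the paper.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_leaked(eval_examples, reference_docs, n=8, threshold=0.5):
    """Return indices of eval examples whose n-grams overlap heavily with the reference corpus."""
    reference = set()
    for doc in reference_docs:
        reference |= ngrams(doc, n)

    flagged = []
    for index, example in enumerate(eval_examples):
        grams = ngrams(example, n)
        if not grams:
            continue  # example is shorter than n words; cannot be judged this way
        overlap = len(grams & reference) / len(grams)
        if overlap >= threshold:
            flagged.append(index)
    return flagged
```

Exact n-gram matching only catches verbatim or near-verbatim leaks; paraphrased or translated copies need a fuzzier comparison.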

Run adversarial anti-shortcut tests (trigger-word, style changes) against your models.
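
A trigger-word test of this kind can be sketched as a flip-rate measurement: apply a perturbation that should not change the correct label and count how often the prediction changes. Here `predict` stands for whatever callable wraps your model, `toy_predict` is a toy stand-in that has deliberately learned a spurious cue, and the trigger word is an arbitrary choice; all three are assumptions made for illustration.

```python
def flip_rate(predict, texts, perturb):
    """Fraction of inputs whose predicted label changes after a label-preserving perturbation."""
    if not texts:
        return 0.0
    flips = sum(predict(text) != predict(perturb(text)) for text in texts)
    return flips / len(texts)


def add_trigger(text, trigger="honestly"):
    """Prepend a word that should be irrelevant to the label (hypothetical trigger)."""
    return f"{trigger}, {text}"


def toy_predict(text):
    """Toy classifier that treats the word 'honestly' as a cue for 'negative'."""
    lowered = text.lower()
    if "honestly" in lowered:
        return "negative"
    return "positive" if "great" in lowered else "negative"


reviews = ["A great film", "Great acting throughout", "A dull plot"]
print(flip_rate(toy_predict, reviews, add_trigger))
```

A flip rate well above zero on perturbations like this suggests the model is keying on surface cues rather than content.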

Add simple symbolic and causal probes to check reasoning gaps in your pipelines.
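
One simple symbolic probe, loosely in the spirit of the semantics-removal finding above, is to rewrite a reasoning prompt so that meaningful words become abstract symbols and then compare model accuracy on the two versions. The sketch below covers only the rewriting step; `desemanticize` and the `S1, S2, ...` naming scheme are illustrative assumptions, not the procedure used in the cited studies.

```python
import re


def desemanticize(text, vocabulary):
    """Replace each listed word with an abstract symbol (S1, S2, ...), consistently across the text."""
    mapping = {word.lower(): f"S{i + 1}" for i, word in enumerate(vocabulary)}
    if not mapping:
        return text, mapping
    # Match whole words only, ignoring case.
    alternatives = "|".join(re.escape(word) for word in mapping)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    rewritten = pattern.sub(lambda match: mapping[match.group(0).lower()], text)
    return rewritten, mapping


prompt = "If Alice is taller than Bob and Bob is taller than Carol, is Alice taller than Carol?"
symbolic, mapping = desemanticize(prompt, ["Alice", "Bob", "Carol", "taller"])
print(symbolic)
```

If a model answers the original prompt correctly but fails on the symbolic version, its answer likely leaned on familiar wording rather than on the logical structure.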

Agent Features

Memory

  • Statistical memory from training data; no grounded episodic memory

Planning

  • Limited text-level planning; no embodied action planning

Tool Use

  • Textual simulation of tools; no physical affordance-driven tool use

Frameworks

  • Value system (proposed) as driver for autonomous task generation
  • World model (proposed) linking symbols to effects

Architectures

  • LLMs (transformer-based) as statistical predictors

Collaboration

  • Multi-agent collaboration is not discussed; chained conversational prompting can get stuck in loops

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Perspective paper: no new experimental data provided by authors.
  • Relies on cited empirical studies; individual benchmark details vary by source.
  • Proposals (AGI-verse, cognitive architectures) are conceptual and require validation.

When Not To Use

  • Not a source of new algorithms or code for immediate deployment.
  • Not suitable as sole evidence for model capability in production without further testing.

Failure Modes

  • Hallucination and surface-level fluency without grounding
  • Shortcut learning triggered by spurious cues
  • Inverse scaling where larger models perform worse on some tasks
  • Evaluation contamination from training data overlap

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • text-davinci-003
  • LLaMA
  • Alpaca
  • OPT variants

Metrics

  • Accuracy
  • F1
  • percentile scores

Datasets

  • AGIEval
  • MATH
  • JEEBench
  • Gaokao
  • Corr2Cause
  • Only Connect
  • ACRE/ARC/BIG-Bench/RAVEN

Benchmarks

  • Standardized tests (SAT/GRE/LSAT/Bar/Gaokao/JEE)
  • Ability-oriented benchmarks (math, logic, causal, abstract, ToM, compositionality)

Context Entities

Models

  • BERT-based variants
  • RoBERTa
  • DeBERTa

Datasets

  • SST2 (anti-shortcut tests)
  • VNHSGE (Vietnamese high school exam)