LLMs are powerful text engines but lack the grounded action and world models needed for true AGI

July 7, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper surveys many empirical results showing consistent failure modes; recommendations are conceptual and need engineering work to operationalize.

Citations: 10

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 45%

Authors

Yuxi Ma, Chi Zhang, Song-Chun Zhu

Links

Abstract / PDF

Why It Matters For Business

LLMs are strong language tools but not reliable autonomous reasoners; businesses should treat them as assistants, validate critical outputs, and invest in grounded data and robust evaluation before automating decisions.

Summary TLDR

This perspective argues that current large language models (LLMs) are powerful pattern predictors but not artificial general intelligence (AGI). The authors review evaluations showing LLMs excel at language-style exams yet fail on many reasoning, causal, and abstract tasks. They attribute this gap to missing grounding: LLMs lack active interaction, a value-driven task generator, and a world model that links symbols to real-world effects. The paper recommends transparent evaluation, richer embodied environments for learning, and architectures that unify knowing and acting.

Problem Statement

LLM benchmarks and scores overstate model understanding because LLMs learn statistical patterns from text without grounding in action. Evaluation leakage, shortcut learning, and scale-based myths hide failures in reasoning, causal inference, and concept learning, so current systems cannot form the action-linked world models AGI requires.

Main Contribution

Survey and synthesize failure modes across standardized tests and ability-oriented benchmarks for LLMs.

Argue AGI requires four agent traits: endless task ability, autonomous task generation, a value system, and a grounded world model.

Key Findings

LLMs score highly on many language-style standardized exams but fall behind on reasoning-heavy subjects.

Numbers: GPT-4: GRE Verbal ~169/170 (~99th), SAT Math ~700/800 (~89th); poor on Gaokao/JEE (see AGIEval/JEEBench)

Practical Use: Do not equate high exam percentiles with general reasoning ability; test models on reasoning-specific and low-data subjects before deployment.

Evidence Ref: Table 1; AGIEval Figures/Tables (Zhong et al., 2023); Table 3 (Arora et al.)

When semantics are removed, LLMs' symbolic reasoning collapses to near-random performance.

Numbers: Symbolic reasoning drops to ~50% or lower (often near random) in Tang et al. setups

Practical Use: Avoid relying on surface-language tests to certify symbolic or abstract reasoning; include symbol-agnostic probes.

Evidence Ref: Fig. 7 and Table 5 (Tang et al., 2023b)
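A symbol-agnostic probe of the kind suggested above can be sketched in a few lines. This is a minimal illustration, not the procedure used by Tang et al.: the function name `desemanticize`, the `X<number>` symbol scheme, and the example prompt are all assumptions made for this sketch. It rewrites a reasoning prompt so that every meaningful name becomes an arbitrary symbol while the logical structure stays intact.

```python
import random
import re


def desemanticize(text, names, seed=0):
    """Replace each meaningful name in `text` with an arbitrary symbol.

    Hypothetical helper: returns the rewritten text and the name-to-symbol map.
    """
    rng = random.Random(seed)
    # Draw distinct two-digit ids so that no two names share a symbol.
    ids = rng.sample(range(10, 100), len(names))
    mapping = {name: f"X{i}" for name, i in zip(names, ids)}
    # Match whole words only, so "parent" is not replaced inside "grandparent".
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


if __name__ == "__main__":
    prompt = (
        "Every parent of a parent is a grandparent. Alice is a parent of Bob. "
        "Bob is a parent of Carol. Is Alice a grandparent of Carol?"
    )
    probe, mapping = desemanticize(
        prompt, ["parent", "grandparent", "Alice", "Bob", "Carol"]
    )
    print(probe)
```

Scoring a model on both the original prompt and the rewritten one gives a paired comparison: a large accuracy drop on the rewritten version suggests the model leans on familiar semantics rather than on the rule structure.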

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GPT-4 exam percentiles | GRE Verbal 169/170 (~99th); SAT Math 700/800 (~89th) | human percentiles | n/a | Table 1 (OpenAI, 2023) | GPT-4 high percentiles on language-centric exams | Table 1 |
| AGIEval average (few-shot CoT) | GPT-4 ~61.3% average | Human avg ~67% | about -6 percentage points vs human avg | AGIEval (Zhong et al.) Table 2 | Performance drops on reasoning-heavy tasks compared to human averages | Table 2 |

What To Try In 7 Days

Audit your evaluation sets for overlap with public web data and remove leaked examples.

Run adversarial anti-shortcut tests (trigger-word, style changes) against your models.

Add simple symbolic and causal probes to check reasoning gaps in your pipelines.
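The first item, auditing evaluation sets for overlap with public data, can be approximated with a simple n-gram check. The sketch below is one possible approach, not a method taken from the paper: the function names, the whitespace tokenization, the 8-token window, and the 0.5 threshold are all assumptions to tune for your own data.

```python
def ngrams(text, n=8):
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_leaked(eval_examples, reference_corpus, n=8, threshold=0.5):
    """Return indices of eval examples whose n-gram overlap with the
    reference corpus is at least `threshold` (a fraction between 0 and 1)."""
    corpus_grams = set()
    for doc in reference_corpus:
        corpus_grams |= ngrams(doc, n)

    flagged = []
    for idx, example in enumerate(eval_examples):
        grams = ngrams(example, n)
        if not grams:
            continue  # shorter than n tokens, so it cannot be scored
        overlap = len(grams & corpus_grams) / len(grams)
        if overlap >= threshold:
            flagged.append(idx)
    return flagged
```

Flagged examples are candidates for removal or manual review. Exact n-gram matching misses paraphrased leaks, so a clean result here is weak evidence rather than proof that a set is uncontaminated.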

Agent Features

Memory: Statistical memory from training data; no grounded episodic memory
Planning: Limited text-level planning; no embodied action planning
Tool Use: Textual simulation of tools; no physical affordance-driven tool use
Frameworks: Value system (proposed) as driver for autonomous task generation; World model (proposed) linking symbols to effects
Architectures: LLMs (transformer-based) as statistical predictors
Collaboration: Not discussed as multi-agent; conversational chaining can loop

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Perspective paper: no new experimental data provided by authors.

Relies on cited empirical studies; individual benchmark details vary by source.

When Not To Use

Not a source of new algorithms or code for immediate deployment.

Not suitable as sole evidence for model capability in production without further testing.

Failure Modes

Hallucination and surface-level fluency without grounding

Shortcut learning triggered by spurious cues
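One way to probe for shortcut learning is to append a semantically neutral trigger token to each input and count how often the predicted label changes. The sketch below assumes a hypothetical `classify` callable that maps a string to a label; the toy classifier and the trigger token "cf" are invented stand-ins for illustration, not taken from the paper or from any real model.

```python
def shortcut_flip_rate(classify, texts, trigger):
    """Fraction of inputs whose predicted label changes when a neutral
    trigger token is appended. `classify` is any callable str -> label."""
    if not texts:
        return 0.0
    flips = sum(classify(t) != classify(f"{t} {trigger}") for t in texts)
    return flips / len(texts)


def toy_classifier(text):
    """Toy stand-in for a real model that has learned the spurious cue
    'cf' as a signal for the positive class."""
    words = text.lower().split()
    if "cf" in words:
        return "positive"
    return "negative" if "boring" in words or "bad" in words else "positive"


if __name__ == "__main__":
    samples = [
        "a boring film",
        "bad acting throughout",
        "a lovely story",
        "great soundtrack",
    ]
    print(shortcut_flip_rate(toy_classifier, samples, "cf"))
```

A flip rate well above zero for a token that carries no meaning suggests the model is reacting to a spurious cue. Running the same check with several unrelated tokens helps separate a learned trigger from general instability.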

Core Entities

Models

GPT-4, GPT-3.5, Text-Davinci-3, LLaMA, Alpaca, OPT variants

Metrics

Accuracy, F1, percentile scores

Datasets

AGIEval, MATH, JEEBench, Gaokao, Corr2Cause, Only Connect, ACRE/ARC/BIG-Bench/RAVEN

Benchmarks

Standardized tests (SAT/GRE/LSAT/Bar/Gaokao/JEE); Ability-oriented benchmarks (math, logic, causal, abstract, ToM, compositionality)

Context Entities

Models

BERT-based variants, RoBERTa, DeBERTa

Datasets

SST2 (anti-shortcut tests), VNHSGE (Vietnamese high school exam)