LLMs are powerful text engines but lack the grounded action and world models needed for true AGI

Overview

Decision Snapshot: Needs Validation

The paper surveys many empirical results showing consistent failure modes; recommendations are conceptual and need engineering work to operationalize.

Citations: 10

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 45%

Authors

Yuxi Ma, Chi Zhang, Song-Chun Zhu

Links

Abstract / PDF

Why It Matters For Business

LLMs are strong language tools but not reliable autonomous reasoners; businesses should treat them as assistants, validate critical outputs, and invest in grounded data and robust evaluation before automating decisions.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, CEO, Founder

Summary TLDR

This perspective argues that current large language models (LLMs) are powerful pattern predictors but not artificial general intelligence (AGI). The authors review evaluations showing LLMs excel at language-style exams yet fail on many reasoning, causal, and abstract tasks. They attribute this gap to missing grounding: LLMs lack active interaction, a value-driven task generator, and a world model that links symbols to real-world effects. The paper recommends transparent evaluation, richer embodied environments for learning, and architectures that unify knowing and acting.

Problem Statement

LLM benchmarks and scores overstate model understanding because LLMs learn statistical patterns from text without grounding in action. Evaluation leakage, shortcut learning, and scale-based myths hide failures in reasoning, causal inference, and concept learning, so current systems cannot form the action-linked world models AGI requires.

Main Contribution

Survey and synthesize failure modes across standardized tests and ability-oriented benchmarks for LLMs.

Argue that AGI requires four agent traits: the ability to perform an unbounded range of tasks, autonomous task generation, a value system, and a grounded world model.

Key Findings

LLMs score highly on many language-style standardized exams but fall behind on reasoning-heavy subjects.

Numbers: GPT-4 scores GRE Verbal ~169/170 (~99th percentile) and SAT Math ~700/800 (~89th percentile), but performs poorly on Gaokao/JEE (see AGIEval/JEEBench)

Practical Use: Do not equate high exam percentiles with general reasoning ability; test models on reasoning-specific and low-data subjects before deployment.

Evidence Ref: Table 1; AGIEval Figures/Tables (Zhong et al., 2023); Table 3 (Arora et al.)

When semantics are removed, LLMs' symbolic reasoning collapses to near-random performance.

Numbers: Symbolic reasoning drops to ~50% or lower (often near random) in Tang et al. setups

Practical Use: Avoid relying on surface-language tests to certify symbolic or abstract reasoning; include symbol-agnostic probes.

Evidence Ref: Fig. 7 and Table 5 (Tang et al., 2023b)
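One way to build a symbol-agnostic probe is to replace the meaningful terms in a reasoning item with arbitrary symbols, then compare model accuracy on the original and symbolized versions. The sketch below is a minimal illustration of that idea, not the setup used by Tang et al.; the function name `symbolize`, the `S1, S2, ...` naming scheme, and the example item are assumptions made for illustration.

```python
import re

def symbolize(text, terms, prefix="S"):
    """Replace each term with an abstract symbol, consistently across the text.

    `terms` is a hand-picked list of content words (hypothetical input);
    the function returns the rewritten text and the term-to-symbol mapping.
    """
    mapping = {term: f"{prefix}{i}" for i, term in enumerate(terms, start=1)}
    # Match longer terms first so a term that contains another is replaced whole.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping

# Illustrative item: the logical form is unchanged, only the semantics are removed.
original = "All cats are mammals. All mammals are animals."
symbolic, mapping = symbolize(original, ["cats", "mammals", "animals"])
print(symbolic)
```

Running both versions through the same model and comparing accuracy gives a rough signal of how much the model leans on familiar word meanings rather than the logical structure.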

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPT-4 exam percentiles	GRE Verbal 169/170 (~99th); SAT Math 700/800 (~89th)	human percentiles	n/a	Table 1 (OpenAI, 2023)	GPT-4 high percentiles on language-centric exams	Table 1
AGIEval average (few-shot CoT)	GPT-4 ~61.3% average	Human avg ~67%	approx. -6 percentage points vs. human avg	AGIEval (Zhong et al.) Table 2	Performance drops on reasoning-heavy tasks compared to human averages	Table 2

What To Try In 7 Days

Audit your evaluation sets for overlap with public web data and remove leaked examples.

Run adversarial anti-shortcut tests (trigger-word, style changes) against your models.

Add simple symbolic and causal probes to check reasoning gaps in your pipelines.
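The first item, auditing evaluation sets for overlap with public data, can be approximated with a simple n-gram overlap check. The sketch below is a rough starting point under stated assumptions: the helper names (`ngrams`, `flag_leaked`), the whitespace tokenization, and the default values `n=8` and `threshold=0.5` are illustrative choices, not settings taken from the paper.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_leaked(eval_examples, reference_docs, n=8, threshold=0.5):
    """Flag eval examples whose n-grams largely appear in the reference corpus.

    Returns a list of (example_index, overlap_fraction) pairs for examples
    whose overlap fraction is at least `threshold`.
    """
    reference = set()
    for doc in reference_docs:
        reference |= ngrams(doc, n)

    flagged = []
    for idx, example in enumerate(eval_examples):
        grams = ngrams(example, n)
        if not grams:
            continue  # example shorter than n tokens: nothing to compare
        overlap = len(grams & reference) / len(grams)
        if overlap >= threshold:
            flagged.append((idx, overlap))
    return flagged

# Tiny illustration with n=3 so the short strings produce n-grams.
docs = ["the quick brown fox jumps over the lazy dog"]
evals = ["the quick brown fox jumps over a fence", "what is the capital of france"]
print(flag_leaked(evals, docs, n=3))
```

Exact n-gram matching misses paraphrased leaks, so flagged examples are best treated as a lower bound on contamination and reviewed by hand before removal.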

Agent Features

Memory

Statistical memory from training data; no grounded episodic memory

Planning

Limited text-level planning; no embodied action planning

Tool Use

Textual simulation of tools; no physical affordance-driven tool use

Frameworks

Value system (proposed) as driver for autonomous task generation

World model (proposed) linking symbols to effects

Architectures

LLMs (transformer-based) as statistical predictors

Collaboration

Not discussed as multi-agent; conversational chaining can loop

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Perspective paper: no new experimental data provided by authors.

Relies on cited empirical studies; individual benchmark details vary by source.

When Not To Use

Not a source of new algorithms or code for immediate deployment.

Not suitable as sole evidence for model capability in production without further testing.

Failure Modes

Hallucination and surface-level fluency without grounding

Shortcut learning triggered by spurious cues

Core Entities

Models

GPT-4, GPT-3.5, Text-Davinci-3, LLaMA, Alpaca, OPT variants

Metrics

Accuracy, F1, percentile scores

Datasets

AGIEval, MATH, JEEBench, Gaokao, Corr2Cause, Only Connect, ACRE/ARC/BIG-Bench/RAVEN

Benchmarks

Standardized tests (SAT/GRE/LSAT/Bar/Gaokao/JEE)

Ability-oriented benchmarks (math, logic, causal, abstract, ToM, compositionality)
