Use scene graphs + LLMs to split long robot goals into short sub-goals so classical planners solve them fast and reliably

Overview

Decision Snapshot: Needs Validation

Strong simulation evidence shows big speed and success gains, but real-robot validation and robustness to smaller LLMs are missing.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 55%

Production readiness: 60%

Novelty: 45%

Authors

Yuchen Liu, Luigi Palmieri, Sebastian Koch, Ilche Georgievski, Marco Aiello

Links

Abstract / PDF / Code / Data

Why It Matters For Business

DELTA turns large, slow planning problems into fast, reliable sub-problems so robots can plan long household workflows quickly and with higher success, cutting compute and time costs when paired with a strong LLM.

Who Should Care

ML Engineer, Product Manager, Engineering Lead, Founder

Summary TLDR

DELTA feeds compact 3D scene graphs into an LLM to (1) generate PDDL domain/problem files, (2) prune irrelevant objects, and (3) decompose long goals into a sequence of sub-goals. An off-the-shelf automated planner then solves the sub-problems autoregressively, one after another. In simulation on five household-style domains and 3D Scene Graph (3DSG) scenes, DELTA (with GPT-4o) reached very high success rates: 98–100% on independent-task domains, and 80% and 74.7% on two dependent-task domains. It also cut planning time by roughly three orders of magnitude and expanded search nodes by roughly two orders of magnitude, compared with planning without decomposition and with several LLM baselines.
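The autoregressive sub-goal planning described above can be illustrated with a toy example. This is a minimal sketch, not DELTA's implementation: a hand-written breadth-first search stands in for the automated planner, the sub-goals are hard-coded rather than produced by an LLM, and the room layout, item names, and function names (`successors`, `bfs_plan`, `plan_autoregressively`) are all hypothetical. Carrying earlier sub-goals forward into later ones is a choice made for this sketch.

```python
from collections import deque

# Hypothetical toy world (not from the paper): rooms and their neighbours.
ROOMS = {
    "hall": ["kitchen", "office", "bedroom"],
    "kitchen": ["hall"],
    "office": ["hall"],
    "bedroom": ["hall"],
}

def successors(state):
    """Yield (action, next_state); state = (robot_room, held_item, item_locations)."""
    room, held, locs = state
    for nxt in ROOMS[room]:
        yield f"move {room} {nxt}", (nxt, held, locs)
    if held is None:
        for item, where in locs:
            if where == room:
                picked = tuple((i, None if i == item else w) for i, w in locs)
                yield f"pick {item} {room}", (room, item, picked)
    else:
        dropped = tuple((i, room if i == held else w) for i, w in locs)
        yield f"drop {held} {room}", (room, None, dropped)

def bfs_plan(start, goal):
    """Breadth-first stand-in for a planner. `goal` maps item -> room.
    Returns (plan, final_state, number_of_expanded_states)."""
    frontier = deque([(start, [])])
    seen = {start}
    expanded = 0
    while frontier:
        state, plan = frontier.popleft()
        locations = dict(state[2])
        if all(locations.get(item) == room for item, room in goal.items()):
            return plan, state, expanded
        expanded += 1
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [action]))
    raise ValueError("no plan found")

def plan_autoregressively(start, sub_goals):
    """Solve sub-goals in order; each search starts from the previous final state."""
    state, full_plan, total_expanded, achieved = start, [], 0, {}
    for sub_goal in sub_goals:
        achieved.update(sub_goal)  # sketch choice: keep earlier sub-goals satisfied
        plan, state, expanded = bfs_plan(state, achieved)
        full_plan += plan
        total_expanded += expanded
    return full_plan, state, total_expanded

START = ("hall", None, (("book", "kitchen"), ("cup", "office"), ("lamp", "hall")))
SUB_GOALS = [{"lamp": "bedroom"}, {"book": "office"}, {"cup": "kitchen"}]
GOAL = {item: room for sub_goal in SUB_GOALS for item, room in sub_goal.items()}

mono_plan, _, mono_expanded = bfs_plan(START, GOAL)
step_plan, _, step_expanded = plan_autoregressively(START, SUB_GOALS)
print("monolithic:", len(mono_plan), "actions,", mono_expanded, "states expanded")
print("decomposed:", len(step_plan), "actions,", step_expanded, "states expanded")
```

Printing both counts lets you compare search effort on your own instances. The decomposed plan is never shorter than the breadth-first optimum for the full goal, which is the usual trade-off: less search per call in exchange for possibly longer plans.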

Problem Statement

Robots struggle to plan long sequences in large scenes because raw LLM plans are often infeasible and classical planners explode in complexity when given many irrelevant objects; we need an automated pipeline that turns scene graphs and a user goal into correct, executable, and efficient task plans.

Main Contribution

DELTA: a 5-step pipeline that combines scene graphs and LLMs to auto-generate PDDL domain/problem files, prune irrelevant items, decompose long goals, and plan sub-tasks autoregressively.

Demonstration that goal decomposition plus SG pruning yields large speedups and higher success rates on long-term household planning tasks versus several LLM-based baselines.

Key Findings

DELTA with GPT-4o achieves highest success rates across evaluated domains.

Numbers: PC 98%, Dining 100%, Cleaning 80%, Office 74.67% (Table II)

Practical Use: Use DELTA (best with a strong LLM) to reliably solve long household planning tasks in simulation; expect near-perfect results for independent subtasks and strong but lower results for dependent subtasks.

Evidence Ref: Table II

Goal decomposition reduces automated planner runtime by thousands of times.

Numbers: PC planning time: 0.0134s vs 51.76s (≈3860× faster) when decomposed (Table III)

Practical Use: Decompose long goals into sub-goals before calling a classical planner to turn planning from minutes into milliseconds in many cases.

Evidence Ref: Table III
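The speedup figure above follows directly from the two reported PC planning times. A small check of the arithmetic, using only the Table III numbers quoted here (the variable names are illustrative):

```python
import math

t_decomposed = 0.0134  # seconds, PC domain with goal decomposition (Table III)
t_monolithic = 51.76   # seconds, PC domain without decomposition (Table III)

speedup = t_monolithic / t_decomposed
orders_of_magnitude = math.log10(speedup)

print(f"speedup ~ {speedup:.0f}x, about {orders_of_magnitude:.1f} orders of magnitude")
# prints: speedup ~ 3863x, about 3.6 orders of magnitude
```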

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
success rate (DELTA, GPT-4o)	PC 98%; Dining 100%; Cleaning 80%; Office 74.67%	various LLM baselines (Table II)	Up to +~98% absolute vs naive LLM-as-planner in dependent-task domains	averaged over 3 scenes and 50 trials each (600 trials total)	Table II	Table II
planning time (per-domain)	PC 0.0134s (DELTA) vs 51.76s (no decomposition)	DELTA (w/o decomposition)	≈3.9k× faster	PC domain, averaged succeeded cases	Table III	Table III

What To Try In 7 Days

Build or load a scene graph for your environment and test SG pruning to reduce irrelevant objects.

Use an LLM to generate a PDDL domain/problem and compare planning time vs your current pipeline.

Implement simple goal decomposition (sequence of sub-goals) and run an off-the-shelf planner per sub-goal to measure speedups.
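The first two experiments above can be prototyped without an LLM to establish a baseline. The sketch below assumes a scene graph stored as a plain dictionary and uses a hand-supplied list of relevant items in place of LLM-driven pruning. The predicate names (`robot-at`, `connected`, `item-at`), the type names, and the functions `prune_scene_graph` and `to_pddl_problem` are hypothetical and would need to match your own PDDL domain.

```python
# Hypothetical scene graph: room -> neighbouring rooms and contained items.
SCENE_GRAPH = {
    "kitchen": {"neighbors": ["hall"], "items": ["mug", "kettle", "sponge"]},
    "hall": {"neighbors": ["kitchen", "office"], "items": ["umbrella"]},
    "office": {"neighbors": ["hall"], "items": ["laptop", "stapler"]},
}

def prune_scene_graph(graph, relevant_items):
    """Return a copy of the graph that keeps only the items in `relevant_items`."""
    return {
        room: {
            "neighbors": list(info["neighbors"]),
            "items": [i for i in info["items"] if i in relevant_items],
        }
        for room, info in graph.items()
    }

def to_pddl_problem(graph, robot_room, goal_facts, name="tidy-up", domain="household"):
    """Render a PDDL problem string; predicate and type names are assumptions."""
    rooms = sorted(graph)
    items = sorted(i for info in graph.values() for i in info["items"])
    objects = " ".join(rooms) + " - room"
    if items:
        objects += " " + " ".join(items) + " - item"
    init = [f"(robot-at {robot_room})"]
    for room in rooms:
        init += [f"(connected {room} {n})" for n in graph[room]["neighbors"]]
        init += [f"(item-at {i} {room})" for i in graph[room]["items"]]
    return "\n".join([
        f"(define (problem {name})",
        f"  (:domain {domain})",
        f"  (:objects {objects})",
        "  (:init " + " ".join(init) + ")",
        "  (:goal (and " + " ".join(goal_facts) + ")))",
    ])

pruned = prune_scene_graph(SCENE_GRAPH, {"mug", "laptop"})
problem_text = to_pddl_problem(pruned, "hall", ["(item-at mug office)"])
print(problem_text)
```

Generating the same problem from the unpruned graph and timing your planner on both files gives a first estimate of how much pruning helps in your environment.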

Agent Features

Memory

uses 3D Scene Graphs as structured environment memory

Planning

goal decomposition, one-shot domain generation (PDDL), autogeneration of PDDL problem files

Tool Use

PDDL generation via LLM, Fast Downward planner, PDDLGym simulation, VAL plan validator

Frameworks

PDDL

Is Agentic

Yes

Architectures

LLM + automated planner pipeline, autoregressive sub-task planning

Optimization Features

Token Efficiency

SG pruning reduces token usage and hallucination risk

System Optimization

autoregressive solving of sub-problems for faster planning

Inference Optimization

scene graph pruning to cut LLM token input, goal decomposition to reduce planner search space

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://delta-llm.github.io/

Data URLs

3D Scene Graph dataset (Armeni et al.)

Risks & Boundaries

Limitations

Assumes full observability and pre-built scene graphs; real perception noise not tested.

Sensitive to LLM choice—smaller models perform much worse.

When Not To Use

When you lack a reliable scene graph or cannot precompute environment topology.

When planning must run on a small, resource-constrained LLM without an external planner.

Failure Modes

Planner timeout on undecomposed or unpruned problems.

LLM-generated incorrect predicates (wrong relations between rooms/items).
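One way to catch the second failure mode early is to cross-check LLM-generated initial-state facts against the scene graph before the problem reaches the planner. The following is a minimal sketch under assumed conventions, not part of DELTA: the predicate names `item-at` and `connected`, the data layout, and the function `check_init_facts` are hypothetical.

```python
def check_init_facts(facts, item_locations, edges):
    """Return a list of problems found in LLM-generated PDDL init facts.

    item_locations: dict item -> room, taken from the scene graph.
    edges: set of (room_a, room_b) pairs, treated as undirected.
    """
    problems = []
    for fact in facts:
        parts = fact.strip().strip("()").split()
        if not parts:
            problems.append(f"empty fact: {fact!r}")
            continue
        predicate, args = parts[0], parts[1:]
        if predicate == "item-at" and len(args) == 2:
            item, room = args
            if item_locations.get(item) != room:
                problems.append(f"{fact}: scene graph places {item} in {item_locations.get(item)}")
        elif predicate == "connected" and len(args) == 2:
            a, b = args
            if (a, b) not in edges and (b, a) not in edges:
                problems.append(f"{fact}: no such edge in scene graph")
        else:
            problems.append(f"{fact}: unknown or malformed predicate")
    return problems

ITEM_LOCATIONS = {"mug": "kitchen", "laptop": "office"}
EDGES = {("kitchen", "hall"), ("hall", "office")}
LLM_FACTS = [
    "(item-at mug kitchen)",
    "(item-at laptop kitchen)",     # wrong room
    "(connected hall kitchen)",     # fine: edges are undirected here
    "(connected kitchen office)",   # no such edge
]

issues = check_init_facts(LLM_FACTS, ITEM_LOCATIONS, EDGES)
for issue in issues:
    print(issue)
```

A non-empty result can be fed back to the LLM as a correction prompt or used to reject the generated problem outright.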

Core Entities

Models

GPT-4o, GPT-4-turbo, Llama-3-70B

Metrics

success rate, plan length, planning time, expanded nodes

Datasets

3D Scene Graph dataset (Armeni et al.)

Benchmarks

Laundry, PC Assembly, Dining Table Setup, House Cleaning, Home Office Setup

