AutoHD: ask an LLM to write Python heuristics, evolve them, and use those heuristics to guide search at inference time

Overview

Decision Snapshot: Needs Validation

The method is practical for discrete, small-to-moderate planning tasks where you can encode states and run search; evidence is strong on toy benchmarks but limited to the tested domains and small validation sizes.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 55%

Production readiness: 65%

Novelty: 60%

Authors

Hongyi Ling, Shubham Parashar, Sambhav Khurana, Blake Olson, Anwesha Basu, Gaurangi Sinha, Zhengzhong Tu, James Caverlee, Shuiwang Ji

Why It Matters For Business

AutoHD improves LLM planning accuracy without extra model training and produces interpretable Python heuristics you can inspect and reuse.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Data Scientist, CTO

Summary TLDR

AutoHD prompts an LLM to generate explicit heuristic functions (as Python) and then refines them via an LLM-driven evolution loop. The best heuristic guides inference-time search (A* or greedy best-first search, abbreviated greedy BFS), so the LLM no longer needs to self-verify every intermediate step. On three planning benchmarks (Blocksworld, Game of 24, Rubik's Cube), AutoHD raises accuracy substantially over CoT/ToT baselines and other search or verifier methods, while requiring no model fine-tuning and providing interpretable heuristic code.
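A minimal sketch of the heuristic-guided search step described above, assuming unit step costs and hashable states. The function name `best_first_search`, the `use_path_cost` flag, and the toy number puzzle are illustrative assumptions, not AutoHD's implementation; in practice the `heuristic` argument would be the LLM-written Python function.

```python
import heapq
from itertools import count
from typing import Callable, Hashable, Iterable, List, Optional


def best_first_search(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    successors: Callable[[Hashable], Iterable[Hashable]],
    heuristic: Callable[[Hashable], float],
    use_path_cost: bool = True,
    max_expansions: int = 10_000,
) -> Optional[List[Hashable]]:
    """Return a list of states from start to a goal, or None if none is found.

    use_path_cost=True  -> A*-style priority f = g + h (unit step cost assumed)
    use_path_cost=False -> greedy best-first priority f = h
    """
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(heuristic(start), next(tie), 0, start)]
    parent = {start: None}
    best_g = {start: 0}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, g, state = heapq.heappop(frontier)
        if g > best_g[state]:
            continue  # stale entry: a cheaper route to this state was found later
        if is_goal(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        expansions += 1
        for nxt in successors(state):
            new_g = g + 1
            if nxt not in best_g or new_g < best_g[nxt]:
                best_g[nxt] = new_g
                parent[nxt] = state
                priority = heuristic(nxt) + (new_g if use_path_cost else 0)
                heapq.heappush(frontier, (priority, next(tie), new_g, nxt))
    return None


# Toy problem (hypothetical): reach TARGET from 1 using "+1" or "*2".
TARGET = 10


def toy_successors(n: int) -> List[int]:
    return [m for m in (n + 1, n * 2) if m <= 2 * TARGET]


def toy_heuristic(n: int) -> float:
    # Hand-written stand-in for an LLM-generated heuristic; not admissible.
    return abs(TARGET - n)


if __name__ == "__main__":
    print(best_first_search(1, lambda s: s == TARGET, toy_successors,
                            toy_heuristic, use_path_cost=False))
```

Because the toy heuristic is not admissible, the A*-style mode here is not guaranteed to return a shortest path; it only shows how the same heuristic plugs into either priority rule.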

Problem Statement

LLM-based planning methods rely either on unreliable LLM self-verification or on costly external verifiers. What is needed is a lightweight, interpretable way to evaluate intermediate states so that search can be guided accurately at inference time without extra model training.

Main Contribution

AutoHD: a pipeline that prompts an LLM to produce heuristic functions as Python, then uses those heuristics to guide search during inference.

Heuristic evolution: an iterative LLM-driven generation + selection loop that refines heuristics using a small validation set.
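The generation + selection loop can be sketched as follows. This is a hedged illustration, not the authors' code: `evolve`, `propose`, and `score` are hypothetical names, `propose` stands in for the LLM call that rewrites parent heuristics, and `score` stands in for whatever validation metric is used (in the setting above, accuracy of heuristic-guided search on the small validation set). Candidates are assumed to be Python source strings that define a function named `heuristic`.

```python
from typing import Callable, List, Sequence, Tuple

Heuristic = Callable[[object], float]


def evolve(
    seed_sources: Sequence[str],
    propose: Callable[[List[str]], List[str]],
    score: Callable[[Heuristic], float],
    rounds: int = 3,
    keep: int = 2,
) -> Tuple[str, float]:
    """Run generation + selection over heuristic source strings."""

    def compile_heuristic(src: str) -> Heuristic:
        namespace: dict = {}
        exec(src, namespace)  # only run trusted or sandboxed code this way
        return namespace["heuristic"]

    def evaluate(sources: Sequence[str]) -> List[Tuple[float, str]]:
        scored = []
        for src in sources:
            try:
                scored.append((score(compile_heuristic(src)), src))
            except Exception:
                continue  # drop candidates that fail to compile or crash
        return scored

    population = evaluate(seed_sources)
    for _ in range(rounds):
        population.sort(key=lambda p: p[0], reverse=True)
        survivors = population[:keep]
        parents = [src for _, src in survivors]
        population = survivors + evaluate(propose(parents))
    if not population:
        raise ValueError("no candidate heuristic survived evaluation")
    best_score, best_src = max(population, key=lambda p: p[0])
    return best_src, best_score


# Toy stand-ins (all hypothetical) so the loop can be run end to end.
SEEDS = [
    "def heuristic(n):\n    return 0.0\n",
    "def heuristic(n):\n    return abs(10 - n)\n",
    "def heuristic(n):\n    return undefined_name\n",  # broken on purpose
]

# Pairs of (state closer to 10, state farther from 10).
VALIDATION = [(8, 3), (9, 5), (10, 2)]


def toy_propose(parents: List[str]) -> List[str]:
    # A real system would prompt an LLM with the parents here.
    return [p.replace("abs(10 - n)", "abs(10 - n) / 2") for p in parents]


def toy_score(h: Heuristic) -> float:
    # Fraction of pairs where the closer state gets the lower value.
    return sum(h(a) < h(b) for a, b in VALIDATION) / len(VALIDATION)


if __name__ == "__main__":
    print(evolve(SEEDS, toy_propose, toy_score))
```

Keeping the survivors in the population means the best score never decreases between rounds, which is one simple selection policy among several possible ones.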

Key Findings

AutoHD substantially improves planning accuracy on Blocksworld when compared to baselines.

Numbers: AutoHD 'All' accuracies: 42.4% (GPT-4o-mini), 75.1% (GPT-4o), 59.1% (LLaMA 3.1 70B)

Practical Use: If you use LLMs for discrete block-manipulation planning, adding LLM-written heuristics can roughly double baseline accuracy on the evaluated datasets.

Evidence Ref: Table 8; Blocksworld 'All' column

On Rubik's Cube, AutoHD beats strong baselines and a trained policy-based method.

Numbers: AutoHD: 82.5% / 83.1% / 84.7% vs XoT: 67.2% / 79.8% / 78.1% (GPT-4o-mini / GPT-4o / LLaMA)

Practical Use: For multi-step spatial puzzles, using LLM-generated heuristics with search gives large accuracy gains over naive CoT/ToT and improves over a trained policy+MCTS method on the tested 2×2 dataset.

Evidence Ref: Table 4 and discussion in Section 4.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	42.4% / 75.1% / 59.1% (GPT-4o-mini / GPT-4o / LLaMA 3.1 70B)	best baseline varies (ToT/CoT-SC etc.)	up to ~2× vs baselines on some LLMs	Blocksworld (All)	Table 8; Section 4.1	Table 8
Accuracy	54% / 70% / 69%	CoT/CoT-SC/ToT	substantial absolute gains vs simple methods (IO/CoT)	Game of 24	Section 4.2 and Table 3	Table 3

What To Try In 7 Days

Prompt your preferred LLM to generate simple heuristic scoring functions for one planning task (use provided prompts).

Run heuristic-guided greedy BFS or A* using the generated heuristics and compare to your current CoT pipeline.

Implement a small evolution loop: generate ~10 heuristics, validate them on ~10 held-out examples, pick the best, and test it.
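As a concrete starting point for the first two steps, here is a small sketch of greedy best-first search on Game of 24 with a hand-written heuristic standing in for an LLM-generated one. Everything here is an assumption for illustration (the state encoding as a sorted tuple of `Fraction` values, the heuristic formula, the function names); it reports only whether a solution exists and how many states were expanded, not the expression itself.

```python
import heapq
from fractions import Fraction
from itertools import combinations, count


def successors(state):
    """Combine any two numbers with +, -, *, / and return the resulting states."""
    out = set()
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [x for k, x in enumerate(state) if k not in (i, j)]
        results = {a + b, a - b, b - a, a * b}
        if b != 0:
            results.add(a / b)
        if a != 0:
            results.add(b / a)
        for r in results:
            out.add(tuple(sorted(rest + [r])))
    return out


def heuristic(state, target=24):
    """Stand-in for an LLM-written heuristic: lower values are explored first."""
    closeness = min(abs(x - target) for x in state)
    return float(closeness) + len(state)


def solve(numbers, target=24, max_expansions=5000):
    """Return (solved, expansions) using greedy best-first search."""
    start = tuple(sorted(Fraction(n) for n in numbers))
    tie = count()  # tie-breaker so equal priorities never compare states
    frontier = [(heuristic(start, target), next(tie), start)]
    seen = {start}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state = heapq.heappop(frontier)
        if len(state) == 1:
            if state[0] == target:
                return True, expansions
            continue  # dead end: one number left and it is not the target
        expansions += 1
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt, target), next(tie), nxt))
    return False, expansions


if __name__ == "__main__":
    puzzles = [[4, 9, 10, 13], [1, 2, 3, 4], [3, 3, 8, 8], [1, 1, 1, 1]]
    solved = [solve(p)[0] for p in puzzles]
    print("accuracy:", sum(solved) / len(puzzles))
```

For four-number puzzles the state space is small enough that this search is exhaustive within the default budget, so the heuristic mainly changes how many expansions are needed rather than whether a solution is found. Comparing the expansion counts of two candidate heuristics is one cheap way to rank them before running a full benchmark.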

Agent Features

Planning

heuristic-guided search, A* search, greedy BFS

Tool Use

LLM code generation (Python heuristics)

Frameworks

heuristic evolution (generation + selection)

Is Agentic

Yes

Architectures

single-agent LLM-guided planner

Optimization Features

Token Efficiency

fewer LLM calls at inference since heuristic is precomputed and reused

Inference Optimization

use heuristic to avoid repeated LLM evaluations during search

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/divelab/sys2bench/

Risks & Boundaries

Limitations

Validation sets for heuristic evolution are small (~10 examples), which risks overfitting heuristics to the validation split.

Evaluations are on discrete toy planning problems (Blocksworld, Game of 24, 2×2 Rubik's Cube) with short horizons; generalization to large, continuous, or long-horizon tasks is untested.

When Not To Use

Tasks without a clear symbolic/structured state representation.

High-dimensional continuous control or perception-heavy robotics where heuristics are hard to express in Python.

Failure Modes

LLM produces invalid or logically incorrect heuristic code that misguides search.

Heuristic overfits the small validation set and fails on real test distributions.
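The first failure mode can be partly mitigated by checking generated code before it is allowed to guide search. The sketch below is an assumption-laden illustration, not part of AutoHD: `load_heuristic` is a hypothetical helper, the expected function name `heuristic` is a convention chosen here, and the import check is a cheap guard rather than a security sandbox. It catches syntax errors, crashes, and non-numeric or non-finite outputs; it cannot detect a heuristic that runs cleanly but is logically wrong, which only shows up in validation accuracy.

```python
import ast
import math


def load_heuristic(source, sample_states, func_name="heuristic"):
    """Compile generated source and smoke-test it; raise ValueError on rejection."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        raise ValueError(f"syntax error: {exc}") from exc

    # Cheap guard only: reject imports. This is NOT a sandbox.
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise ValueError("imports are not allowed in this sketch")

    namespace = {}
    try:
        exec(compile(tree, "<heuristic>", "exec"), namespace)
    except Exception as exc:
        raise ValueError(f"module-level code failed: {exc}") from exc

    func = namespace.get(func_name)
    if not callable(func):
        raise ValueError(f"no callable named {func_name!r}")

    for state in sample_states:
        try:
            value = func(state)
        except Exception as exc:
            raise ValueError(f"crashed on {state!r}: {exc}") from exc
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric output {value!r} on {state!r}")
        if not math.isfinite(value):
            raise ValueError(f"non-finite output {value!r} on {state!r}")
    return func


if __name__ == "__main__":
    good = "def heuristic(n):\n    return abs(10 - n)\n"
    h = load_heuristic(good, sample_states=[0, 5, 10])
    print(h(7))
```

Running this check on a handful of representative states before the validation pass keeps broken candidates from consuming search budget.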

Core Entities

Models

GPT-4o, GPT-4o-mini, Llama 3.1 70B, o1-mini

Metrics

Accuracy

Datasets

Blocksworld, Game of 24, 2×2 Rubik's Cube

Benchmarks

Blocksworld, Game of 24, Rubik's Cube

Context Entities

Models

XoT (policy+MCTS baseline), ToT (Tree of Thoughts), CoT, CoT-SC

Metrics

self-consistency, Accuracy
