AutoHD: ask an LLM to write Python heuristics, evolve them, and use those heuristics to guide search at inference time

February 26, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical for discrete, small-to-moderate planning tasks where you can encode states and run search; evidence is strong on toy benchmarks but limited to the tested domains and small validation sizes.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 55%

Production readiness: 65%

Novelty: 60%

Authors

Hongyi Ling, Shubham Parashar, Sambhav Khurana, Blake Olson, Anwesha Basu, Gaurangi Sinha, Zhengzhong Tu, James Caverlee, Shuiwang Ji

Links

Abstract / PDF / Code

Why It Matters For Business

AutoHD improves LLM planning accuracy without extra model training and produces interpretable Python heuristics you can inspect and reuse.

Who Should Care

Summary TLDR

AutoHD prompts an LLM to generate explicit heuristic functions (as Python code) and then refines them via an LLM-driven evolution loop. The best heuristic guides inference-time search (A* or greedy best-first search, abbreviated greedy BFS), so the LLM no longer needs to self-verify every intermediate step. On three planning benchmarks (Blocksworld, Game of 24, Rubik's Cube), AutoHD raises accuracy substantially over CoT/ToT baselines and other search or verifier methods, while requiring no model fine-tuning and providing interpretable heuristic code.
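To make the idea of heuristic-guided search concrete, here is a minimal sketch of greedy best-first search on Game of 24. The `heuristic` function is a hand-written, hypothetical stand-in for what an LLM might produce; it is not taken from the paper, and the state encoding (a sorted tuple of `Fraction` values) and all function names are assumptions made for illustration.

```python
import heapq
from fractions import Fraction
from itertools import combinations

TARGET = Fraction(24)

def successors(state):
    """Yield every state reachable by combining two numbers with + - * /."""
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [x for k, x in enumerate(state) if k not in (i, j)]
        results = {a + b, a - b, b - a, a * b}
        if b != 0:
            results.add(a / b)
        if a != 0:
            results.add(b / a)
        for r in results:
            yield tuple(sorted(rest + [r]))

def heuristic(state):
    """Hypothetical stand-in for an LLM-written heuristic (lower is better).

    Scores a state by how close its nearest number is to 24, plus the
    count of numbers still left to combine.
    """
    return float(min(abs(x - TARGET) for x in state)) + len(state)

def is_goal(state):
    return len(state) == 1 and state[0] == TARGET

def greedy_best_first(start, max_expansions=10_000):
    """Always expand the frontier state with the lowest heuristic value."""
    start = tuple(sorted(Fraction(x) for x in start))
    frontier = [(heuristic(start), start)]
    seen = {start}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, state = heapq.heappop(frontier)
        if is_goal(state):
            return True
        expansions += 1
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt), nxt))
    return False

print(greedy_best_first([4, 7, 8, 8]))  # solvable via (7 - 8 / 8) * 4
```

Note that the search itself makes no LLM calls: once the heuristic exists as code, every state evaluation is an ordinary function call.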

Problem Statement

LLM-based planning methods rely either on unreliable LLM self-verification or on costly external verifiers. We need a lightweight, interpretable way to evaluate intermediate states so that search can be guided accurately at inference time without extra model training.

Main Contribution

AutoHD: a pipeline that prompts an LLM to produce heuristic functions as Python, then uses those heuristics to guide search during inference.

Heuristic evolution: an iterative LLM-driven generation + selection loop that refines heuristics using a small validation set.
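The generation + selection loop can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `propose` is a hypothetical callable that would wrap an LLM call (receiving the current best candidates as context), and `score` is a hypothetical callable that would return a candidate's accuracy on the small validation set. The parameter names and defaults are placeholders.

```python
from typing import Callable, List, Tuple, TypeVar

C = TypeVar("C")  # a candidate heuristic (e.g. source code or a callable)

def evolve(
    propose: Callable[[List[Tuple[float, C]]], C],
    score: Callable[[C], float],
    rounds: int = 3,
    population: int = 5,
    keep: int = 2,
) -> Tuple[float, C]:
    """Return (best_score, best_candidate) after `rounds` of evolution."""
    pool: List[Tuple[float, C]] = []
    for _ in range(rounds):
        # Generation: ask for new candidates, conditioned on the survivors.
        candidates = [propose(pool) for _ in range(population)]
        # Selection: score the newcomers and keep the top `keep` overall.
        scored = pool + [(score(c), c) for c in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        pool = scored[:keep]
    return pool[0]
```

In a real run, `propose` would format the surviving heuristics and their scores into a prompt, and `score` would run search with the candidate heuristic on each validation problem and report the fraction solved.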

Key Findings

AutoHD substantially improves planning accuracy on Blocksworld when compared to baselines.

Numbers: AutoHD 'All' accuracies: 42.4% (GPT-4o-mini), 75.1% (GPT-4o), 59.1% (LLaMA 3.1 70B)

Practical Use: If you use LLMs for discrete block-manipulation planning, adding LLM-written heuristics can roughly double baseline accuracy on evaluated datasets.

Evidence Ref: Table 8; Blocksworld 'All' column

On Rubik's Cube AutoHD beats strong baselines and a trained policy-based method.

Numbers: AutoHD: 82.5%/83.1%/84.7% vs XoT: 67.2%/79.8%/78.1% (GPT-4o-mini/4o/LLaMA)

Practical Use: For multi-step spatial puzzles, using LLM-generated heuristics with search gives large accuracy gains over naive CoT/ToT and improves over a trained policy+MCTS method on the tested 2×2 dataset.

Evidence Ref: Table 4 and discussion in Section 4.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 42.4% / 75.1% / 59.1% | best baseline varies (ToT/CoT-SC etc.) | roughly up to ~2× vs baselines on some LLMs | Blocksworld (All) | Table 8; Section 4.1 | Table 8
Accuracy | 54% / 70% / 69% | CoT/CoT-SC/ToT | substantial absolute gains vs simple methods (IO/CoT) | Game of 24 | Section 4.2 and Table 3 | Table 3

What To Try In 7 Days

Prompt your preferred LLM to generate simple heuristic scoring functions for one planning task (use provided prompts).

Run heuristic-guided greedy BFS or A* using the generated heuristics and compare to your current CoT pipeline.

Implement a small evolution loop: generate ~10 heuristics, validate on ~10 held-out examples, pick the best and test.
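For the comparison step above, a generic A* routine that reports how many states it expanded gives a simple way to see how much a heuristic helps. The sketch below is a minimal version under stated assumptions: the function signature, the grid toy problem, and the two heuristics (`blind` and `manhattan`) are illustrative choices, not part of AutoHD. Swap in your own `successors`, `is_goal`, and LLM-generated `h`.

```python
import heapq
from itertools import count

def a_star(start, is_goal, successors, h, max_expansions=100_000):
    """Return (path_cost, expansions), or (None, expansions) if no plan is found."""
    tie = count()  # insertion counter so the heap never compares states directly
    h0 = h(start)
    # Entries are (f, h, tie, g, state); among equal f, prefer smaller h.
    frontier = [(h0, h0, next(tie), 0, start)]
    best_g = {start: 0}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, _, g, state = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue  # stale entry: a cheaper route to this state was found later
        if is_goal(state):
            return g, expansions
        expansions += 1
        for nxt, step_cost in successors(state):
            new_g = g + step_cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                h_nxt = h(nxt)
                heapq.heappush(frontier, (new_g + h_nxt, h_nxt, next(tie), new_g, nxt))
    return None, expansions

# Toy problem (assumption for illustration): walk an 8x8 grid corner to corner.
N = 8
GOAL = (N - 1, N - 1)

def grid_successors(pos):
    x, y = pos
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < N and 0 <= ny < N:
            yield (nx, ny), 1

def blind(pos):
    return 0  # no guidance: A* degenerates to uniform-cost search

def manhattan(pos):
    return abs(GOAL[0] - pos[0]) + abs(GOAL[1] - pos[1])

results = {}
for name, h in (("blind", blind), ("manhattan", manhattan)):
    results[name] = a_star((0, 0), lambda s: s == GOAL, grid_successors, h)
    print(name, "cost:", results[name][0], "expansions:", results[name][1])
```

Both runs should find the same path cost; the informed heuristic should need fewer expansions. The same expansion counter can serve as a secondary metric when ranking candidate heuristics that tie on accuracy.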

Agent Features

Planning
heuristic-guided search, A* search, greedy BFS
Tool Use
LLM code generation (Python heuristics)
Frameworks
heuristic evolution (generation + selection)
Is Agentic

Yes

Architectures
single-agent LLM-guided planner

Optimization Features

Token Efficiency
fewer LLM calls at inference since heuristic is precomputed and reused
Inference Optimization
use heuristic to avoid repeated LLM evaluations during search

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Validation sets for heuristic evolution are small (~10 examples), which risks overfitting heuristics to the validation split.

Evaluations are on discrete toy planning problems (Blocksworld, Game of 24, 2×2 Rubik's Cube) with short horizons; generalization to large, continuous, or long-horizon tasks is untested.

When Not To Use

Tasks without a clear symbolic/structured state representation.

High-dimensional continuous control or perception-heavy robotics where heuristics are hard to express in Python.

Failure Modes

LLM produces invalid or logically incorrect heuristic code that misguides search.

Heuristic overfits the small validation set and fails on real test distributions.
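One inexpensive guard against the first failure mode is to smoke-test each generated heuristic before it is allowed into search. The sketch below is an assumption about how such a check could look, not something described in the source: `load_heuristic`, the expected function name `heuristic`, and the sample-state format are all hypothetical. It only catches code that fails to compile, crashes, or returns a non-numeric or non-finite value; logically wrong but well-formed heuristics still have to be caught by validation accuracy.

```python
import math

def load_heuristic(source, sample_states, func_name="heuristic"):
    """Compile LLM-written source; return the function, or None if it fails basic checks."""
    namespace = {}
    try:
        # NOTE: exec is not a sandbox. Run untrusted generated code in an
        # isolated process or container in any real deployment.
        exec(source, namespace)
        func = namespace[func_name]
        for state in sample_states:
            value = func(state)
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return None  # heuristic must return a plain number
            if not math.isfinite(value):
                return None  # reject NaN and infinities
    except Exception:
        return None  # syntax errors, missing function, runtime crashes
    return func

# Hypothetical generated snippets used only to exercise the checker.
good = "def heuristic(state):\n    return sum(1 for a, b in zip(state, sorted(state)) if a != b)\n"
bad_syntax = "def heuristic(state) return 0"
bad_value = "def heuristic(state):\n    return float('nan')\n"

samples = [(3, 1, 2), (1, 2, 3)]
for label, src in (("good", good), ("bad_syntax", bad_syntax), ("bad_value", bad_value)):
    print(label, "accepted" if load_heuristic(src, samples) else "rejected")
```

For the overfitting failure mode, the analogous guard is to keep a second held-out set that the evolution loop never sees and to compare accuracy on it against the validation score.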

Core Entities

Models

GPT-4o, GPT-4o-mini, Llama 3.1 70B, O1 mini

Metrics

Accuracy

Datasets

Blocksworld, Game of 24, 2x2 Rubik's Cube

Benchmarks

Blocksworld, Game of 24, Rubik's Cube

Context Entities

Models

XoT (policy+MCTS baseline), ToT (Tree of Thoughts), CoT, CoT-SC

Metrics

self-consistency, Accuracy