AutoHD: ask an LLM to write Python heuristics, evolve them, and use those heuristics to guide search at inference time

February 26, 20257 min

Overview

Production Readiness

0.65

Novelty Score

0.6

Cost Impact Score

0.55

Citation Count

0

Authors

Hongyi Ling, Shubham Parashar, Sambhav Khurana, Blake Olson, Anwesha Basu, Gaurangi Sinha, Zhengzhong Tu, James Caverlee, Shuiwang Ji

Links

Abstract / PDF

Why It Matters For Business

AutoHD improves LLM planning accuracy without extra model training and produces interpretable Python heuristics you can inspect and reuse.

Summary TLDR

AutoHD prompts an LLM to generate explicit heuristic functions (as Python) and then refines them via an LLM-driven evolution loop. The best heuristic guides inference-time search (A* or greedy BFS) so the LLM no longer needs to self-verify every intermediate step. On three planning benchmarks (Blocksworld, Game of 24, Rubik's Cube), AutoHD raises accuracy substantially versus CoT/ToT baselines and other search or verifier methods, while requiring no model fine-tuning and providing interpretable heuristic code.

Problem Statement

LLM-based planning methods either rely on unreliable LLM self-verification or costly external verifiers. We need a lightweight, interpretable way to evaluate intermediate states so search can be guided accurately at inference time without extra model training.

Main Contribution

AutoHD: a pipeline that prompts an LLM to produce heuristic functions as Python, then uses those heuristics to guide search during inference.

Heuristic evolution: an iterative LLM-driven generation + selection loop that refines heuristics using a small validation set.

Extensive empirical study on Blocksworld, Game of 24, and 2×2 Rubik’s Cube showing large accuracy gains vs common baselines, with released code in Sys2Bench.

Key Findings

AutoHD substantially improves planning accuracy on Blocksworld when compared to baselines.

NumbersAutoHD All accuracies: 42.4% (GPT-4o-mini), 75.1% (GPT-4o), 59.1% (LLaMA 3.1 70B)

On Rubik's Cube AutoHD beats strong baselines and a trained policy-based method.

NumbersAutoHD: 82.5%/83.1%/84.7% vs XoT: 67.2%/79.8%/78.1% (GPT-4o-mini/4o/LLaMA)

Heuristic evolution helps, but gains saturate quickly.

NumbersSelecting across all generations: Blocksworld step-2 95.7% vs final-generation 93.3%; Game of 24 54% vs 41%

Results

Accuracy

Value42.4% / 75.1% / 59.1%

Baselinebest baseline varies (ToT/CoT-SC etc.)

Accuracy

Value54% / 70% / 69%

BaselineCoT/CoT-SC/ToT

Accuracy

Value82.5% / 83.1% / 84.7%

BaselineXoT (67.2% / 79.8% / 78.1%)

Who Should Care

What To Try In 7 Days

Prompt your preferred LLM to generate simple heuristic scoring functions for one planning task (use provided prompts).

Run heuristic-guided greedy BFS or A* using the generated heuristics and compare to your current CoT pipeline.

Implement a small evolution loop: generate ~10 heuristics, validate on ~10 held examples, pick the best and test.

Agent Features

Planning

  • heuristic-guided search
  • A* search
  • greedy BFS

Tool Use

  • LLM code generation (Python heuristics)

Frameworks

  • heuristic evolution (generation + selection)

Is Agentic

true

Architectures

  • single-agent LLM-guided planner

Optimization Features

Token Efficiency

  • fewer LLM calls at inference since heuristic is precomputed and reused

Inference Optimization

  • use heuristic to avoid repeated LLM evaluations during search

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Validation sets for heuristic evolution are small (~10 examples), which risks overfitting heuristics to the validation split.
  • Evaluations are on discrete toy planning problems (Blocksworld, Game of 24, 2×2 Rubik) with short horizons; generalization to large, continuous, or long-horizon tasks is untested.
  • Method depends on LLMs generating correct, executable Python heuristics; generated code can be buggy or unsafe without checks.
  • Search cost still grows with branching factor; heuristics reduce but do not eliminate expensive search for very large state spaces.

When Not To Use

  • Tasks without a clear symbolic/structured state representation.
  • High-dimensional continuous control or perception-heavy robotics where heuristics are hard to express in Python.
  • Safety-critical systems unless the generated heuristics are formally validated.

Failure Modes

  • LLM produces invalid or logically incorrect heuristic code that misguides search.
  • Heuristic overfits the small validation set and fails on real test distributions.
  • Search budget limits prevent finding solutions even with a good heuristic when branching is huge.
  • Heuristics can encode LLM biases, leading search toward favored but wrong solutions.

Core Entities

Models

  • GPT-4o
  • GPT-4o-mini
  • Llama 3.1 70B
  • O1 mini

Metrics

  • Accuracy

Datasets

  • Blocksworld
  • Game of 24
  • 2x2 Rubik's Cube

Benchmarks

  • Blocksworld
  • Game of 24
  • Rubik's Cube

Context Entities

Models

  • XoT (policy+MCTS baseline)
  • ToT (Tree of Thoughts)
  • CoT, CoT-SC

Metrics

  • self-consistency
  • Accuracy