Overview
Production Readiness
0.65
Novelty Score
0.6
Cost Impact Score
0.55
Citation Count
0
Why It Matters For Business
AutoHD improves LLM planning accuracy without extra model training and produces interpretable Python heuristics you can inspect and reuse.
Summary TLDR
AutoHD prompts an LLM to generate explicit heuristic functions (as Python) and then refines them via an LLM-driven evolution loop. The best heuristic guides inference-time search (A* or greedy BFS) so the LLM no longer needs to self-verify every intermediate step. On three planning benchmarks (Blocksworld, Game of 24, Rubik's Cube), AutoHD raises accuracy substantially versus CoT/ToT baselines and other search or verifier methods, while requiring no model fine-tuning and providing interpretable heuristic code.
Problem Statement
LLM-based planning methods either rely on unreliable LLM self-verification or costly external verifiers. We need a lightweight, interpretable way to evaluate intermediate states so search can be guided accurately at inference time without extra model training.
Main Contribution
AutoHD: a pipeline that prompts an LLM to produce heuristic functions as Python, then uses those heuristics to guide search during inference.
Heuristic evolution: an iterative LLM-driven generation + selection loop that refines heuristics using a small validation set.
Extensive empirical study on Blocksworld, Game of 24, and 2×2 Rubik’s Cube showing large accuracy gains vs common baselines, with released code in Sys2Bench.
Key Findings
AutoHD substantially improves planning accuracy on Blocksworld when compared to baselines.
On Rubik's Cube AutoHD beats strong baselines and a trained policy-based method.
Heuristic evolution helps, but gains saturate quickly.
Results
Accuracy
Accuracy
Accuracy
Who Should Care
What To Try In 7 Days
Prompt your preferred LLM to generate simple heuristic scoring functions for one planning task (use provided prompts).
Run heuristic-guided greedy BFS or A* using the generated heuristics and compare to your current CoT pipeline.
Implement a small evolution loop: generate ~10 heuristics, validate on ~10 held examples, pick the best and test.
Agent Features
Planning
- heuristic-guided search
- A* search
- greedy BFS
Tool Use
- LLM code generation (Python heuristics)
Frameworks
- heuristic evolution (generation + selection)
Is Agentic
true
Architectures
- single-agent LLM-guided planner
Optimization Features
Token Efficiency
- fewer LLM calls at inference since heuristic is precomputed and reused
Inference Optimization
- use heuristic to avoid repeated LLM evaluations during search
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Validation sets for heuristic evolution are small (~10 examples), which risks overfitting heuristics to the validation split.
- Evaluations are on discrete toy planning problems (Blocksworld, Game of 24, 2×2 Rubik) with short horizons; generalization to large, continuous, or long-horizon tasks is untested.
- Method depends on LLMs generating correct, executable Python heuristics; generated code can be buggy or unsafe without checks.
- Search cost still grows with branching factor; heuristics reduce but do not eliminate expensive search for very large state spaces.
When Not To Use
- Tasks without a clear symbolic/structured state representation.
- High-dimensional continuous control or perception-heavy robotics where heuristics are hard to express in Python.
- Safety-critical systems unless the generated heuristics are formally validated.
Failure Modes
- LLM produces invalid or logically incorrect heuristic code that misguides search.
- Heuristic overfits the small validation set and fails on real test distributions.
- Search budget limits prevent finding solutions even with a good heuristic when branching is huge.
- Heuristics can encode LLM biases, leading search toward favored but wrong solutions.
Core Entities
Models
- GPT-4o
- GPT-4o-mini
- Llama 3.1 70B
- O1 mini
Metrics
- Accuracy
Datasets
- Blocksworld
- Game of 24
- 2x2 Rubik's Cube
Benchmarks
- Blocksworld
- Game of 24
- Rubik's Cube
Context Entities
Models
- XoT (policy+MCTS baseline)
- ToT (Tree of Thoughts)
- CoT, CoT-SC
Metrics
- self-consistency
- Accuracy

