Overview
The method is practical for discrete, small-to-moderate planning tasks where you can encode states and run search; evidence is strong on toy benchmarks but limited to the tested domains and small validation sizes.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 55%
Production readiness: 65%
Novelty: 60%
Why It Matters For Business
AutoHD improves LLM planning accuracy without extra model training and produces interpretable Python heuristics you can inspect and reuse.
Who Should Care
Summary TLDR
AutoHD prompts an LLM to generate explicit heuristic functions as Python code, then refines them through an LLM-driven evolution loop. The best heuristic guides inference-time search (A* or greedy best-first search), so the LLM no longer needs to self-verify every intermediate step. On three planning benchmarks (Blocksworld, Game of 24, Rubik's Cube), AutoHD raises accuracy substantially over CoT/ToT baselines and other search or verifier methods, while requiring no model fine-tuning and providing interpretable heuristic code.
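As a minimal sketch of heuristic-guided search (not the authors' implementation), the function below orders a priority queue by g + h for A* or by h alone for greedy best-first search. The toy domain (reach 10 from 1 using +1 or ×2), the unit step cost, and all names such as `heuristic_search` are assumptions made for illustration.

```python
import heapq
from itertools import count


def heuristic_search(start, is_goal, successors, heuristic, astar=True,
                     max_expansions=10_000):
    """Return a list of states from start to a goal, or None.

    astar=True orders the frontier by g + h (A*); astar=False orders it by
    h only (greedy best-first search). Every step is assumed to cost 1.
    """
    tie = count()  # tie-breaker so heapq never compares states or paths
    frontier = [(heuristic(start), next(tie), 0, start, [start])]
    best_g = {start: 0}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        if g > best_g.get(state, float("inf")):
            continue  # stale entry: a cheaper route to this state was found
        expansions += 1
        for nxt in successors(state):
            new_g = g + 1
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                priority = (new_g if astar else 0) + heuristic(nxt)
                heapq.heappush(
                    frontier, (priority, next(tie), new_g, nxt, path + [nxt])
                )
    return None


# Toy domain (an assumption for illustration): reach TARGET from 1.
TARGET = 10
toy_successors = lambda n: [m for m in (n + 1, n * 2) if m <= 2 * TARGET]
toy_heuristic = lambda n: abs(TARGET - n)  # stands in for a generated heuristic

astar_path = heuristic_search(1, lambda n: n == TARGET, toy_successors,
                              toy_heuristic, astar=True)
greedy_path = heuristic_search(1, lambda n: n == TARGET, toy_successors,
                               toy_heuristic, astar=False)
```

A* is guaranteed to return a shortest path when the heuristic is admissible. A generated heuristic carries no such guarantee, so the result is best read as a valid plan rather than a provably optimal one.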
Problem Statement
LLM-based planning methods rely on either unreliable LLM self-verification or costly external verifiers. What is needed is a lightweight, interpretable way to evaluate intermediate states, so that search can be guided accurately at inference time without extra model training.
Main Contribution
AutoHD: a pipeline that prompts an LLM to produce heuristic functions as Python, then uses those heuristics to guide search during inference.
Heuristic evolution: an iterative LLM-driven generation + selection loop that refines heuristics using a small validation set.
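The generation and selection loop can be sketched as follows. This is a hedged illustration, not the paper's procedure: `propose` is a hypothetical stand-in for the LLM call that rewrites a parent heuristic, `score` stands in for accuracy on the validation set, and candidates are plain numbers here so the sketch runs without a model.

```python
import random


def evolve(seeds, propose, score, rounds=5, pool_size=4, seed=0):
    """Keep the best-scoring candidates and ask `propose` for variants of them.

    The top half of the pool survives each round unchanged, so the best
    score in the pool never decreases from one round to the next.
    """
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(rounds):
        pool.sort(key=score, reverse=True)
        parents = pool[: max(1, pool_size // 2)]
        children = [
            propose(rng.choice(parents))
            for _ in range(pool_size - len(parents))
        ]
        pool = parents + children
    return max(pool, key=score)


# Stand-ins (assumptions): a candidate is a single weight, the "LLM" nudges
# it at random, and the "validation score" peaks at weight 3.0.
_mutation_rng = random.Random(1)
toy_propose = lambda parent: parent + _mutation_rng.uniform(-1.0, 1.0)
toy_score = lambda weight: -(weight - 3.0) ** 2

best = evolve([0.0, 10.0], toy_propose, toy_score)
```

In a real pipeline a candidate would be heuristic source code, `propose` would prompt the LLM with the parent code, and `score` would run guided search on the validation problems.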
Key Findings
AutoHD substantially improves planning accuracy on Blocksworld, reaching 42.4% / 75.1% / 59.1% (Table 8), up to roughly 2× the baselines on some LLMs.
On Rubik's Cube, AutoHD beats strong baselines and a trained policy-based method.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 42.4% / 75.1% / 59.1% | best baseline varies (ToT/CoT-SC etc.) | up to ~2× vs baselines on some LLMs | Blocksworld (All) | Table 8; Section 4.1 | Table 8 |
| Accuracy | 54% / 70% / 69% | CoT/CoT-SC/ToT | substantial absolute gains vs simple methods (IO/CoT) | Game of 24 | Section 4.2 and Table 3 | Table 3 |
What To Try In 7 Days
Prompt your preferred LLM to generate simple heuristic scoring functions for one planning task (use provided prompts).
Run heuristic-guided greedy best-first search or A* using the generated heuristics and compare to your current CoT pipeline.
Implement a small evolution loop: generate ~10 heuristics, validate on ~10 held-out examples, pick the best and test.
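The selection step in the list above can be tried on a toy problem before wiring in a real LLM. In this sketch the candidate heuristics are hand-written source strings standing in for LLM output, the domain (reach a goal number using +1 or ×2) is invented for illustration, and the expansion budget of 8 is an arbitrary assumption.

```python
import heapq

# Hypothetical candidates, written as source strings the way an LLM might
# return them. Each must define h(state, goal) -> number.
CANDIDATES = {
    "abs_gap": "def h(state, goal):\n    return abs(goal - state)\n",
    "zero": "def h(state, goal):\n    return 0\n",
    "reversed": "def h(state, goal):\n    return -abs(goal - state)\n",
}


def load_heuristic(source):
    namespace = {}
    exec(source, namespace)  # only run code you have reviewed or sandboxed
    return namespace["h"]


def solved_within_budget(h, start, goal, budget):
    """Greedy best-first search on the toy domain (moves: +1 or *2)."""
    frontier = [(h(start, goal), start)]
    seen = {start}
    for _ in range(budget):  # each iteration pops (expands) one state
        if not frontier:
            return False
        _, state = heapq.heappop(frontier)
        if state == goal:
            return True
        for nxt in (state + 1, state * 2):
            if nxt <= 2 * goal and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (h(nxt, goal), nxt))
    return False


def validation_accuracy(h, problems, budget=8):
    solved = sum(solved_within_budget(h, s, g, budget) for s, g in problems)
    return solved / len(problems)


# Small validation set of (start, goal) pairs, invented for this sketch.
VALIDATION = [(1, 10), (1, 17), (2, 25), (3, 40), (1, 33)]

scores = {
    name: validation_accuracy(load_heuristic(src), VALIDATION)
    for name, src in CANDIDATES.items()
}
best_name = max(scores, key=scores.get)
```

With so few validation problems the ranking can be noisy, so it is worth re-checking the selected heuristic on problems that were not used for selection.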
Agent Features
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Optimization Features
Token Efficiency
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Validation sets for heuristic evolution are small (~10 examples), which risks overfitting heuristics to the validation split.
Evaluations are on discrete toy planning problems (Blocksworld, Game of 24, 2×2 Rubik's Cube) with short horizons; generalization to large, continuous, or long-horizon tasks is untested.
When Not To Use
Tasks without a clear symbolic/structured state representation.
High-dimensional continuous control or perception-heavy robotics where heuristics are hard to express in Python.
Failure Modes
LLM produces invalid or logically incorrect heuristic code that misguides search.
Heuristic overfits the small validation set and fails on real test distributions.
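One inexpensive guard against the first failure mode is to smoke-test generated code before it reaches the search loop. The sketch below is an assumption about how such a check might look, not part of AutoHD; the function name `load_checked_heuristic` and the expected entry point `heuristic(state)` are hypothetical. It catches crashes and non-numeric outputs only, not heuristics that run cleanly but rank states badly.

```python
import math


def load_checked_heuristic(source, sample_states, func_name="heuristic"):
    """Return the function defined in `source`, or None if a basic check fails.

    Checks: the code compiles, defines `func_name`, and returns a finite
    int or float on every sample state without raising.
    """
    namespace = {}
    try:
        # Untrusted code: run this in a sandbox in any real deployment.
        exec(compile(source, "<generated>", "exec"), namespace)
        func = namespace[func_name]
        for state in sample_states:
            value = func(state)
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return None
            if not math.isfinite(value):
                return None
    except Exception:
        return None
    return func


SAMPLES = [(1, 2, 3), ()]
ok = load_checked_heuristic(
    "def heuristic(state):\n    return len(state)\n", SAMPLES
)
bad = load_checked_heuristic(
    "def heuristic(state):\n    return state[0] / 0\n", SAMPLES
)
```

Here `ok` is a callable and `bad` is `None`, so the caller can discard the failing candidate or fall back to an unguided search.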

