Use a hierarchical graph of LLM 'thoughts' to improve retrieval and reduce hallucinations

Overview

Decision Snapshot: Ready For Pilot

HGOT shows practical gains on fact-heavy tasks by combining LLM planning with citation-aware voting and passage re-ranking, but it requires external search, tuning, and LLM costs.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yihao Fang, Stephen W. Thomas, Xiaodan Zhu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

HGOT reduces factual errors by structuring multi-step retrieval and scoring the evidence behind each answer. This can raise factual accuracy in verification and QA systems with modest engineering and tuning effort.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Engineering Lead

Summary TLDR

HGOT builds a multi-level dependency graph of sub-questions (a hierarchical "graph of thoughts") and uses retrieval plus citation-aware scoring to pick answers. It weights majority voting by the quality of each generated rationale (measured via citation precision/recall) and re-ranks retrieved passages by citation frequency, thought quality, and retrieval rank. On public factual QA benchmarks HGOT improves factual accuracy (notably FEVER) and matches top baselines on Open-SQuAD and HotPotQA when tuned.

Problem Statement

LLMs hallucinate when answering fact-heavy queries. Simple retrieval-augmented prompts can miss interdependent facts or rank weak evidence. The paper asks: can a structured, multi-level plan plus citation-aware scoring reduce hallucinations and boost factual accuracy in retrieval-augmented in-context learning?

Main Contribution

HGOT: a method that builds a hierarchical dependency graph of sub-questions using LLM planning and searches those sub-questions recursively.

Thought-quality weighting: modifies self-consistency majority voting to weight answers by citation precision and recall of their rationales.
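A minimal sketch of citation-weighted self-consistency voting follows. The summary says only that votes are weighted by the citation precision and recall of each rationale; combining the two as a harmonic mean is an assumption made here for illustration, and the function names and sample values are hypothetical rather than taken from the authors' code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def quality(precision: float, recall: float) -> float:
    """Harmonic mean of citation precision and recall (an assumed combination)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_vote(
    samples: List[Tuple[str, float, float]],
) -> Tuple[str, Dict[str, float]]:
    """Pick the answer with the highest total rationale quality.

    Each sample is (answer, citation_precision, citation_recall) for one
    sampled rationale.
    """
    scores: Dict[str, float] = defaultdict(float)
    for answer, precision, recall in samples:
        scores[answer] += quality(precision, recall)
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Hypothetical samples: two weakly cited votes against one well-cited vote.
samples = [
    ("SUPPORTS", 0.2, 0.2),
    ("SUPPORTS", 0.2, 0.2),
    ("REFUTES", 1.0, 1.0),
]
winner, scores = weighted_vote(samples)
print(winner, scores)
```

With these toy values an unweighted majority vote would pick SUPPORTS (two of three samples), while the weighted vote picks REFUTES because its single rationale is far better supported by its citations.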

Key Findings

HGOT increases FEVER exact-match (EM) accuracy versus baselines.

Numbers: HGOT+Sampling EM 61.50% vs Retrieve-then-Read 58.35% (Overall)

Practical Use: Expect ~3 percentage-point EM gains on fact verification tasks by adding hierarchical planning and citation weighting to retrieval-augmented prompts.

Evidence Ref: Table 2 overall FEVER

HGOT matches or slightly surpasses leading baselines on Open-SQuAD and HotPotQA when tuned.

Numbers: Open-SQuAD HGOT+KNN EM 24.10% vs Retrieve-then-Read 22.51%; HotPotQA HGOT+KNN EM 47.37% vs DSP 47.23%

Practical Use: On extractive QA and multi-hop QA, HGOT is competitive; tuning (KNN/demo selection and hyperparams) is important to match top systems.

Evidence Ref: Table 2 overall Open-SQuAD and HotPotQA

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
FEVER EM (Overall)	61.50%	Retrieve-then-Read 58.35%	+3.15 pp	FEVER overall	Table 2: HGOT+Sampling 61.50 vs Retrieve-then-Read 58.35	—
Open-SQuAD EM (Overall)	24.10%	Retrieve-then-Read 22.51%	+1.59 pp	Open-SQuAD overall	Table 2: HGOT+KNN 24.10 vs Retrieve-then-Read 22.51	—
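The deltas in the table can be checked directly from the reported EM scores. The short sketch below recomputes them and adds the HotPotQA margin from the Key Findings numbers, which the table does not list; the dictionary layout is only an illustration.

```python
# (HGOT EM, baseline EM) in percent, as reported in the Key Findings.
# FEVER and Open-SQuAD are compared with Retrieve-then-Read, HotPotQA with DSP.
results = {
    "FEVER": (61.50, 58.35),
    "Open-SQuAD": (24.10, 22.51),
    "HotPotQA": (47.37, 47.23),
}

# Difference in percentage points, rounded to two decimals.
deltas = {
    name: round(hgot - baseline, 2)
    for name, (hgot, baseline) in results.items()
}

for name, delta in deltas.items():
    print(f"{name}: +{delta:.2f} pp")
```

The HotPotQA margin is roughly a tenth of the Open-SQuAD margin, which is consistent with the "matches or slightly surpasses" wording used for that benchmark.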

What To Try In 7 Days

Run HGOT's Probe+Plan+Infer loop on a small FEVER-like subset to compare EM gains versus your current retrieval pipeline.

Weight your self-consistency votes by the citation precision and recall of each rationale, and measure how the top-K answers change.

Swap SerpApi for your production search and compare passage re-ranking with the paper's weighted citation score.
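For the re-ranking experiment, a rough starting point might look like the sketch below. The summary states that passages are re-ranked by citation frequency, thought quality, and retrieval rank, but not how these are combined; the additive score, the reciprocal-rank term, and the `rank_weight` parameter are all assumptions for illustration and are not the paper's formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    text: str
    retrieval_rank: int     # 1 = top result from the search engine
    cited_by: List[float]   # quality of each thought that cited this passage


def rerank(passages: List[Passage], rank_weight: float = 0.5) -> List[Passage]:
    """Order passages by an assumed citation-plus-rank score, best first."""

    def score(p: Passage) -> float:
        # Summing qualities rewards both how often a passage is cited
        # and how good the citing thoughts are.
        citation_score = sum(p.cited_by)
        # Reciprocal rank keeps some trust in the original search order.
        rank_score = 1.0 / p.retrieval_rank
        return citation_score + rank_weight * rank_score

    return sorted(passages, key=score, reverse=True)


# Hypothetical passages: the top search hit is never cited by any thought.
passages = [
    Passage("A", retrieval_rank=1, cited_by=[]),
    Passage("B", retrieval_rank=2, cited_by=[0.9, 0.8]),
    Passage("C", retrieval_rank=3, cited_by=[0.3]),
]
reranked = rerank(passages)
print([p.text for p in reranked])
```

Comparing this ordering against your production search ranking on the same queries gives a quick signal of whether citation evidence changes which passages reach the final prompt.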

Agent Features

Memory

retrieval-based context collected per sub-query

Planning

LLM emergent planning to split queries

divide-and-conquer sub-query plans

topological sorting of dependency graph

Tool Use

search engine retrieval (SerpApi/Google)

NLI model for citation detection

LM for planning and reasoning

Frameworks

DSP (framework used)

NetworkX for graph operations

Is Agentic

Yes

Architectures

hierarchical graph (multi-layer DAG of sub-queries)

dependency graph with topological traversal
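The topological traversal can be sketched as follows, assuming the plan is available as a mapping from each sub-question to the sub-questions it depends on. This shows a single layer only and uses the standard-library `graphlib` to stay self-contained (the paper's code is listed as using NetworkX); the plan contents and the `answer_fn` callback are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def answer_in_dependency_order(
    deps: Dict[str, List[str]],
    answer_fn: Callable[[str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Answer each sub-question only after the ones it depends on."""
    answers: Dict[str, str] = {}
    # static_order() yields every node after all of its predecessors and
    # raises graphlib.CycleError if the plan is not a DAG.
    for question in TopologicalSorter(deps).static_order():
        prior = {dep: answers[dep] for dep in deps.get(question, [])}
        answers[question] = answer_fn(question, prior)
    return answers


# Hypothetical two-step plan: the second question needs the first answer.
plan = {
    "Who directed the film?": [],
    "Where was that director born?": ["Who directed the film?"],
}


def toy_answer(question: str, prior: Dict[str, str]) -> str:
    # Stand-in for retrieval plus an LLM call.
    return f"answer to '{question}' using {len(prior)} prior answer(s)"


result = answer_in_dependency_order(plan, toy_answer)
for question, answer in result.items():
    print(question, "->", answer)
```

In a multi-layer setup, `answer_fn` could itself build and traverse a deeper plan for a sub-question, which is one way to read the recursive search described in the summary.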

Collaboration

weighted self-consistency majority voting

demonstration selection (balanced sampling / KNN)

Optimization Features

Token Efficiency

rewrite sub-queries to include prior answers to limit context growth
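One simple way to picture this rewriting step is placeholder substitution: a later sub-query refers to an earlier answer by key, and the key is replaced with the answer text instead of carrying the whole earlier context forward. The `{q1}` placeholder convention and the example values below are hypothetical; the summary does not describe how HGOT performs the rewrite.

```python
import re
from typing import Dict

# Hypothetical placeholder syntax: {q1}, {q2}, ... refer to earlier sub-queries.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def rewrite_sub_query(template: str, answers: Dict[str, str]) -> str:
    """Substitute known prior answers; leave unknown placeholders untouched."""
    return PLACEHOLDER.sub(
        lambda match: answers.get(match.group(1), match.group(0)),
        template,
    )


answers = {"q1": "Ada Lovelace"}
rewritten = rewrite_sub_query("Where was {q1} born?", answers)
print(rewritten)
```

The rewritten query is self-contained, so the next retrieval and LLM call need only the short answer rather than the passages that produced it.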

Inference Optimization

select top-K passages for the final prompt

early stopping of planning when steps are similar
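Both optimizations can be sketched in a few lines. The similarity measure (`difflib.SequenceMatcher`), the 0.9 threshold, the comparison against earlier steps, and the `max_steps` cap are assumptions chosen for illustration; the summary does not state how HGOT measures step similarity.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def top_k(scored_passages: List[Tuple[str, float]], k: int) -> List[str]:
    """Keep the k highest-scoring passages for the final prompt."""
    ranked = sorted(scored_passages, key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:k]]


def is_repeat(new_step: str, previous: List[str], threshold: float = 0.9) -> bool:
    """True if new_step is nearly identical to any earlier step."""
    return any(
        SequenceMatcher(None, new_step.lower(), old.lower()).ratio() >= threshold
        for old in previous
    )


def plan_until_repeat(candidates: List[str], max_steps: int = 5) -> List[str]:
    """Accept planned steps until one repeats an earlier step."""
    kept: List[str] = []
    for step in candidates[:max_steps]:
        if is_repeat(step, kept):
            break  # the planner has stopped producing new sub-questions
        kept.append(step)
    return kept


steps = plan_until_repeat([
    "Who wrote the novel?",
    "When was the author born?",
    "When was the author born ?",  # near-duplicate, triggers the stop
])
selected = top_k([("p1", 0.1), ("p2", 0.9), ("p3", 0.5)], k=2)
print(steps, selected)
```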

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/fangyihao/hgot

Data URLs

FEVER

Open-SQuAD

HotPotQA

Risks & Boundaries

Limitations

Evaluated only with ChatGPT (gpt-3.5-turbo-1106); other LMs such as Gemini or Llama 2 were not tested.

Retrieval limited to SerpApi Google Search; results may change with different search engines or domain sources.

When Not To Use

Low-latency or low-cost services where multiple retrieval steps and LM calls are infeasible.

Tasks that do not require external factual grounding or where a single passage suffices.

Failure Modes

Incorrect dependency graph from planning leads to wrong sub-queries and cascading errors.

Citation detection/NLI errors can mis-weight good evidence and promote wrong answers.

Core Entities

Models

ChatGPT (gpt-3.5-turbo-1106)

text-davinci-002 (for ReAct baseline)

NLI model (TRUE-style, Honovich et al. 2022)

Metrics

EM

F1

citation precision

citation recall

self-consistency confidence
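For readers reproducing the EM and F1 numbers, the sketch below follows the widely used SQuAD-style convention: lowercase the text, strip punctuation and English articles, then compare. Whether the paper's evaluation uses exactly this normalization is an assumption, and the example strings are made up.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = token_f1("Eiffel Tower in Paris", "Eiffel Tower")
print(em, round(f1, 3))
```

EM is all-or-nothing after normalization, while F1 gives partial credit when the prediction contains the gold answer plus extra tokens.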

Datasets

FEVER

Open-SQuAD

HotPotQA

Benchmarks

Exact Match (EM)

F1 score
