Use a hierarchical graph of LLM 'thoughts' to improve retrieval and reduce hallucinations

February 14, 2024 | 7 min

Overview

Decision Snapshot: Ready For Pilot

HGOT shows practical gains on fact-heavy tasks by combining LLM planning with citation-aware voting and passage re-ranking, but it depends on external search, needs tuning, and adds LLM costs.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yihao Fang, Stephen W. Thomas, Xiaodan Zhu

Why It Matters For Business

HGOT reduces fact errors by structuring multi-step retrieval and scoring evidence; this can raise factual accuracy in verification and QA systems with modest engineering and tuning.

Who Should Care

Summary TLDR

HGOT builds a multi-level dependency graph of sub-questions (a hierarchical "graph of thoughts") and uses retrieval plus citation-aware scoring to pick answers. It weights majority voting by the quality of each generated rationale (measured via citation precision/recall) and re-ranks retrieved passages by citation frequency, thought quality, and retrieval rank. On public factual QA benchmarks, HGOT improves factual accuracy (notably on FEVER) and, when tuned, matches top baselines on Open-SQuAD and HotPotQA.

Problem Statement

LLMs hallucinate when answering fact-heavy queries. Simple retrieval-augmented prompts can miss interdependent facts or rank weak evidence. The paper asks: can a structured, multi-level plan plus citation-aware scoring reduce hallucinations and boost factual accuracy in retrieval-augmented in-context learning?

Main Contribution

HGOT: a method that builds a hierarchical dependency graph of sub-questions using LLM planning and searches those sub-questions recursively.

Thought-quality weighting: modifies self-consistency majority voting to weight answers by citation precision and recall of their rationales.
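The thought-quality weighting above can be illustrated with a minimal sketch. It assumes each sampled rationale already has a citation precision and recall score, and it combines them with a harmonic mean; that combination, and the names `thought_quality` and `weighted_vote`, are illustrative assumptions rather than the paper's exact formula.

```python
from collections import defaultdict


def thought_quality(precision: float, recall: float) -> float:
    """Hypothetical quality score: harmonic mean of citation precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_vote(samples):
    """Pick the answer with the highest summed quality weight.

    samples: iterable of (answer, citation_precision, citation_recall).
    Returns (best_answer, share_of_total_weight).
    """
    scores = defaultdict(float)
    for answer, precision, recall in samples:
        scores[answer] += thought_quality(precision, recall)
    if not scores:
        raise ValueError("no samples to vote on")
    best = max(scores, key=scores.get)
    total = sum(scores.values())
    confidence = scores[best] / total if total else 0.0
    return best, confidence


# Two weakly cited samples say REFUTES, one well-cited sample says SUPPORTS.
samples = [
    ("SUPPORTS", 0.9, 0.8),
    ("REFUTES", 0.2, 0.3),
    ("REFUTES", 0.3, 0.2),
]
answer, confidence = weighted_vote(samples)
```

In this toy input, an unweighted majority vote would pick REFUTES (two votes to one), while the weighted vote favors the single well-cited SUPPORTS sample.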

Key Findings

HGOT increases FEVER exact-match (EM) accuracy versus baselines.

Numbers: HGOT+Sampling EM 61.50% vs Retrieve-then-Read 58.35% (Overall)

Practical Use: Expect ~3 percentage-point EM gains on fact verification tasks by adding hierarchical planning and citation weighting to retrieval-augmented prompts.

Evidence Ref: Table 2, overall FEVER

HGOT matches or slightly surpasses leading baselines on Open-SQuAD and HotPotQA when tuned.

Numbers: Open-SQuAD HGOT+KNN EM 24.10% vs Retrieve-then-Read 22.51%; HotPotQA HGOT+KNN EM 47.37% vs DSP 47.23%

Practical Use: On extractive QA and multi-hop QA, HGOT is competitive; tuning (KNN/demo selection and hyperparameters) is important to match top systems.

Evidence Ref: Table 2, overall Open-SQuAD and HotPotQA

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| FEVER EM (Overall) | 61.50% | Retrieve-then-Read 58.35% | +3.15 pp | FEVER overall | HGOT+Sampling 61.50 vs Retrieve-then-Read 58.35 | Table 2 |
| Open-SQuAD EM (Overall) | 24.10% | Retrieve-then-Read 22.51% | +1.59 pp | Open-SQuAD overall | HGOT+KNN 24.10 vs Retrieve-then-Read 22.51 | Table 2 |

What To Try In 7 Days

Run HGOT's Probe+Plan+Infer loop on a small FEVER-like subset to compare EM gains versus your current retrieval pipeline.

Weight your self-consistency votes by citation precision and recall, then measure how the top-K results change.

Swap SerpApi for your production search and compare passage re-ranking with the paper's weighted citation score.
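For the passage re-ranking experiment, a minimal sketch of a weighted score might look like the following. It combines the three signals the summary names (citation frequency, thought quality, retrieval rank); the `Passage` fields, the normalization, and the default weights are assumptions for illustration and would need tuning against the paper's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    retrieval_rank: int     # 1 = top hit from the search engine
    citation_count: int     # number of generated rationales that cited it
    thought_quality: float  # in [0, 1], quality of the thoughts citing it


def rerank(passages, w_cite=0.5, w_quality=0.3, w_rank=0.2, top_k=3):
    """Sort passages by a hypothetical weighted score and keep the top_k."""
    # Guard against division by zero when nothing has been cited yet.
    max_cites = max((p.citation_count for p in passages), default=0) or 1

    def score(p: Passage) -> float:
        cite = p.citation_count / max_cites   # normalized citation frequency
        rank = 1.0 / p.retrieval_rank         # reciprocal retrieval rank
        return w_cite * cite + w_quality * p.thought_quality + w_rank * rank

    return sorted(passages, key=score, reverse=True)[:top_k]


passages = [
    Passage("A", retrieval_rank=1, citation_count=0, thought_quality=0.0),
    Passage("B", retrieval_rank=3, citation_count=4, thought_quality=0.8),
    Passage("C", retrieval_rank=2, citation_count=2, thought_quality=0.5),
]
ordered = [p.text for p in rerank(passages)]
```

Here the top search hit ("A") drops to last place because no rationale cited it, which is the kind of shift worth inspecting when comparing against a production search ranking.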

Agent Features

Memory
retrieval-based context collected per sub-query
Planning
LLM emergent planning to split queries; divide-and-conquer sub-query plans; topological sorting of dependency graph
Tool Use
search engine retrieval (SerpApi/Google); NLI model for citation detection; LM for planning and reasoning
Frameworks
DSP (framework used); NetworkX for graph operations
Is Agentic

Yes

Architectures
hierarchical graph (multi-layer DAG of sub-queries); dependency graph with topological traversal
Collaboration
weighted self-consistency majority voting; demonstration selection (balanced sampling / KNN)
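The planning and architecture entries above (a dependency graph of sub-queries traversed in topological order) can be sketched as follows. This is a single-layer illustration using Python's standard-library `graphlib` in place of NetworkX so that it stays self-contained; the sub-query texts, the `solve` helper, and the `answer_fn` callback are hypothetical, and a full hierarchical version would recurse into each sub-query.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: q3 can only be answered once q1 and q2 are resolved.
sub_queries = {
    "q1": "What is entity X?",
    "q2": "What is entity Y?",
    "q3": "How are X and Y related?",
}
depends_on = {"q1": set(), "q2": set(), "q3": {"q1", "q2"}}


def answer_order(graph):
    """Return sub-query ids so that every dependency comes first.

    Raises graphlib.CycleError if the plan contains a cycle.
    """
    return list(TopologicalSorter(graph).static_order())


def solve(queries, graph, answer_fn):
    """Answer sub-queries in dependency order, passing earlier answers along."""
    answers = {}
    for qid in answer_order(graph):
        context = {dep: answers[dep] for dep in graph[qid]}
        answers[qid] = answer_fn(queries[qid], context)
    return answers


# Stand-in for retrieval plus an LLM call: records which answers it was given.
def fake_answer(question, context):
    return f"{question} | given {sorted(context)}"


order = answer_order(depends_on)
answers = solve(sub_queries, depends_on, fake_answer)
```

Because the traversal fails fast on cycles, a malformed plan surfaces as an explicit error rather than as silently wrong sub-queries.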

Optimization Features

Token Efficiency
rewrite sub-queries to include prior answers to limit context growth
Inference Optimization
select top-K passages for final prompt; early stopping of planning when steps are similar
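The token-efficiency idea above (rewriting a sub-query so it carries the prior answer instead of the full prior context) can be sketched as below. The `[q1]` placeholder syntax and the `rewrite_sub_query` helper are assumptions made for illustration; the source does not specify how sub-queries reference earlier answers.

```python
import re


def rewrite_sub_query(template: str, answers: dict) -> str:
    """Replace placeholders like [q1] with answers found in earlier steps.

    Placeholders without a known answer are left unchanged.
    """
    def substitute(match: re.Match) -> str:
        qid = match.group(1)
        return answers.get(qid, match.group(0))

    return re.sub(r"\[(\w+)\]", substitute, template)


answers = {"q1": "Ada Lovelace"}
rewritten = rewrite_sub_query("Where was [q1] born?", answers)
unresolved = rewrite_sub_query("Who mentored [q9]?", answers)
```

The rewritten query is a short standalone string, so the next retrieval and LM call does not need the earlier passages in its prompt.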

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

FEVER; Open-SQuAD; HotPotQA

Risks & Boundaries

Limitations

Evaluated only with ChatGPT (gpt-3.5-turbo-1106); other LMs like Gemini or Llama 2 not tested.

Retrieval limited to SerpApi Google Search; results may change with different search engines or domain sources.

When Not To Use

Low-latency or low-cost services where multiple retrieval steps and LM calls are infeasible.

Tasks that do not require external factual grounding or where a single passage suffices.

Failure Modes

Incorrect dependency graph from planning leads to wrong sub-queries and cascading errors.

Citation detection/NLI errors can mis-weight good evidence and promote wrong answers.

Core Entities

Models

ChatGPT (gpt-3.5-turbo-1106); text-davinci-002 (for ReAct baseline); NLI model (TRUE-style, Honovich et al. 2022)

Metrics

EM; F1; citation precision; citation recall; self-consistency confidence

Datasets

FEVER; Open-SQuAD; HotPotQA

Benchmarks

Exact Match (EM); F1 score
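For readers who want to reproduce the comparison on their own data, here is a minimal sketch of EM and token-level F1 using the normalization commonly applied in SQuAD-style evaluation (lowercasing, stripping punctuation and English articles). Whether the paper's evaluation uses exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(gold)


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = f1_score("Eiffel Tower in Paris", "Eiffel Tower")
```

Averaging `exact_match` over a dataset and multiplying by 100 gives an EM percentage comparable in form to the figures reported above.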