Use a hierarchical graph of LLM 'thoughts' to improve retrieval and reduce hallucinations

February 14, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Yihao Fang, Stephen W. Thomas, Xiaodan Zhu

Links

Abstract / PDF

Why It Matters For Business

HGOT reduces factual errors by structuring multi-step retrieval and scoring evidence; this can raise factual accuracy in verification and QA systems with modest engineering and tuning.

Summary TLDR

HGOT builds a multi-level dependency graph of sub-questions (a hierarchical "graph of thoughts") and uses retrieval plus citation-aware scoring to pick answers. It weights majority voting by the quality of each generated rationale (measured via citation precision/recall) and re-ranks retrieved passages by citation frequency, thought quality, and retrieval rank. On public factual QA benchmarks HGOT improves factual accuracy (notably FEVER) and matches top baselines on Open-SQuAD and HotPotQA when tuned.

Problem Statement

LLMs hallucinate when answering fact-heavy queries. Simple retrieval-augmented prompts can miss interdependent facts or rank weak evidence. The paper asks: can a structured, multi-level plan plus citation-aware scoring reduce hallucinations and boost factual accuracy in retrieval-augmented in-context learning?

Main Contribution

HGOT: a method that builds a hierarchical dependency graph of sub-questions using LLM planning and searches those sub-questions recursively.

Thought-quality weighting: modifies self-consistency majority voting to weight answers by citation precision and recall of their rationales.

Retrieval scoring: a passage score that combines citation frequency, thought citation quality, self-consistency confidence, and retrieval rank.

Extensive evaluation stratified by question length on FEVER, Open-SQuAD, and HotPotQA, including ablations and hyperparameter search.
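
The thought-quality weighting described above can be sketched as a weighted vote. This is a minimal illustration, not the paper's exact formula: the function name `weighted_vote`, the linear mix of precision and recall, and the weights `w_precision` and `w_recall` are all assumptions made for the example.

```python
from collections import defaultdict


def weighted_vote(samples, w_precision=0.5, w_recall=0.5):
    """Pick an answer by quality-weighted voting.

    samples: list of (answer, citation_precision, citation_recall) tuples,
    one per sampled rationale. The linear quality formula is an assumption.
    """
    totals = defaultdict(float)
    for answer, precision, recall in samples:
        quality = w_precision * precision + w_recall * recall
        totals[answer.strip().lower()] += quality
    if not totals:
        return None, 0.0
    best = max(totals, key=totals.get)
    total_weight = sum(totals.values())
    # Confidence is the winning answer's share of the total vote weight.
    confidence = totals[best] / total_weight if total_weight else 0.0
    return best, confidence


# Two weakly cited "REFUTES" rationales vs one well cited "SUPPORTS" rationale.
samples = [
    ("SUPPORTS", 0.9, 0.8),
    ("REFUTES", 0.2, 0.1),
    ("REFUTES", 0.3, 0.2),
]
answer, confidence = weighted_vote(samples)
print(answer, round(confidence, 2))
```

In this toy input an unweighted majority vote would pick "REFUTES" (two votes to one), while the quality-weighted vote favors the single well-cited rationale.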

Key Findings

HGOT increases FEVER exact-match (EM) accuracy versus baselines.

Numbers: HGOT+Sampling EM 61.50% vs Retrieve-then-Read 58.35% (Overall)

HGOT matches or slightly surpasses leading baselines on Open-SQuAD and HotPotQA when tuned.

Numbers: Open-SQuAD HGOT+KNN EM 24.10% vs Retrieve-then-Read 22.51%; HotPotQA HGOT+KNN EM 47.37% vs DSP 47.23%

Incorporating thought and retrieval quality improves performance in ablation tests.

Numbers: Open-SQuAD (medium): best EM 31.45% with α=0.2, β=0.4, γ=0.4 vs 28.30% with α=1, β=0, γ=0 (no thought quality)

Results

FEVER EM (Overall)

Value: 61.50%

Baseline: Retrieve-then-Read 58.35%

Open-SQuAD EM (Overall)

Value: 24.10%

Baseline: Retrieve-then-Read 22.51%

HotPotQA EM (Overall)

Value: 47.37%

Baseline: DSP 47.23%

Open-SQuAD EM (medium, ablation)

Value: 31.45%

Baseline: no thought quality, EM 28.30%

Who Should Care

What To Try In 7 Days

Run HGOT's Probe+Plan+Infer loop on a small FEVER-like subset to compare EM gains versus your current retrieval pipeline.

Weight your self-consistency votes by citation precision and recall, and measure how the top-K results change.

Swap SerpApi for your production search and compare passage re-ranking with the paper's weighted citation score.

Agent Features

Memory

  • retrieval-based context collected per sub-query

Planning

  • LLM emergent planning to split queries
  • divide-and-conquer sub-query plans
  • topological sorting of dependency graph
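
The planning items above can be illustrated with a toy dependency graph that is ordered before answering. This is a sketch under assumptions: the node names are hypothetical, and it uses Python's standard-library `graphlib` (Python 3.9+) rather than NetworkX so that it runs without extra dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical sub-question plan: each key maps to the sub-questions
# whose answers it depends on.
dependencies = {
    "q_final": {"q_director", "q_birth_year"},
    "q_birth_year": {"q_director"},
    "q_director": set(),
}

# static_order() yields every node after all of its dependencies,
# and raises graphlib.CycleError if the plan contains a cycle.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Answering sub-questions in this order guarantees that every prerequisite answer is available before a dependent sub-question is searched.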

Tool Use

  • search engine retrieval (SerpApi/Google)
  • NLI model for citation detection
  • LM for planning and reasoning

Frameworks

  • DSP (framework used)
  • NetworkX for graph operations

Is Agentic

true

Architectures

  • hierarchical graph (multi-layer DAG of sub-queries)
  • dependency graph with topological traversal

Collaboration

  • weighted self-consistency majority voting
  • demonstration selection (balanced sampling / KNN)

Optimization Features

Token Efficiency

  • rewrite sub-queries to include prior answers to limit context growth
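
One way to picture this rewriting step is simple placeholder substitution. The `{node_id}` placeholder syntax and the helper `rewrite_subquery` are hypothetical and chosen only for illustration; the summary does not say how the paper performs the rewrite.

```python
def rewrite_subquery(template, answers):
    """Replace {node_id} placeholders with answers found so far.

    Carrying only the short answer forward, instead of the full retrieved
    context, keeps the next prompt small. Unknown placeholders are left as-is.
    """
    rewritten = template
    for node_id, answer in answers.items():
        rewritten = rewritten.replace("{" + node_id + "}", answer)
    return rewritten


answers = {"q_director": "Christopher Nolan"}
rewritten = rewrite_subquery("In what year was {q_director} born?", answers)
print(rewritten)
```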

Inference Optimization

  • select top-K passages for final prompt
  • early stopping of planning when steps are similar
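
Top-K passage selection can be sketched with a combined passage score. This is an assumed form: the equal weights, the reciprocal-rank term, and the field names below are illustrative choices, not the paper's actual scoring function.

```python
def passage_score(passage, weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine four signals into one score (the linear form is an assumption).

    passage: dict with citation_freq, thought_quality, and confidence in
    [0, 1], plus a 1-based retrieval rank (1 = top search result).
    """
    w_freq, w_quality, w_conf, w_rank = weights
    rank_score = 1.0 / passage["rank"]  # reciprocal rank favors top results
    return (
        w_freq * passage["citation_freq"]
        + w_quality * passage["thought_quality"]
        + w_conf * passage["confidence"]
        + w_rank * rank_score
    )


def top_k(passages, k):
    """Return the k highest-scoring passages, best first."""
    return sorted(passages, key=passage_score, reverse=True)[:k]


passages = [
    {"id": "p1", "citation_freq": 0.2, "thought_quality": 0.5, "confidence": 0.6, "rank": 1},
    {"id": "p2", "citation_freq": 0.9, "thought_quality": 0.8, "confidence": 0.6, "rank": 3},
    {"id": "p3", "citation_freq": 0.1, "thought_quality": 0.2, "confidence": 0.6, "rank": 2},
]
selected = [p["id"] for p in top_k(passages, k=2)]
print(selected)
```

Here the frequently cited passage `p2` outranks the top search hit `p1`, which shows how citation signals can override raw retrieval rank.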

Reproducibility

Data Urls

  • FEVER
  • Open-SQuAD
  • HotPotQA

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluated only with ChatGPT (gpt-3.5-turbo-1106); other LMs like Gemini or Llama 2 not tested.
  • Retrieval limited to SerpApi Google Search; results may change with different search engines or domain sources.
  • Additional latency and API costs due to recursive planning, multiple retrievals, and NLI passes.

When Not To Use

  • Low-latency or low-cost services where multiple retrieval steps and LM calls are infeasible.
  • Tasks that do not require external factual grounding or where a single passage suffices.
  • Domains with no reliable external search index or NLI model for citation detection.

Failure Modes

  • Incorrect dependency graph from planning leads to wrong sub-queries and cascading errors.
  • Citation detection/NLI errors can mis-weight good evidence and promote wrong answers.
  • Over-reliance on web search ranking may surface high-ranked but irrelevant passages.

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo-1106)
  • text-davinci-002 (for ReAct baseline)
  • NLI model (TRUE-style; Honovich et al., 2022)

Metrics

  • EM
  • F1
  • citation precision
  • citation recall
  • self-consistency confidence
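
For reference, exact match is commonly computed after light answer normalization. The sketch below follows the widely used SQuAD-style convention (lowercase, strip punctuation, drop English articles, collapse whitespace); whether the paper uses exactly this normalization is an assumption.

```python
import re
import string


def normalize(text):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the prediction equals any gold answer after normalization."""
    return any(normalize(prediction) == normalize(gold) for gold in gold_answers)


print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
print(exact_match("Paris, France", ["Paris"]))
```

EM is strict: the second call fails because the extra word "France" survives normalization, which is why F1 is often reported alongside it.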

Datasets

  • FEVER
  • Open-SQuAD
  • HotPotQA

Benchmarks

  • Exact Match (EM)
  • F1 score