Build query-specific evidence graphs on the fly to fix missing links and filter distractor facts

January 12, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Manzong Huang, Chenyang Bu, Yi He, Xingrui Zhuo, Xindong Wu

Why It Matters For Business

Relink reduces multi-hop QA errors and increases robustness by building only the facts a query needs, cutting wrong reasoning and making answers easier to verify.

Summary TLDR

Relink replaces the usual build-then-reason pipeline over a static knowledge graph with a "reason-and-construct" flow that builds a compact, query-specific evidence graph. It combines a high-precision KG backbone with a high-recall pool of latent relations (derived from entity co-occurrence and PMI). A query-driven ranker (a coarse trainable ranker followed by an LLM re-ranker) iteratively selects edges; when needed, an LLM instantiates latent relations into factual triples. On five multi-hop QA benchmarks, Relink improves average EM by 5.4% and F1 by 5.2% over strong GraphRAG baselines and stays robust when most KG edges are removed.
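The iterative select-then-instantiate flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Edge`, `build_evidence_graph`, and the callables `coarse_score`, `llm_rerank`, and `llm_instantiate` are hypothetical names standing in for the trainable coarse ranker, the LLM re-ranker, and the LLM instantiation step, and the hop limit and cut-offs are arbitrary defaults.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Edge:
    head: str
    tail: str
    relation: Optional[str]  # None marks a latent, not yet instantiated relation


def build_evidence_graph(
    query: str,
    seed_entities: Iterable[str],
    kg_edges: list,      # high-precision KG backbone (list of Edge)
    latent_edges: list,  # high-recall latent relation pool (list of Edge)
    coarse_score: Callable[[str, Edge], float],        # hypothetical coarse ranker
    llm_rerank: Callable[[str, list], list],           # hypothetical LLM re-ranker
    llm_instantiate: Callable[[str, Edge], Optional[Edge]],  # hypothetical LLM call
    max_hops: int = 3,
    coarse_k: int = 20,
    keep_k: int = 3,
) -> list:
    """Grow a query-specific evidence graph hop by hop from the seed entities."""
    frontier = set(seed_entities)
    visited = set(frontier)
    evidence = []
    for _ in range(max_hops):
        # Candidate edges leave the current frontier, from either source.
        candidates = [
            e for e in kg_edges + latent_edges
            if e.head in frontier and e not in evidence
        ]
        if not candidates:
            break
        # Coarse ranking first, then the (more expensive) re-ranker on the top few.
        candidates.sort(key=lambda e: coarse_score(query, e), reverse=True)
        selected = llm_rerank(query, candidates[:coarse_k])[:keep_k]
        next_frontier = set()
        for edge in selected:
            if edge.relation is None:
                # Latent relation: ask the LLM for a concrete triple, or skip it.
                edge = llm_instantiate(query, edge)
                if edge is None:
                    continue
            evidence.append(edge)
            if edge.tail not in visited:
                visited.add(edge.tail)
                next_frontier.add(edge.tail)
        if not next_frontier:
            break
        frontier = next_frontier
    return evidence
```

In this sketch the evidence graph only ever contains edges that were selected for the current query, which is what keeps it compact; how Relink actually scores, prunes, and terminates is not specified in this summary.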

Problem Statement

GraphRAG methods rely on a static, pre-built knowledge graph. Static KGs are often incomplete and contain many facts that look query-relevant but are misleading. Missing links break multi-hop reasoning chains, and distractor facts get amplified during retrieval, so systems need a way to dynamically repair missing links and filter out misleading KG facts.

Main Contribution

Diagnose the limits of the build-then-reason paradigm: KG incompleteness and distractor facts break GraphRAG reasoning.

Propose Relink: a reason-and-construct framework that dynamically builds a compact, query-specific evidence graph from a factual KG plus a latent relation pool.

Design a unified query-aware ranking and LLM-based instantiation pipeline, and show consistent improvements across five open-domain QA (ODQA) benchmarks, supported by robustness experiments and ablations.

Key Findings

Relink yields consistent accuracy gains over leading GraphRAG baselines on five multi-hop QA datasets.

Numbers: avg +5.4% EM; avg +5.2% F1 across five benchmarks

On 2WikiMultiHopQA Relink achieves EM=0.628 and F1=0.722.

Numbers: 2WikiMultiHopQA EM 0.628, F1 0.722

Relink is robust when the explicit KG is heavily degraded: it retains high F1 even with most edges removed.

Numbers: With 90% of KG edges removed, Relink F1 = 0.669; the variant without R_c drops by 34.7% F1
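To run a similar stress test on your own graph, one option is to remove a fixed fraction of edges before retrieval. The sketch below assumes uniform random removal with a fixed seed; the exact degradation protocol used in the paper is not described in this summary, and `drop_edges` is a hypothetical helper name.

```python
import random


def drop_edges(edges, drop_fraction, seed=0):
    """Return a random subset of `edges` after removing `drop_fraction` of them.

    Assumption: edges are removed uniformly at random. The number kept is
    round(len(edges) * (1 - drop_fraction)); order is not preserved.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be between 0 and 1")
    edges = list(edges)
    rng = random.Random(seed)  # fixed seed keeps the degraded graph reproducible
    keep_count = round(len(edges) * (1.0 - drop_fraction))
    return rng.sample(edges, keep_count)
```

Evaluating the same question set on the full and degraded graphs, with and without the latent relation pool, gives a comparison in the spirit of the robustness numbers above.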

Each core component (explicit KG, latent relation pool, query-driven ranker, contrastive alignment) contributes measurably.

Numbers: Ablation: removing the ranker causes up to a 19.4% relative EM drop on HotpotQA; removing G_b causes a 12.9% EM drop

Results

EM (2WikiMultiHopQA)

Value: 0.628

Baseline: HippoRAG 0.578

EM (HotpotQA)

Value: 0.558

Baseline: HippoRAG 0.498

Average improvement

Value: EM +5.4%, F1 +5.2%

Baseline: Leading GraphRAG baselines (avg)
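For readers reproducing these numbers, EM and F1 for QA are commonly computed per question on normalized answer strings and then averaged. The sketch below follows the widely used SQuAD-style normalization (lowercasing, stripping punctuation and English articles); that the paper uses exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are the mean of these per-question values; when a question has several gold answers, a common convention is to take the maximum over them.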

Who Should Care

What To Try In 7 Days

Run a Relink-style pipeline on a small QA slice: add a PMI-based latent relation pool built from your corpus.

Train a lightweight coarse ranker to prioritize candidate edges along few-hop paths, and compare EM/F1 against your static KG baseline.

Use an LLM to instantiate top latent relations and inspect provenance for a handful of failing queries.
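As a starting point for the first step, a latent relation pool can be approximated from passage-level entity co-occurrence scored with pointwise mutual information, PMI(a, b) = log2(p(a, b) / (p(a) p(b))). The sketch below is an assumption-laden illustration: entity extraction is taken as already done, the passage is used as the co-occurrence window, and `pmi_relation_pool`, `min_count`, and `min_pmi` are hypothetical names and thresholds, not values from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_relation_pool(passages, min_count=2, min_pmi=0.0):
    """Score entity pairs by PMI over passage-level co-occurrence.

    `passages` is an iterable of entity collections, one per passage.
    Returns {(entity_a, entity_b): pmi} with each pair in sorted order.
    """
    num_passages = 0
    entity_counts = Counter()
    pair_counts = Counter()
    for entities in passages:
        unique = sorted(set(entities))  # count each entity once per passage
        num_passages += 1
        entity_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    pool = {}
    for (a, b), joint in pair_counts.items():
        if joint < min_count:
            continue  # rare pairs give unreliable PMI estimates
        # p(a,b) / (p(a) p(b)) simplifies to joint * N / (count_a * count_b)
        pmi = math.log2(joint * num_passages / (entity_counts[a] * entity_counts[b]))
        if pmi > min_pmi:
            pool[(a, b)] = pmi
    return pool
```

The pairs in the pool carry no relation label; they are only candidates that a ranker can select and an LLM can later instantiate into triples, ideally with the supporting passages kept alongside for provenance checks.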

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on LLM quality to instantiate latent relations; poor LLM outputs can introduce false facts.
  • Latent relation pool built from co-occurrence + PMI may surface spurious links without semantic filtering.
  • Runtime cost rises due to LLM re-ranking and on-the-fly instantiation compared with static KG methods.
  • Evaluation uses 500 sampled questions per dataset, which may limit variance estimates.
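
To get a feel for what a 500-question sample implies, a normal-approximation (Wald) interval on an EM score can be computed as below. This is a rough sketch that treats questions as independent and EM as a binomial proportion; it is not an analysis from the paper, and `em_confidence_interval` is a hypothetical helper.

```python
import math


def em_confidence_interval(em, n, z=1.96):
    """Approximate confidence interval for an EM score over n questions.

    Uses the normal approximation to the binomial; z=1.96 corresponds
    to a nominal 95% interval.
    """
    if n <= 0 or not 0.0 <= em <= 1.0:
        raise ValueError("need n > 0 and 0 <= em <= 1")
    standard_error = math.sqrt(em * (1.0 - em) / n)
    return em - z * standard_error, em + z * standard_error
```

For the reported 2WikiMultiHopQA EM of 0.628 with n = 500, this gives a half-width of roughly 0.04, which is worth keeping in mind when comparing systems that differ by a few points.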

When Not To Use

  • When strict, immutable provenance is required and generated relations are unacceptable.
  • In low-latency or low-cost environments where extra LLM calls are prohibitive.
  • When your corpus is too small for reliable co-occurrence statistics.

Failure Modes

  • LLM-instantiated relations hallucinate plausible but incorrect triples.
  • Ranker fails to distinguish useful vs. merely related facts, letting distractors through.
  • High computation and latency from repeated LLM scoring and generation.

Core Entities

Models

  • deepseek-v3-0324
  • gpt-4o-2024-07-06
  • RAPTOR
  • GraphRAG
  • HippoRAG
  • G-Retriever
  • ToG
  • Vanilla RAG

Metrics

  • EM
  • F1

Datasets

  • 2WikiMultiHopQA
  • HotpotQA
  • ConcurrentQA
  • MuSiQue-Ans
  • MuSiQue-Full

Benchmarks

  • 2WikiMultiHopQA
  • HotpotQA
  • ConcurrentQA
  • MuSiQue-Ans
  • MuSiQue-Full

Context Entities

Models

  • OpenAI text-embedding-3-small

Datasets

  • 2WikiMultiHopQA
  • HotpotQA