AF-Retriever: a hybrid, LLM-driven pipeline that combines graph constraints and vector search to improve multi-hop QA over semi-structured K

May 14, 2025

8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Derian Boer, Stephen Roth, Stefan Kramer

Links

Abstract / PDF

Why It Matters For Business

If your app uses mixed graph and text data, AF-Retriever gives large zero-shot retrieval gains and traceable answers without domain fine-tuning, saving dataset curation time while improving top-1 accuracy.

Summary TLDR

AF-Retriever is a modular retrieval pipeline for multi-hop QA over Semi-Structured Knowledge Bases (SKBs). It uses LLMs to predict answer types and translate questions into Cypher-like triplets, grounds those constraints via graph set intersections, expands candidate scope incrementally, supplements with vector similarity search (VSS), then reranks top candidates with an LLM. On the STaRK benchmarks (PRIME, MAG, AMAZON) it sets new zero-/one-shot state-of-the-art scores and provides ablations showing each step adds value. The code is public.
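The grounding step described above, where triplet constraints are resolved through graph set intersections, can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: the graph, the node names, the `ground` helper, and the `(relation, constant)` constraint format are all hypothetical stand-ins for what an LLM-produced Cypher-like query might reduce to.

```python
from collections import defaultdict

# Toy graph as (head, relation, tail) edges. All names are hypothetical.
EDGES = [
    ("paper_1", "written_by", "author_a"),
    ("paper_2", "written_by", "author_a"),
    ("paper_2", "has_topic", "topic_x"),
    ("paper_3", "has_topic", "topic_x"),
]
NODE_TYPE = {
    "paper_1": "paper", "paper_2": "paper", "paper_3": "paper",
    "author_a": "author", "topic_x": "topic",
}

# Index (relation, tail) -> set of heads so each constraint is one lookup.
INDEX = defaultdict(set)
for head, rel, tail in EDGES:
    INDEX[(rel, tail)].add(head)


def ground(answer_type, constraints):
    """Return nodes of answer_type that satisfy every (relation, constant) pair."""
    candidates = {n for n, t in NODE_TYPE.items() if t == answer_type}
    for rel, constant in constraints:
        # Each constraint narrows the candidate set by intersection.
        candidates &= INDEX.get((rel, constant), set())
    return candidates


result = ground("paper", [("written_by", "author_a"), ("has_topic", "topic_x")])
print(sorted(result))  # ['paper_2']
```

Each added constraint can only shrink the candidate set, which is why an over-specific or wrongly extracted triplet can leave it empty.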

Problem Statement

Real-world QA needs both structured relations (graphs/tables) and unstructured text. Existing methods use one or the other or mix components in isolation. The problem: how to reliably combine structural constraints and textual retrieval for multi-hop questions over SKBs without domain-specific fine-tuning.

Main Contribution

AF-Retriever: an eight-step, modular pipeline that fuses text and graph retrieval for SKB QA.

A practical text-to-Cypher extraction step that uses off-the-shelf LLMs to produce relational triplets.

A novel incremental scope-expansion procedure that balances specificity and sensitivity when grounding constants.

A hybrid two-strand retrieval design (graph-based grounding + VSS ensemble) and comparison of three LLM reranking strategies (pointwise/pairwise/listwise).

Extensive ablation and error analysis across three STaRK benchmarks and a public code release.

Key Findings

AF-Retriever substantially improves first-hit rates over previous zero-/one-shot methods on STaRK.

Numbers: average hit@1 increase over the second-best method is 32.1% (abstract); AF-Retriever hit@1 is 62.0% (Table 2, synthetic average).

LLM reranking materially raises top-1 accuracy after candidate retrieval.

Numbers: on MAG, hit@1 rises from 59.5% (steps 1–7) to 78.6% (full pipeline with reranker), a gain of 19.1 pp (Table 4).

The scope-expansion hyperparameter l_max must be tuned per domain.

Numbers: optimal l_max values found: PRIME = 100, MAG = 10, AMAZON = 1 (A.6.3, Table 14).
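To make the role of l_max concrete, here is one possible reading of incremental scope expansion. This is an assumption, not the paper's procedure: each constant mention has a ranked list of matching nodes, and the scope widens one candidate at a time until grounding yields answers or l_max is reached. The `expand_scope` function and the toy data are hypothetical.

```python
def expand_scope(ranked_candidates, neighbors_of, l_max):
    """Widen the scope for one constant until grounding is non-empty.

    ranked_candidates: node ids matching the mention, best match first.
    neighbors_of: maps a node id to the set of answer candidates linked to it.
    l_max: largest number of ranked candidates that may be admitted.
    Returns (scope size used, answer set).
    """
    answers = set()
    for size in range(1, l_max + 1):
        scope = ranked_candidates[:size]
        answers = set().union(*(neighbors_of.get(n, set()) for n in scope))
        if answers:
            # Stop at the smallest scope that grounds to something.
            return size, answers
    return min(l_max, len(ranked_candidates)), answers


# Hypothetical data: the best-matching node has no links in the graph.
ranked = ["node_exact", "node_close", "node_loose"]
links = {"node_close": {"answer_1"}, "node_loose": {"answer_2"}}

print(expand_scope(ranked, links, l_max=1))  # (1, set())
print(expand_scope(ranked, links, l_max=3))  # (2, {'answer_1'})
```

Under this reading, a small l_max keeps only the closest matches (higher specificity), while a large l_max admits weaker matches (higher sensitivity, more false positives), which is consistent with the per-domain optima reported above.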

Performance depends on the choice of LLM for Cypher extraction and reranking, especially for complex domains.

Numbers: hit@20 on PRIME varies across LLMs, e.g., GPT OSS 120B at 71.9% vs Llama 4 Scout at 58.0% (Table 11).

Results

hit@1 (avg synthetic STaRK)

Value: 62.0%

Baseline: previous best zero/one-shot

hit@20 (PRIME)

Value: 71.9%

Baseline: AF-Retriever ablations

Reranker effect (MAG hit@1)

Value: 78.6% (full pipeline)

Baseline: 59.5% (without reranker)

Who Should Care

What To Try In 7 Days

Clone the repo and run AF-Retriever on a small SKB to reproduce results.

Measure hit@1/hit@5 on your queries vs plain VSS or BM25.

Tune α (graph vs VSS weight) and l_max on a small validation set to balance precision and recall.
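For the α tuning step, a simple way to experiment is a convex combination of the two strands' scores. The summary does not give the exact fusion formula, so the weighting below is an assumption, and `fuse_scores` and the toy scores are hypothetical.

```python
def fuse_scores(graph_scores, vss_scores, alpha):
    """Rank node ids by alpha * graph score + (1 - alpha) * VSS score.

    Nodes missing from one strand get a score of 0.0 for that strand.
    Ties are broken by node id so the ordering is deterministic.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    ids = set(graph_scores) | set(vss_scores)
    fused = {
        i: alpha * graph_scores.get(i, 0.0) + (1 - alpha) * vss_scores.get(i, 0.0)
        for i in ids
    }
    return sorted(fused, key=lambda i: (-fused[i], i))


# Hypothetical scores: graph grounding is binary, VSS is a similarity in [0, 1].
graph = {"n1": 1.0, "n2": 1.0}
vss = {"n2": 0.4, "n3": 0.9}

print(fuse_scores(graph, vss, alpha=0.7))  # ['n2', 'n1', 'n3']
print(fuse_scores(graph, vss, alpha=0.0))  # ['n3', 'n2', 'n1']
```

Sweeping alpha over a handful of values on a validation set and keeping the one with the best hit@1 is usually enough for a first pass.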

Agent Features

Tool Use

  • LLMs for parsing and reranking
  • Vector similarity search
  • Cypher-style query formalization

Optimization Features

Token Efficiency

  • listwise reranker uses far fewer prompts and tokens than pairwise
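The prompt-count gap between reranking strategies is easy to estimate. The sketch below assumes exhaustive pairwise comparison (every unordered pair is judged once) and a listwise prompt that fits all candidates in a single call; the paper's actual schemes may differ, and `prompt_counts` is a hypothetical helper.

```python
from math import comb


def prompt_counts(n_candidates):
    """Estimate LLM calls needed to rerank n candidates under each strategy."""
    return {
        "pointwise": n_candidates,          # one call per candidate
        "pairwise": comb(n_candidates, 2),  # one call per unordered pair
        "listwise": 1 if n_candidates > 0 else 0,  # one call for the whole list
    }


print(prompt_counts(20))  # {'pointwise': 20, 'pairwise': 190, 'listwise': 1}
```

Pairwise cost grows quadratically with the candidate count, so the gap widens quickly as more candidates are passed to the reranker.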

Infra Optimization

  • LLM latency dominates; use vLLM or batching to reduce cost

System Optimization

  • α weight tuning to balance graph and vector strands
  • tune l_max to control candidate scope

Inference Optimization

  • precompute node embeddings
  • choose listwise vs pairwise reranker to trade tokens vs latency
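Precomputing node embeddings means the per-query work reduces to one query embedding plus dot products. A minimal pure-Python sketch follows; the `EmbeddingIndex` class is hypothetical, and the two-dimensional toy vectors stand in for real embeddings produced by whatever embedding model you use.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class EmbeddingIndex:
    def __init__(self, node_vectors):
        # Normalize once at build time so queries only need dot products.
        self.ids = list(node_vectors)
        self.unit = [normalize(node_vectors[i]) for i in self.ids]

    def top_k(self, query_vec, k):
        """Return the k node ids with the highest cosine similarity."""
        q = normalize(query_vec)
        scored = [
            (sum(a * b for a, b in zip(q, v)), node_id)
            for v, node_id in zip(self.unit, self.ids)
        ]
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [node_id for _, node_id in scored[:k]]


# Hypothetical toy embeddings.
vectors = {"n1": [1.0, 0.0], "n2": [0.0, 1.0], "n3": [1.0, 1.0]}
index = EmbeddingIndex(vectors)
print(index.top_k([2.0, 0.1], k=2))  # ['n1', 'n3']
```

For a real SKB, a vectorized or approximate nearest-neighbor index would replace the linear scan, but the build-once, query-many structure stays the same.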

Reproducibility

Data Urls

  • STaRK benchmarks (Wu et al., 2024b) - used in evaluation (referenced in paper)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Runs best with large LLMs; performance and stability depend on LLM choice, especially in complex domains.
  • LLM reranking is costly and often dominates runtime and monetary cost.
  • Requires a semi-structured KB that links nodes to text; not applicable to pure text or pure KG-only setups without adaptation.

When Not To Use

  • When you lack access to sufficiently capable LLMs or cannot bear LLM latency/cost.
  • If your data is purely unstructured text without node-to-document links.
  • When ultra-low latency (<100 ms) is required for each query.

Failure Modes

  • Incorrect Cypher generation by the LLM, producing wrong or missing triplets.
  • Missing graph links: necessary relations absent in SKB prevent grounding.
  • Scope-expansion tuned too large (high l_max) increases false positives.
  • Reranker misordering due to LLM hallucination or limited context window.

Core Entities

Models

  • GPT OSS 120B
  • GPT-5 mini
  • GPT OSS 20B
  • Ada-002
  • Multi-ada-002

Metrics

  • hit@1
  • hit@5
  • hit@20
  • recall@20
  • MRR
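The hit@k and MRR metrics listed above are straightforward to compute on your own queries. The sketch below uses the standard definitions; the `evaluate` helper and the toy runs are hypothetical, and recall@20 is omitted for brevity.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if any of the top-k ranked ids is relevant, else 0.0."""
    return float(any(doc in relevant for doc in ranked[:k]))


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k_values=(1, 5, 20)):
    """Average metrics over runs, a list of (ranked_ids, relevant_id_set) pairs."""
    if not runs:
        raise ValueError("runs must contain at least one query")
    n = len(runs)
    report = {
        f"hit@{k}": sum(hit_at_k(r, rel, k) for r, rel in runs) / n
        for k in k_values
    }
    report["mrr"] = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return report


# Two hypothetical queries: the first is answered at rank 1, the second at rank 3.
runs = [(["a", "b", "c"], {"a"}), (["x", "y", "z"], {"z"})]
report = evaluate(runs)
print(report["hit@1"], report["hit@5"])  # 0.5 1.0
```

Running the same `evaluate` call on a plain VSS or BM25 baseline gives a like-for-like comparison.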

Datasets

  • STaRK: PRIME
  • STaRK: MAG
  • STaRK: AMAZON

Benchmarks

  • STaRK