AF-Retriever: a hybrid, LLM-driven pipeline that combines graph constraints and vector search to improve multi-hop QA over semi-structured K

May 14, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Combining graph constraints and VSS then reranking with an LLM yields consistent gains; reasoning improvements are strongest in relation-rich SKBs and depend on LLM quality.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Derian Boer, Stephen Roth, Stefan Kramer

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your app uses mixed graph and text data, AF-Retriever gives large zero-shot retrieval gains and traceable answers without domain fine-tuning, saving dataset curation time while improving top-1 accuracy.

Who Should Care

Summary TLDR

AF-Retriever is a modular retrieval pipeline for multi-hop QA over Semi-Structured Knowledge Bases (SKBs). It uses LLMs to predict answer types and translate questions into Cypher-like triplets, grounds those constraints via graph set intersections, expands candidate scope incrementally, supplements with vector similarity search (VSS), then reranks top candidates with an LLM. On the STaRK benchmarks (PRIME, MAG, AMAZON) it sets new zero-/one-shot state-of-the-art scores and provides ablations showing each step adds value. The code is public.
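To make the grounding step concrete, here is a minimal sketch of intersecting graph neighbor sets for the constraints extracted from a question. The toy graph, the node names, and the `Constraint` and `ground` helpers are hypothetical illustrations, not the authors' code. Incremental scope expansion, VSS, and reranking are not modeled.

```python
from dataclasses import dataclass

# Toy graph (made-up data): (head, relation) -> set of tail nodes.
EDGES = {
    ("aspirin", "treats"): {"headache", "fever"},
    ("ibuprofen", "treats"): {"headache", "inflammation"},
    ("aspirin", "interacts_with"): {"warfarin"},
}
NODE_TYPE = {
    "headache": "disease",
    "fever": "disease",
    "inflammation": "disease",
    "warfarin": "drug",
}


@dataclass(frozen=True)
class Constraint:
    head: str      # known entity mentioned in the question
    relation: str  # relation linking the head to the unknown answer


def ground(constraints, answer_type):
    """Intersect the neighbor sets implied by each constraint, then filter by type."""
    candidates = None
    for c in constraints:
        neighbors = EDGES.get((c.head, c.relation), set())
        candidates = neighbors if candidates is None else candidates & neighbors
    candidates = candidates or set()
    return {n for n in candidates if NODE_TYPE.get(n) == answer_type}


# "Which disease is treated by both aspirin and ibuprofen?"
question_constraints = [
    Constraint("aspirin", "treats"),
    Constraint("ibuprofen", "treats"),
]
print(ground(question_constraints, "disease"))  # {'headache'}
```

Each constraint narrows the candidate set, so a single wrong triplet can empty it. This is presumably why the pipeline widens the scope step by step and backs the graph strand with VSS.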

Problem Statement

Real-world QA needs both structured relations (graphs/tables) and unstructured text. Existing methods rely on one or the other, or combine the two only as isolated components. The open problem is how to reliably combine structural constraints and textual retrieval for multi-hop questions over SKBs without domain-specific fine-tuning.

Main Contribution

AF-Retriever: an eight-step, modular pipeline that fuses text and graph retrieval for SKB QA.

A practical text-to-Cypher extraction step that uses off-the-shelf LLMs to produce relational triplets.

Key Findings

AF-Retriever substantially improves first-hit rates over previous zero-/one-shot methods on STaRK.

Numbers: Avg hit@1 increase vs second-best = 32.1% (abstract); AF-Retriever hit@1: 62.0% (Table 2 avg synthetic).

Practical Use: If you need high-precision candidate ranking on SKBs without fine-tuning, use a hybrid approach like AF-Retriever to get large gains in top-1 retrieval.

Evidence Ref: Abstract; Table 2

LLM reranking materially raises top-1 accuracy after candidate retrieval.

Numbers: Example: on MAG, hit@1 increases from 59.5% (steps 1–7) to 78.6% (full pipeline with reranker): +19.1 pp (Table 4).

Practical Use: Always include an LLM-based reranker (pairwise or listwise) when latency and cost permit; it converts broad recall into usable top answers.

Evidence Ref: Table 4
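A listwise reranking step could be wired in roughly as follows. This is a sketch under stated assumptions: `llm` is any callable that maps a prompt string to a reply string, and the prompt wording and the bracketed-index reply format are hypothetical choices, not taken from the paper.

```python
import re


def build_listwise_prompt(question, candidates):
    """Number the candidates and ask for an ordering (prompt wording is an assumption)."""
    lines = [f"[{i}] {text}" for i, text in enumerate(candidates)]
    return (
        f"Question: {question}\n"
        "Rank the candidates from most to least relevant.\n"
        "Answer with the bracketed indices only, e.g. [2] [0] [1].\n"
        + "\n".join(lines)
    )


def parse_ranking(reply, n):
    """Extract an ordering of range(n); indices the model skipped keep retrieval order."""
    order = []
    for match in re.findall(r"\[(\d+)\]", reply):
        i = int(match)
        if i < n and i not in order:  # drop out-of-range and repeated indices
            order.append(i)
    order += [i for i in range(n) if i not in order]
    return order


def rerank(question, candidates, llm):
    """Reorder retrieved candidates using one listwise LLM call."""
    reply = llm(build_listwise_prompt(question, candidates))
    return [candidates[i] for i in parse_ranking(reply, len(candidates))]


# Stub standing in for a real model call.
fake_llm = lambda prompt: "[2] [0]"
print(rerank("toy question", ["a", "b", "c"], fake_llm))  # ['c', 'a', 'b']
```

Parsing defensively matters because a malformed reply would otherwise drop candidates that retrieval had already found.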

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
hit@1 (avg synthetic STaRK) | 62.0% | previous best zero/one-shot | +32.1% (vs second-best avg) | averaged over PRIME, MAG, AMAZON (synthetic) | Table 2 averaged synthetic test sets | Table 2
hit@20 (PRIME) | 71.9% | AF-Retriever ablations | improves over steps 1–6; see Table 4 | PRIME (synthetic) | Tables 3 and 4 | Table 3; Table 4

What To Try In 7 Days

Clone the repo and run AF-Retriever on a small SKB to reproduce results.

Measure hit@1/hit@5 on your queries vs plain VSS or BM25.

Tune α (graph vs VSS weight) and l_max on a small validation set to balance precision and recall.
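The α tuning step can be prototyped with a small sweep. The sketch below assumes α acts as a convex weight between a graph-strand score and a VSS score for each candidate. That reading, the helper names, and the toy validation data are assumptions, and l_max is not modeled.

```python
def fuse(graph_scores, vss_scores, alpha):
    """Convex combination: alpha weights the graph strand, (1 - alpha) the VSS strand."""
    ids = set(graph_scores) | set(vss_scores)
    return {
        i: alpha * graph_scores.get(i, 0.0) + (1 - alpha) * vss_scores.get(i, 0.0)
        for i in ids
    }


def top1(scores):
    """Highest-scoring id; ties are broken by id so results are deterministic."""
    return min(scores, key=lambda i: (-scores[i], i))


def sweep_alpha(validation, alphas):
    """validation: list of (graph_scores, vss_scores, gold_id). Returns {alpha: hit@1}."""
    results = {}
    for alpha in alphas:
        hits = sum(top1(fuse(g, v, alpha)) == gold for g, v, gold in validation)
        results[alpha] = hits / len(validation)
    return results


# Made-up validation queries: one favors the graph strand, one favors VSS.
validation = [
    ({"x": 1.0, "y": 0.0}, {"x": 0.2, "y": 0.9}, "x"),
    ({"p": 0.4, "q": 0.6}, {"p": 0.9, "q": 0.1}, "p"),
]
print(sweep_alpha(validation, [0.0, 0.5, 1.0]))  # {0.0: 0.5, 0.5: 1.0, 1.0: 0.5}
```

In this toy set neither strand alone answers both queries, while a balanced α does. On real data the best value should be picked on held-out queries.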

Agent Features

Tool Use
LLMs for parsing and reranking; Vector similarity search; Cypher-style query formalization

Optimization Features

Token Efficiency
listwise reranker uses far fewer prompts and tokens than pairwise
Infra Optimization
LLM latency dominates; use vLLM or batching to reduce cost
System Optimization
α weight tuning to balance graph and vector strands; tune l_max to control candidate scope
Inference Optimization
precompute node embeddings; choose listwise vs pairwise reranker to trade tokens vs latency
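Precomputing node embeddings means only the query is embedded at request time. The sketch below illustrates the pattern with a toy bag-of-words embedding in place of a real embedding model. The `NodeIndex` class and the sample documents are hypothetical.

```python
import math


class NodeIndex:
    """Embeds every node document once at build time, then serves cosine search."""

    def __init__(self, docs):
        # Toy vocabulary built from the corpus; a real system would call an embedding model.
        vocab = sorted({t for text in docs.values() for t in text.lower().split()})
        self.slot = {token: i for i, token in enumerate(vocab)}
        self.vectors = {node_id: self.embed(text) for node_id, text in docs.items()}

    def embed(self, text):
        """L2-normalized bag-of-words vector; out-of-vocabulary tokens are ignored."""
        vec = [0.0] * len(self.slot)
        for token in text.lower().split():
            if token in self.slot:
                vec[self.slot[token]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    def search(self, query, k=5):
        """Return the k node ids with the highest cosine similarity to the query."""
        q = self.embed(query)  # the only embedding computed per request
        scored = [
            (sum(a * b for a, b in zip(q, v)), node_id)
            for node_id, v in self.vectors.items()
        ]
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [node_id for _, node_id in scored[:k]]


docs = {
    "n1": "graph neural networks for molecules",
    "n2": "wireless headphones with noise cancelling",
    "n3": "protein folding prediction",
}
index = NodeIndex(docs)
print(index.search("noise cancelling headphones", k=1))  # ['n2']
```

At SKB scale the linear scan would normally be replaced by an approximate nearest-neighbor index, but the build-once, query-many split stays the same.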

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

STaRK benchmarks (Wu et al., 2024b) - used in evaluation (referenced in paper)

Risks & Boundaries

Limitations

Runs best with large LLMs; performance and stability depend on LLM choice, especially in complex domains.

LLM reranking is costly and often dominates runtime and monetary cost.

When Not To Use

When you lack access to sufficiently capable LLMs or cannot bear LLM latency/cost.

If your data is purely unstructured text without node-to-document links.

Failure Modes

Incorrect Cypher generation by the LLM, producing wrong or missing triplets.

Missing graph links: necessary relations absent in SKB prevent grounding.
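One cheap guard against the first failure mode is to check LLM-proposed triplets against the graph schema before grounding them. The schema, the type names, and `validate_triplets` below are hypothetical. What to do with rejected triplets (for example, falling back to VSS only) is a policy choice the source does not describe.

```python
# Hypothetical schema: allowed (head_type, relation, tail_type) combinations.
SCHEMA = {
    ("drug", "treats", "disease"),
    ("drug", "interacts_with", "drug"),
}


def validate_triplets(triplets, schema=SCHEMA):
    """Split LLM-proposed triplets into schema-valid and rejected lists."""
    valid, rejected = [], []
    for triplet in triplets:
        triplet = tuple(triplet)
        (valid if triplet in schema else rejected).append(triplet)
    return valid, rejected


proposed = [
    ("drug", "treats", "disease"),
    ("disease", "treats", "drug"),  # direction flipped by the model
]
valid, rejected = validate_triplets(proposed)
print(valid)     # [('drug', 'treats', 'disease')]
print(rejected)  # [('disease', 'treats', 'drug')]
```

A schema check catches malformed or reversed relations. It cannot detect a well-formed triplet that misreads the question, nor relations missing from the SKB itself.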

Core Entities

Models

GPT OSS 120B, GPT-5 mini, GPT OSS 20B, Ada-002, Multi-ada-002

Metrics

hit@1, hit@5, hit@20, recall@20, mrr
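For comparing against a plain VSS or BM25 baseline on your own queries, these metrics can be computed as below. The sketch uses the standard definitions and assumes each query has a non-empty set of gold answer ids. The function names and sample runs are illustrative.

```python
def hit_at_k(ranked, gold, k):
    """1.0 if any gold id appears in the top k, else 0.0."""
    return float(any(r in gold for r in ranked[:k]))


def recall_at_k(ranked, gold, k):
    """Fraction of gold ids found in the top k (gold must be a non-empty set)."""
    return len(set(ranked[:k]) & gold) / len(gold)


def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold id, or 0.0 if none is retrieved."""
    for rank, r in enumerate(ranked, start=1):
        if r in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k_values=(1, 5, 20)):
    """runs: list of (ranked_ids, gold_id_set). Returns metrics averaged over queries."""
    n = len(runs)
    report = {f"hit@{k}": sum(hit_at_k(r, g, k) for r, g in runs) / n for k in k_values}
    report["recall@20"] = sum(recall_at_k(r, g, 20) for r, g in runs) / n
    report["mrr"] = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return report


# Made-up rankings for two queries.
runs = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"z", "w"}),
]
report = evaluate(runs)
print(report["hit@1"], report["hit@5"], report["recall@20"])  # 0.5 1.0 0.75
```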

Datasets

STaRK: PRIME, STaRK: MAG, STaRK: AMAZON

Benchmarks

STaRK