AF-Retriever: a hybrid, LLM-driven pipeline that combines graph constraints and vector search to improve multi-hop QA over semi-structured K

Overview

Decision Snapshot: Ready For Pilot

Combining graph constraints and VSS then reranking with an LLM yields consistent gains; reasoning improvements are strongest in relation-rich SKBs and depend on LLM quality.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Derian Boer, Stephen Roth, Stefan Kramer

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your app uses mixed graph and text data, AF-Retriever gives large zero-shot retrieval gains and traceable answers without domain fine-tuning, saving dataset curation time while improving top-1 accuracy.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist

Summary TLDR

AF-Retriever is a modular retrieval pipeline for multi-hop QA over Semi-Structured Knowledge Bases (SKBs). It uses LLMs to predict answer types and translate questions into Cypher-like triplets, grounds those constraints via graph set intersections, expands candidate scope incrementally, supplements with vector similarity search (VSS), then reranks top candidates with an LLM. On the STaRK benchmarks (PRIME, MAG, AMAZON) it sets new zero-/one-shot state-of-the-art scores and provides ablations showing each step adds value. The code is public.
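The grounding step described above can be illustrated with a minimal sketch. It assumes each Cypher-like constraint has already been reduced to a (relation, target) pair and that an inverted index maps each pair to the set of nodes satisfying it. The toy index, the function name `ground_constraints`, and the relaxation rule (dropping the last constraint when too few candidates survive) are all hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical constraint form: (relation, target entity).
Edge = Tuple[str, str]


def ground_constraints(index: Dict[Edge, Set[str]],
                       constraints: List[Edge],
                       min_candidates: int = 1) -> Set[str]:
    """Intersect the node sets of all constraints.

    If fewer than `min_candidates` nodes survive, drop the last constraint
    and retry. This relaxation rule is an assumption for illustration only.
    """
    active = list(constraints)
    while active:
        node_sets = [index.get(c, set()) for c in active]
        candidates = set.intersection(*node_sets)
        if len(candidates) >= min_candidates:
            return candidates
        active.pop()  # widen the scope by removing one constraint
    return set()


# Toy inverted index: which nodes have a given outgoing edge.
index = {
    ("treats", "asthma"): {"drug_a", "drug_b"},
    ("interacts_with", "gene_x"): {"drug_b", "drug_c"},
    ("side_effect", "nausea"): {"drug_d"},
}
constraints = [
    ("treats", "asthma"),
    ("interacts_with", "gene_x"),
    ("side_effect", "nausea"),
]

strict = ground_constraints(index, constraints[:2])   # both constraints hold
relaxed = ground_constraints(index, constraints)      # third one is dropped
print(strict, relaxed)
```

In a fuller pipeline the surviving candidates would then be supplemented by vector similarity search and passed to the reranker; those stages are outside this sketch.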

Problem Statement

Real-world QA needs both structured relations (graphs/tables) and unstructured text. Existing methods use one or the other or mix components in isolation. The problem: how to reliably combine structural constraints and textual retrieval for multi-hop questions over SKBs without domain-specific fine-tuning.

Main Contribution

AF-Retriever: an eight-step, modular pipeline that fuses text and graph retrieval for SKB QA.

A practical text-to-Cypher extraction step that uses off-the-shelf LLMs to produce relational triplets.

Key Findings

AF-Retriever substantially improves first-hit rates over previous zero-/one-shot methods on STaRK.

Numbers: Average hit@1 increase vs. second-best = 32.1% (abstract); AF-Retriever hit@1: 62.0% (Table 2, average over synthetic test sets).

Practical Use: If you need high-precision candidate ranking on SKBs without fine-tuning, use a hybrid approach like AF-Retriever to get large gains in top-1 retrieval.

Evidence Ref: Abstract; Table 2

LLM reranking materially raises top-1 accuracy after candidate retrieval.

Numbers: For example, on MAG, hit@1 increases from 59.5% (steps 1–7) to 78.6% (full pipeline with reranker): +19.1 pp (Table 4).

Practical Use: Always include an LLM-based reranker (pairwise or listwise) when latency and cost permit; it converts broad recall into usable top answers.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
hit@1 (avg synthetic STaRK)	62.0%	previous best zero/one-shot	+32.1% (vs second-best avg)	averaged over PRIME,MAG,AMAZON (synthetic)	Table 2 averaged synthetic test sets	Table 2
hit@20 (PRIME)	71.9%	AF-Retriever ablations	improves over steps 1–6; see Table 4	PRIME (synthetic)	Tables 3 and 4	Table 3; Table 4

What To Try In 7 Days

Clone the repo and run AF-Retriever on a small SKB to reproduce results.

Measure hit@1/hit@5 on your queries vs plain VSS or BM25.

Tune α (graph vs VSS weight) and l_max on a small validation set to balance precision and recall.
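The measurement and tuning steps above can be prototyped in a few lines. The sketch below assumes that α linearly interpolates between a graph-derived score and a VSS score; the source only describes α as a graph-vs-VSS weight, so this functional form is an assumption. The validation data, `fuse`, and `sweep_alpha` are hypothetical names made up for illustration.

```python
from typing import Dict, List, Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def fuse(graph_scores: Dict[str, float],
         vss_scores: Dict[str, float],
         alpha: float) -> List[str]:
    """Rank nodes by alpha * graph + (1 - alpha) * VSS (assumed form)."""
    nodes = set(graph_scores) | set(vss_scores)
    combined = {
        n: alpha * graph_scores.get(n, 0.0)
        + (1 - alpha) * vss_scores.get(n, 0.0)
        for n in nodes
    }
    # Sort by descending score; break ties by node id for determinism.
    return sorted(nodes, key=lambda n: (-combined[n], n))


def sweep_alpha(queries: List[dict], alphas: List[float], k: int = 1) -> Dict[float, float]:
    """Mean hit@k over the validation queries for each candidate alpha."""
    results = {}
    for a in alphas:
        hits = [hit_at_k(fuse(q["graph"], q["vss"], a), q["relevant"], k)
                for q in queries]
        results[a] = sum(hits) / len(hits)
    return results


# Tiny hypothetical validation set with per-node scores from both strands.
validation = [
    {"graph": {"n1": 1.0, "n2": 0.0}, "vss": {"n1": 0.2, "n2": 0.9},
     "relevant": {"n1"}},
    {"graph": {"n3": 1.0, "n4": 1.0}, "vss": {"n3": 0.1, "n4": 0.8},
     "relevant": {"n4"}},
]
scores = sweep_alpha(validation, [0.0, 0.5, 1.0])
best_alpha = max(scores, key=scores.get)
print(scores, best_alpha)
```

On this toy data neither strand alone ranks both answers first, while the mixed setting does, which is the kind of effect a small validation sweep is meant to surface. The same `hit_at_k` function can score a plain VSS or BM25 baseline for comparison.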

Agent Features

Tool Use

LLMs for parsing and reranking

Vector similarity search

Cypher-style query formalization

Optimization Features

Token Efficiency

listwise reranker uses far fewer prompts and tokens than pairwise
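A rough count shows where the prompt savings come from. The sketch assumes an all-pairs scheme for the pairwise reranker (one prompt per unordered pair) and non-overlapping windows for the listwise one (one prompt per window, with any merge step between windows left uncounted). Both schemes are simplifying assumptions; the source does not specify how either reranker batches its calls.

```python
import math


def pairwise_prompts(n: int) -> int:
    """All-pairs comparison: one prompt per unordered pair of candidates."""
    return n * (n - 1) // 2


def listwise_prompts(n: int, window: int) -> int:
    """One prompt per non-overlapping window of candidates (assumed scheme)."""
    return math.ceil(n / window)


for n in (10, 20, 50):
    print(n, pairwise_prompts(n), listwise_prompts(n, window=10))
```

Under these assumptions the pairwise count grows quadratically with the number of candidates while the listwise count grows linearly, so the gap widens quickly as the candidate list gets longer.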

Infra Optimization

LLM latency dominates; use vLLM or batching to reduce cost

System Optimization

α weight tuning to balance graph and vector strands

tune l_max to control candidate scope

Inference Optimization

precompute node embeddings

choose listwise vs pairwise reranker to trade tokens vs latency
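Precomputing node embeddings means that only the query has to be embedded at request time. A minimal sketch, using plain Python lists in place of a real embedding model and vector store: the class `EmbeddingIndex`, the node ids, and the two-dimensional vectors are hypothetical and only illustrate the idea of normalizing once and reusing the result.

```python
import math
from typing import Dict, List, Tuple


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class EmbeddingIndex:
    """Stores unit-length node vectors so a query needs only dot products."""

    def __init__(self, node_vectors: Dict[str, List[float]]):
        # Normalization happens once at build time, not per query.
        self._unit = {node: normalize(v) for node, v in node_vectors.items()}

    def search(self, query: List[float], top_k: int = 3) -> List[Tuple[str, float]]:
        q = normalize(query)
        scored = [(node, sum(a * b for a, b in zip(q, v)))
                  for node, v in self._unit.items()]
        # Highest cosine similarity first; ties broken by node id.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:top_k]


index = EmbeddingIndex({
    "paper_1": [1.0, 0.0],
    "paper_2": [0.0, 1.0],
    "paper_3": [1.0, 1.0],
})
top = index.search([2.0, 0.1], top_k=2)
print(top)
```

For a real SKB the linear scan would normally be replaced by an approximate nearest-neighbor index, but the build-once, query-many structure stays the same.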

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/kramerlab/AF-Retriever

Data URLs

STaRK benchmarks (Wu et al., 2024b) - used in evaluation (referenced in paper)

Risks & Boundaries

Limitations

Runs best with large LLMs; performance and stability depend on LLM choice, especially in complex domains.

LLM reranking is costly and often dominates runtime and monetary cost.

When Not To Use

When you lack access to sufficiently capable LLMs or cannot bear LLM latency/cost.

If your data is purely unstructured text without node-to-document links.

Failure Modes

Incorrect Cypher generation by the LLM, producing wrong or missing triplets.

Missing graph links: necessary relations absent in SKB prevent grounding.
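One way to catch the first failure mode early is to check LLM-produced triplets against the known graph schema before running any query. The sketch below is an assumption about how such a guard could look, not a description of what AF-Retriever does; the `Triplet` type, the schema entries, and the fallback flag are hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class Triplet(NamedTuple):
    head_type: str
    relation: str
    tail_type: str


def split_valid(triplets: List[Triplet],
                schema: Set[Triplet]) -> Tuple[List[Triplet], List[Triplet]]:
    """Separate triplets that match the schema from those that do not."""
    valid = [t for t in triplets if t in schema]
    invalid = [t for t in triplets if t not in schema]
    return valid, invalid


# Hypothetical schema: the (head type, relation, tail type) patterns the graph supports.
schema = {
    Triplet("drug", "treats", "disease"),
    Triplet("drug", "targets", "gene"),
}
# Hypothetical LLM output: one well-formed triplet and one hallucinated relation.
predicted = [
    Triplet("drug", "treats", "disease"),
    Triplet("disease", "cures", "drug"),
]

valid, invalid = split_valid(predicted, schema)
use_vss_only = not valid  # fall back to text retrieval if nothing can be grounded
print(valid, invalid, use_vss_only)
```

A schema check of this kind cannot detect the second failure mode, since a relation may be valid in the schema yet absent for the specific entities in question; that case only shows up as an empty result at grounding time.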

Core Entities

Models

GPT OSS 120B, GPT-5 mini, GPT OSS 20B, Ada-002, Multi-ada-002

Metrics

hit@1, hit@5, hit@20, recall@20, MRR

Datasets

STaRK: PRIME, STaRK: MAG, STaRK: AMAZON

Benchmarks

STaRK
