Overview
Combining graph constraints with VSS, then reranking with an LLM, yields consistent gains. Reasoning improvements are strongest in relation-rich SKBs and depend on LLM quality.
Citations: 0
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
If your app uses mixed graph and text data, AF-Retriever gives large zero-shot retrieval gains and traceable answers without domain fine-tuning, saving dataset curation time while improving top-1 accuracy.
Summary TLDR
AF-Retriever is a modular retrieval pipeline for multi-hop QA over Semi-Structured Knowledge Bases (SKBs). It uses LLMs to predict answer types and translate questions into Cypher-like triplets, grounds those constraints via graph set intersections, expands candidate scope incrementally, supplements with vector similarity search (VSS), then reranks top candidates with an LLM. On the STaRK benchmarks (PRIME, MAG, AMAZON) it sets new zero-/one-shot state-of-the-art scores and provides ablations showing each step adds value. The code is public.
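The grounding step can be pictured with a toy example. The sketch below is not the authors' implementation. It assumes a tiny in-memory graph stored as (head, relation, tail) edges, and the names `EDGES`, `neighbors`, and `ground` are hypothetical. Each constraint states that the answer is linked to a known entity by a given relation, and the candidate set is the intersection of the per-constraint neighbor sets.

```python
from functools import reduce

# Toy SKB edges as (head, relation, tail); purely illustrative data.
EDGES = [
    ("paper_1", "written_by", "author_a"),
    ("paper_2", "written_by", "author_a"),
    ("paper_1", "has_topic", "graph_retrieval"),
    ("paper_3", "has_topic", "graph_retrieval"),
]

def neighbors(relation, tail):
    """Return all heads linked to `tail` by `relation`."""
    return {h for h, r, t in EDGES if r == relation and t == tail}

def ground(constraints):
    """Intersect the neighbor sets of every (relation, tail) constraint."""
    sets = [neighbors(rel, tail) for rel, tail in constraints]
    if not sets:
        return set()
    return reduce(set.intersection, sets)

candidates = ground([("written_by", "author_a"), ("has_topic", "graph_retrieval")])
print(sorted(candidates))  # ['paper_1']
```

In a full pipeline, an empty or very small intersection would be the signal to relax constraints or fall back on VSS, which is the role the summary assigns to the scope-expansion and VSS steps.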
Problem Statement
Real-world QA needs both structured relations (graphs/tables) and unstructured text. Existing methods use one or the other or mix components in isolation. The problem: how to reliably combine structural constraints and textual retrieval for multi-hop questions over SKBs without domain-specific fine-tuning.
Main Contribution
AF-Retriever: an eight-step, modular pipeline that fuses text and graph retrieval for SKB QA.
A practical text-to-Cypher extraction step that uses off-the-shelf LLMs to produce relational triplets.
Key Findings
AF-Retriever substantially improves first-hit rates (hit@1) over previous zero-/one-shot methods on STaRK, averaging 62.0% across the synthetic test sets (Table 2).
LLM reranking materially raises top-1 accuracy after candidate retrieval.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| hit@1 (avg synthetic STaRK) | 62.0% | previous best zero-/one-shot | +32.1% (vs second-best avg) | averaged over PRIME, MAG, AMAZON (synthetic) | Table 2 averaged synthetic test sets | Table 2 |
| hit@20 (PRIME) | 71.9% | AF-Retriever ablations | improves over steps 1–6; see Table 4 | PRIME (synthetic) | Tables 3 and 4 | Table 3; Table 4 |
What To Try In 7 Days
Clone the repo and run AF-Retriever on a small SKB to reproduce results.
Measure hit@1/hit@5 on your queries vs plain VSS or BM25.
Tune α (graph vs VSS weight) and l_max on a small validation set to balance precision and recall.
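The measurement and tuning steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's evaluation code: `hit_at_k`, `mean_hit_at_k`, and `fuse` are hypothetical helpers, the validation data is made up, and the linear form `alpha * graph + (1 - alpha) * vss` is an assumed reading of "graph vs VSS weight" rather than the paper's definition. The l_max parameter is not modeled here.

```python
def hit_at_k(ranked_ids, gold_ids, k):
    """1.0 if any gold answer appears in the top-k ranked ids, else 0.0."""
    return 1.0 if any(doc in gold_ids for doc in ranked_ids[:k]) else 0.0

def mean_hit_at_k(runs, k):
    """Average hit@k over (ranked_ids, gold_ids) pairs."""
    return sum(hit_at_k(r, g, k) for r, g in runs) / len(runs)

def fuse(graph_scores, vss_scores, alpha):
    """Rank ids by alpha * graph score + (1 - alpha) * VSS score (assumed form)."""
    ids = set(graph_scores) | set(vss_scores)
    combined = {
        i: alpha * graph_scores.get(i, 0.0) + (1 - alpha) * vss_scores.get(i, 0.0)
        for i in ids
    }
    # Sort by descending score; break ties by id so the ranking is deterministic.
    return sorted(ids, key=lambda i: (-combined[i], i))

# Hypothetical validation set: (graph_scores, vss_scores, gold_ids) per query.
validation = [
    ({"a": 1.0, "b": 0.0}, {"a": 0.2, "b": 0.9}, {"a"}),
    ({"c": 0.0, "d": 1.0}, {"c": 0.8, "d": 0.3}, {"d"}),
    ({"e": 0.0, "f": 0.0}, {"e": 0.1, "f": 0.7}, {"f"}),
]

# Sweep alpha and report mean hit@1 for each setting.
for alpha in (0.0, 0.5, 1.0):
    runs = [(fuse(g, v, alpha), gold) for g, v, gold in validation]
    print(alpha, mean_hit_at_k(runs, k=1))
```

Setting `alpha=0.0` reduces the ranking to plain VSS, which gives the baseline comparison for free; swapping in BM25 scores for `vss_scores` would cover the other baseline.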
Risks & Boundaries
Limitations
Runs best with large LLMs; performance and stability depend on LLM choice, especially in complex domains.
LLM reranking is costly and often dominates runtime and monetary cost.
When Not To Use
When you lack access to sufficiently capable LLMs or cannot bear LLM latency/cost.
When your data is purely unstructured text without node-to-document links.
Failure Modes
Incorrect Cypher generation by the LLM, producing wrong or missing triplets.
Missing graph links: necessary relations absent in SKB prevent grounding.
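One cheap guard against the first failure mode is to check LLM-produced triplets against the graph schema before grounding. The sketch below is an assumption about how such a check could look, not something the summary attributes to AF-Retriever; `SCHEMA` and `check_triplets` are hypothetical names and the schema is illustrative. It catches relations or node types that do not exist, but it cannot detect triplets the LLM omitted or links missing from the SKB itself.

```python
# Hypothetical schema: allowed (head_type, relation, tail_type) patterns.
SCHEMA = {
    ("paper", "written_by", "author"),
    ("paper", "has_topic", "field"),
}

def check_triplets(triplets, schema=SCHEMA):
    """Split LLM-produced triplets into schema-valid and rejected lists."""
    valid, rejected = [], []
    for t in triplets:
        (valid if tuple(t) in schema else rejected).append(tuple(t))
    return valid, rejected

valid, rejected = check_triplets([
    ("paper", "written_by", "author"),
    ("paper", "cites_author", "author"),  # relation not in the schema
])
print(valid)
print(rejected)
```

Rejected triplets could be logged for inspection or sent back to the LLM for one retry, at the price of an extra call.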

