Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If your application queries data that mixes graph structure with free text, AF-Retriever reports large zero-shot retrieval gains and traceable answers without domain-specific fine-tuning. That saves dataset curation time while improving top-1 accuracy.
Summary TLDR
AF-Retriever is a modular retrieval pipeline for multi-hop QA over Semi-Structured Knowledge Bases (SKBs). It uses LLMs to predict answer types and translate questions into Cypher-like triplets, grounds those constraints via graph set intersections, expands candidate scope incrementally, supplements with vector similarity search (VSS), then reranks top candidates with an LLM. On the STaRK benchmarks (PRIME, MAG, AMAZON) it sets new zero-/one-shot state-of-the-art scores and provides ablations showing each step adds value. The code is public.
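The "grounding via graph set intersections" step can be pictured with a small sketch. This is an illustration only, not the released implementation: the toy edges, the `build_index` and `ground` helpers, and the (relation, constant) constraint format are all assumptions made for the example.

```python
from collections import defaultdict

# Toy graph as (head, relation, tail) edges. All node and relation
# names here are invented for illustration.
EDGES = [
    ("paper_1", "written_by", "author_a"),
    ("paper_1", "has_topic", "graph_retrieval"),
    ("paper_2", "written_by", "author_a"),
    ("paper_2", "has_topic", "speech"),
    ("paper_3", "written_by", "author_b"),
    ("paper_3", "has_topic", "graph_retrieval"),
]


def build_index(edges):
    """Map each (relation, tail) pair to the set of heads that satisfy it."""
    index = defaultdict(set)
    for head, relation, tail in edges:
        index[(relation, tail)].add(head)
    return index


def ground(constraints, index):
    """Return the nodes that satisfy every (relation, constant) constraint."""
    candidate_sets = [index.get(c, set()) for c in constraints]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


INDEX = build_index(EDGES)

# "Which paper by author_a is about graph retrieval?"
answers = ground(
    [("written_by", "author_a"), ("has_topic", "graph_retrieval")],
    INDEX,
)
print(answers)
```

Each constraint narrows the candidate set, so the result is traceable: every surviving node can be explained by the edges that matched.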
Problem Statement
Real-world QA needs both structured relations (graphs/tables) and unstructured text. Existing methods rely on one or the other, or combine the two only as isolated components. The problem is how to reliably combine structural constraints and textual retrieval for multi-hop questions over SKBs without domain-specific fine-tuning.
Main Contribution
AF-Retriever: an eight-step, modular pipeline that fuses text and graph retrieval for SKB QA.
A practical text-to-Cypher extraction step that uses off-the-shelf LLMs to produce relational triplets.
A novel incremental scope-expansion procedure that balances specificity and sensitivity when grounding constants.
A hybrid two-strand retrieval design (graph-based grounding + VSS ensemble) and comparison of three LLM reranking strategies (pointwise/pairwise/listwise).
Extensive ablation and error analysis across three STaRK benchmarks and a public code release.
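To make the scope-expansion idea concrete, here is a minimal sketch under stated assumptions. The summary does not spell out the procedure, so the logic below is a guess at its general shape: each constant in the question has a ranked list of matching nodes, and the scope (how many matches per constant are admitted) grows from 1 up to l_max until the intersection of linked candidates is non-empty. The function name `expand_scope` and the data layout are hypothetical.

```python
def expand_scope(ranked_matches, neighbors, l_max):
    """Widen the per-constant match scope until some candidate survives.

    ranked_matches: one ranked list of node IDs per question constant.
    neighbors: maps a node ID to the set of answer candidates linked to it.
    Returns (candidates, scope_used). Assumed logic, not the paper's code.
    """
    for scope in range(1, l_max + 1):
        per_constant = []
        for matches in ranked_matches:
            linked = set()
            for node in matches[:scope]:
                linked |= neighbors.get(node, set())
            per_constant.append(linked)
        candidates = set.intersection(*per_constant) if per_constant else set()
        if candidates:
            return candidates, scope
    return set(), l_max


# Hypothetical data: the best string match for the first constant is wrong.
ranked_matches = [["drug_x_typo", "drug_x"], ["disease_y"]]
neighbors = {
    "drug_x_typo": {"gene_1"},
    "drug_x": {"gene_2", "gene_3"},
    "disease_y": {"gene_2"},
}

print(expand_scope(ranked_matches, neighbors, l_max=3))
```

A small scope keeps grounding specific; a larger one recovers answers when the top match for a constant is wrong, at the price of more false positives.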
Key Findings
AF-Retriever substantially improves first-hit rates over previous zero-/one-shot methods on STaRK.
LLM reranking materially raises top-1 accuracy after candidate retrieval.
The scope-expansion hyperparameter l_max must be tuned per domain.
Performance depends on the choice of LLM for Cypher extraction and reranking, especially for complex domains.
Results
hit@1 (avg synthetic STaRK)
hit@20 (PRIME)
Reranker effect (MAG hit@1)
Who Should Care
What To Try In 7 Days
Clone the repo and run AF-Retriever on a small SKB to reproduce results.
Measure hit@1/hit@5 on your queries vs plain VSS or BM25.
Tune α (graph vs VSS weight) and l_max on a small validation set to balance precision and recall.
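For the measurement step, the sketch below computes hit@k and MRR from ranked result lists. The helper names and the `runs` format (a ranked list of IDs paired with a set of gold IDs) are assumptions for illustration; the same functions can score AF-Retriever, plain VSS, or BM25 output as long as each produces a ranked list.

```python
def hit_at_k(ranked, gold, k):
    """1.0 if any gold item appears in the top k results, else 0.0."""
    return 1.0 if any(item in gold for item in ranked[:k]) else 0.0


def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k_values=(1, 5)):
    """Average hit@k and MRR over a list of (ranked_ids, gold_id_set) pairs."""
    if not runs:
        raise ValueError("need at least one query to evaluate")
    n = len(runs)
    report = {
        f"hit@{k}": sum(hit_at_k(r, g, k) for r, g in runs) / n
        for k in k_values
    }
    report["mrr"] = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return report


# Three hypothetical queries: gold at rank 1, gold at rank 3, gold missing.
runs = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"z"}),
    (["p", "q"], {"r"}),
]
report = evaluate(runs)
print(report)
```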
Agent Features
Tool Use
- LLMs for parsing and reranking
- Vector similarity search
- Cypher-style query formalization
Optimization Features
Token Efficiency
- listwise reranker uses far fewer prompts and tokens than pairwise
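A rough count shows why the gap can be large. The sketch assumes a pairwise reranker that compares every unordered pair once and a listwise reranker that ranks a whole window of candidates per prompt; the actual prompt design in the paper may differ, and the window size of 20 is a made-up default. It also ignores any merge step needed when candidates span more than one window.

```python
from math import ceil, comb


def prompt_counts(n_candidates, window=20):
    """Estimate LLM calls per query for three reranking styles (assumed designs)."""
    return {
        "pointwise": n_candidates,                  # one call per candidate
        "pairwise": comb(n_candidates, 2),          # one call per unordered pair
        "listwise": ceil(n_candidates / window),    # one call per window
    }


counts = prompt_counts(20)
print(counts)
```

Pairwise calls grow quadratically with the candidate count, so trimming the candidate list before reranking matters most for that strategy.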
Infra Optimization
- LLM calls dominate latency; serve models with vLLM or batch requests to reduce runtime and cost
System Optimization
- α weight tuning to balance graph and vector strands
- tune l_max to control candidate scope
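One plausible reading of the α weight is a linear blend of the two strands. The sketch below is an assumption, not the paper's formula: a candidate gets a graph score of 1 if the grounding step returned it and 0 otherwise, and the final score is α times that plus (1 - α) times its VSS similarity. The function name `fuse` and the sample scores are hypothetical.

```python
def fuse(graph_hits, vss_scores, alpha):
    """Rank candidates by alpha * graph_score + (1 - alpha) * vss_score.

    graph_hits: set of node IDs returned by graph grounding.
    vss_scores: dict of node ID -> similarity in [0, 1].
    Ties are broken by node ID so the output is deterministic.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    candidates = set(graph_hits) | set(vss_scores)
    scored = {
        c: alpha * (1.0 if c in graph_hits else 0.0)
        + (1.0 - alpha) * vss_scores.get(c, 0.0)
        for c in candidates
    }
    return sorted(scored, key=lambda c: (-scored[c], c))


graph_hits = {"n1", "n2"}
vss_scores = {"n1": 0.2, "n2": 0.9, "n3": 0.95}

print(fuse(graph_hits, vss_scores, alpha=0.5))
```

Sweeping `alpha` over a small grid on a validation set and keeping the value with the best hit@1 is one simple way to tune it.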
Inference Optimization
- precompute node embeddings
- choose between listwise and pairwise reranking to trade token usage against latency
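Precomputing node embeddings moves the expensive work offline, leaving only one query embedding and a set of dot products at query time. The sketch below uses hand-written toy vectors in place of a real embedding model, and the helper names are hypothetical; a production setup would more likely use a vector index than a linear scan.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def precompute(node_vectors):
    """Offline step: normalize every node embedding once."""
    return {node: normalize(vec) for node, vec in node_vectors.items()}


def search(query_vec, index, top_k=2):
    """Online step: rank nodes by cosine similarity to the query."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), node)
        for node, vec in index.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [node for _, node in scored[:top_k]]


# Toy 2-D vectors standing in for real node embeddings.
index = precompute({"n1": [1.0, 0.0], "n2": [0.0, 2.0], "n3": [3.0, 3.0]})
print(search([1.0, 0.1], index))
```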
Reproducibility
Data Urls
- STaRK benchmarks (Wu et al., 2024b) - used in evaluation (referenced in paper)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Runs best with large LLMs; performance and stability depend on LLM choice, especially in complex domains.
- LLM reranking is costly and often dominates runtime and monetary cost.
- Requires a semi-structured KB that links nodes to text; not applicable to pure text or pure KG-only setups without adaptation.
When Not To Use
- When you lack access to sufficiently capable LLMs or cannot bear LLM latency/cost.
- If your data is purely unstructured text without node-to-document links.
- When every query must return with ultra-low latency (under 100 ms).
Failure Modes
- Incorrect Cypher generation by the LLM, producing wrong or missing triplets.
- Missing graph links: necessary relations absent in SKB prevent grounding.
- Scope expansion set too wide (high l_max), which admits more false positives.
- Reranker misordering due to LLM hallucination or limited context window.
Core Entities
Models
- GPT OSS 120B
- GPT-5 mini
- GPT OSS 20B
- Ada-002
- Multi-ada-002
Metrics
- hit@1
- hit@5
- hit@20
- recall@20
- MRR
Datasets
- STaRK: PRIME
- STaRK: MAG
- STaRK: AMAZON
Benchmarks
- STaRK

