Overview
The benchmark is well documented and evaluated across models and humans, and it gives robust evidence that mixed textual+relational queries are hard. Production use, however, involves latency and privacy trade-offs.
Citations: 4
Evidence Strength: 0.90
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 55%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Search and recommendation systems often need to reason over both product text and structured relationships; STARK shows many current retrievers miss important multi-hop or relational signals, so products relying on naive retrieval risk poor search quality or unsafe omissions.
Who Should Care
Summary TLDR
STARK is a new, large benchmark for retrieval on semi-structured knowledge bases (SKBs) that combine textual node documents with graph relations. The authors build three public SKBs (Amazon products, academic papers, biomedical PrimeKG), synthesize diverse multi-hop natural-language queries with an automatic pipeline, validate query quality with humans, add 263 human queries, and run a wide range of baselines. Results show that simple BM25 and multivector methods remain strong. LLM-based rerankers raise top-rank accuracy, but they still miss many relevant items and add substantial latency. STARK exposes clear gaps in current retrievers for mixed textual+relational search.
Problem Statement
Real user queries often mix free-form text and graph relations (e.g., “product by brand X that matches feature Y” or “papers from institution A on topic B”). Existing benchmarks study text or graphs separately. We lack a large, realistic testbed to measure how well retrieval systems, especially LLM-driven ones, handle both textual and relational requirements on large SKBs.
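To make the "product by brand X that matches feature Y" pattern concrete, here is a minimal sketch of a toy SKB and a mixed query over it. The `Node`, `ToySKB`, and `answer_mixed_query` names, the `has_brand` relation, and the sample products are all hypothetical illustrations, not STARK's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    node_type: str  # e.g. "product" or "brand" (hypothetical types)
    text: str       # free-form document attached to the node


@dataclass
class ToySKB:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: set = field(default_factory=set)    # (source_id, relation, target_id)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, source_id, relation, target_id):
        self.edges.add((source_id, relation, target_id))

    def neighbors(self, source_id, relation):
        # All targets reachable from source_id over one edge of this relation.
        return {t for (s, r, t) in self.edges if s == source_id and r == relation}


def answer_mixed_query(skb, brand_name, feature_keyword):
    """Products linked to brand_name (relational) that mention the keyword (textual)."""
    hits = []
    for node in skb.nodes.values():
        if node.node_type != "product":
            continue
        brands = {skb.nodes[b].text for b in skb.neighbors(node.node_id, "has_brand")}
        if brand_name in brands and feature_keyword.lower() in node.text.lower():
            hits.append(node.node_id)
    return sorted(hits)


skb = ToySKB()
skb.add_node(Node(0, "brand", "Acme"))
skb.add_node(Node(1, "brand", "Globex"))
skb.add_node(Node(2, "product", "Waterproof hiking boots with ankle support"))
skb.add_node(Node(3, "product", "Waterproof rain jacket"))
skb.add_node(Node(4, "product", "Leather office shoes"))
skb.add_edge(2, "has_brand", 0)
skb.add_edge(3, "has_brand", 1)
skb.add_edge(4, "has_brand", 0)

result = answer_mixed_query(skb, "Acme", "waterproof")
print(result)  # [2]
```

A text-only match on "waterproof" would also return product 3, and a brand-only filter would also return product 4. Only the combination of both constraints isolates product 2, which is the kind of entanglement the benchmark targets.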
Main Contribution
Three large semi-structured knowledge bases (SKBs), STARK-AMAZON, STARK-MAG, and STARK-PRIME, each combining node text and graph relations.
An automatic four-step pipeline to synthesize natural, role-specific multi-hop queries that entangle relational and textual constraints and filter ground-truth answers with LLM verification.
Key Findings
Classic sparse baseline (BM25) is still competitive and often outperforms small dense retrievers on STARK.
LLM rerankers (GPT-4 / Claude 3) improve top-rank accuracy but still miss many relevant items.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| BM25 Hit@1 (STARK-AMAZON, synthesized) | 44.94 | — | — | STARK-AMAZON (synth) | Table 6 full test | Table 6 |
| DPR Hit@1 (STARK-AMAZON, synthesized) | 15.29 | BM25 44.94 | -29.65 pp | STARK-AMAZON (synth) | Table 6 full test | Table 6 |
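As a reading aid for the table, the sketch below computes Hit@k under its usual definition (a query counts as a hit if at least one relevant item appears in the top k results) and reproduces the Delta column as a difference in percentage points. The function names and the toy rankings are hypothetical. Only the two Hit@1 values, 44.94 and 15.29, come from the table.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any of the top-k ranked ids is relevant, else 0.0."""
    return 1.0 if any(r in relevant_ids for r in ranked_ids[:k]) else 0.0


def mean_hit_at_k(runs, k):
    """Average Hit@k over (ranked_ids, relevant_ids) pairs, as a percentage."""
    return 100.0 * sum(hit_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Toy rankings (hypothetical): each entry is (ranked result ids, set of relevant ids).
runs = [
    ([3, 7, 1], {3}),     # relevant item at rank 1
    ([5, 2, 9], {2, 9}),  # first relevant item at rank 2
    ([4, 8, 6], {6}),     # relevant item at rank 3
    ([1, 0, 2], {1}),     # relevant item at rank 1
]

hit1 = mean_hit_at_k(runs, 1)  # 2 of 4 queries hit at rank 1
hit3 = mean_hit_at_k(runs, 3)  # all 4 queries hit within the top 3
print(hit1, hit3)              # 50.0 100.0

# Delta column: DPR Hit@1 minus BM25 Hit@1, in percentage points (pp).
bm25_hit1 = 44.94
dpr_hit1 = 15.29
delta_pp = round(dpr_hit1 - bm25_hit1, 2)
print(delta_pp)                # -29.65
```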
What To Try In 7 Days
Run BM25 and a multivector retriever baseline on your SKB; compare to any dense retriever.
Add an LLM reranker for top-k results and measure latency/cost vs precision gains.
Inspect failure cases on your domain-specific SKB and add manual rules or filters for relational constraints.
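The steps above can be prototyped without external dependencies. The sketch below is one possible starting point, assuming whitespace tokenization, a common BM25 variant with a non-negative IDF, and a hand-written brand filter as the "manual rule" for a relational constraint. The `BM25` class, the `brand_of` mapping, and the sample documents are hypothetical and are not the benchmark's own baseline code.

```python
import math
import time
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a first pass.
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in self.doc_tokens]
        self.avgdl = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in self.doc_tokens]
        df = Counter(term for t in self.doc_tokens for term in set(t))
        n = len(docs)
        # IDF with +1 inside the log so that scores never go negative.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        # Highest score first; ties broken by document index for determinism.
        return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = [
    "waterproof hiking boots with ankle support",
    "waterproof rain jacket with hood",
    "leather office shoes",
    "hiking socks wool blend",
]
# Hypothetical relational side information: document index -> brand.
brand_of = {0: "Acme", 1: "Globex", 2: "Acme", 3: "Globex"}

bm25 = BM25(docs)

start = time.perf_counter()
ranked = bm25.rank("waterproof hiking boots")
elapsed_ms = (time.perf_counter() - start) * 1000.0

# Manual rule: keep only candidates that satisfy the relational constraint.
filtered = [i for i in ranked if brand_of[i] == "Acme"]

print("top result:", ranked[0])
print("after brand filter:", filtered)
print(f"retrieval time: {elapsed_ms:.3f} ms")
```

Wrapping any later reranking stage in the same `time.perf_counter()` pattern gives a like-for-like latency number to set against its precision gain.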
Risks & Boundaries
Limitations
SKBs only cover textual and relational data; no images, audio, or other modalities.
Synthesized queries rely on LLMs for generation and filtering, so they may inherit model biases.
When Not To Use
If your retrieval problem is purely unstructured text without graph relations.
When ultra-low latency (<1s) is mandatory and you cannot afford reranker costs.
Failure Modes
Dense retrievers over- or under-emphasize repeated keywords and miss relational constraints.
LLM rerankers increase top precision but still have low recall and can be confidently wrong.

