Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.55
Citation Count
4
Why It Matters For Business
Search and recommendation systems often need to reason over both product text and structured relationships; STARK shows many current retrievers miss important multi-hop or relational signals, so products relying on naive retrieval risk poor search quality or unsafe omissions.
Summary TLDR
STARK is a new, large benchmark for retrieval on semi-structured knowledge bases (SKBs) that combine textual node documents with graph relations. The authors build three public SKBs (Amazon products, academic papers, biomedical PrimeKG), synthesize diverse multi-hop natural-language queries with an automatic pipeline, validate query quality with human evaluators, add 263 human-generated queries, and evaluate a broad set of baselines. Results show that simple BM25 and multivector methods remain strong; LLM-based rerankers raise top-rank accuracy but still miss many relevant items and add substantial latency. STARK exposes clear gaps in current retrievers for mixed textual and relational search.
Problem Statement
Real user queries often mix free-form text and graph relations (e.g., “product by brand X that matches feature Y” or “papers from institution A on topic B”). Existing benchmarks study text or graphs separately. We lack a large, realistic testbed to measure how well retrieval systems, especially LLM-driven ones, handle both textual and relational requirements on large SKBs.
Main Contribution
Three large semi-structured knowledge bases (SKBs): STARK-AMAZON, STARK-MAG, and STARK-PRIME, each combining node text and graph relations.
An automatic four-step pipeline to synthesize natural, role-specific multi-hop queries that entangle relational and textual constraints and filter ground-truth answers with LLM verification.
A human-validated set of 263 human-generated queries and human evaluation showing high naturalness/diversity/practicality.
Comprehensive baseline evaluation across sparse, dense, and multivector retrievers and LLM rerankers, plus latency measurements and analysis.
Key Findings
Classic sparse baseline (BM25) is still competitive and often outperforms small dense retrievers on STARK.
LLM rerankers (GPT‑4 / Claude3) improve top-rank accuracy but still miss many relevant items.
Even the best systems leave substantial gaps in recall and top-rank accuracy, especially on the biomedical SKB (STARK-PRIME).
LLM rerankers add large latency compared with compact retrievers.
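The latency finding can be checked on your own stack by timing the rerank step separately from first-stage retrieval. The sketch below is illustrative only and is not the paper's evaluation code: `rerank_top_k` and `overlap_score` are hypothetical names, and `overlap_score` is a cheap word-overlap stand-in for where an LLM call would go.

```python
import time
from typing import Callable, List, Sequence, Tuple


def rerank_top_k(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    k: int = 20,
) -> Tuple[List[str], float]:
    """Rerank only the first k candidates; return (new order, seconds spent scoring)."""
    head, tail = list(candidates[:k]), list(candidates[k:])
    start = time.perf_counter()
    # One scoring call per candidate; with an LLM this is the expensive part.
    scored = [(score_fn(query, text), idx, text) for idx, text in enumerate(head)]
    elapsed = time.perf_counter() - start
    # Sort by score descending; ties keep the first-stage order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored] + tail, elapsed


def overlap_score(query: str, doc_text: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc_text.lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)


if __name__ == "__main__":
    first_stage = ["red shoes", "blue waterproof boots", "waterproof hiking boots"]
    order, seconds = rerank_top_k("waterproof hiking boots", first_stage, overlap_score, k=3)
    print(order, f"{seconds:.6f}s")
```

Swapping `overlap_score` for a real model call leaves the timing logic unchanged, so the same wrapper can report per-query rerank latency alongside accuracy.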
Results
BM25 Hit@1 (STARK-AMAZON, synthesized)
DPR Hit@1 (STARK-AMAZON, synthesized)
GPT‑4 Reranker Recall@20 (STARK-AMAZON, synthesized)
Claude3 Reranker Hit@1 (STARK-AMAZON, human-generated)
Latency (avg) of GPT‑4/Claude3 rerankers
Who Should Care
What To Try In 7 Days
Run BM25 and a multivector retriever baseline on your SKB; compare to any dense retriever.
Add an LLM reranker for top-k results and measure latency/cost vs precision gains.
Inspect failure cases on your domain-specific SKB and add manual rules or filters for relational constraints.
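As a starting point for the first and third steps, here is a minimal sketch of an Okapi BM25 scorer with an optional relational pre-filter. It is a toy, dependency-free implementation and not the one used in the paper; the product IDs, the brand names, and the `allowed` parameter are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Optional


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class BM25:
    """Toy Okapi BM25 over a dict of {doc_id: text}."""

    def __init__(self, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.length = {doc_id: sum(c.values()) for doc_id, c in self.tf.items()}
        self.avg_length = sum(self.length.values()) / len(self.length)
        # Document frequency: number of documents containing each term.
        df = Counter(term for counts in self.tf.values() for term in counts)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def score(self, query: str, doc_id: str) -> float:
        counts = self.tf[doc_id]
        norm = self.k1 * (1 - self.b + self.b * self.length[doc_id] / self.avg_length)
        return sum(
            self.idf[t] * counts[t] * (self.k1 + 1) / (counts[t] + norm)
            for t in tokenize(query)
            if t in counts
        )

    def search(self, query: str, k: int = 10,
               allowed: Optional[Iterable[str]] = None) -> List[str]:
        """Rank documents; if `allowed` is given, only those ids are considered."""
        candidates = self.tf.keys() if allowed is None else allowed
        scored = [(self.score(query, doc_id), doc_id) for doc_id in candidates]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for s, doc_id in scored[:k] if s > 0]


if __name__ == "__main__":
    # Hypothetical mini-SKB: node text plus one relation (product -> brand).
    docs = {
        "p1": "waterproof hiking boots with ankle support",
        "p2": "waterproof rain jacket lightweight",
        "p3": "leather hiking boots classic style",
    }
    brand_of = {"p1": "BrandX", "p2": "BrandX", "p3": "BrandY"}
    index = BM25(docs)
    print(index.search("waterproof hiking boots"))
    # Relational constraint applied as a hard filter before text ranking.
    by_brand_y = {d for d, brand in brand_of.items() if brand == "BrandY"}
    print(index.search("waterproof hiking boots", allowed=by_brand_y))
```

A hard filter like this is the simplest way to enforce a relation the text ranker cannot see. It only helps when the relation can be extracted from the query reliably, so it is worth comparing against the unfiltered ranking on your failure cases.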
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- SKBs only cover textual and relational data; no images, audio, or other modalities.
- Synthesized queries rely on LLMs for generation and filtering, which may inherit model biases.
- Human-generated queries are limited (263 total) and may not cover all real-world linguistic diversity.
When Not To Use
- If your retrieval problem is purely unstructured text without graph relations.
- When ultra-low latency (<1s) is mandatory and you cannot afford reranker costs.
- For domains where private data cannot be shared; even anonymized public sources may not match private SKBs.
Failure Modes
- Dense retrievers over- or under-emphasize repeated keywords and miss relational constraints.
- LLM rerankers increase top precision but still have low recall and can be confidently wrong.
- Synthesized queries can miss idiomatic or emerging language patterns that appear in real-world user queries.
Core Entities
Models
- BM25
- DPR
- ANCE
- QAGNN
- text-embedding-ada-002
- voyage-l2-instruct
- LLM2Vec
- GritLM-7b
- multi-ada-002
- ColBERTv2
- GPT-4 (gpt-4-1106-preview)
- Claude3 (claude-3-opus)
Metrics
- Hit@1
- Hit@5
- Recall@20
- MRR
- Latency (s)
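
For readers unfamiliar with the ranking metrics listed above, the following sketch computes them using their standard definitions. It assumes each query has a ranked list of retrieved node IDs and a set of ground-truth IDs; the function names and the `runs`/`qrels` layout are illustrative choices, and the paper's exact evaluation code may differ.

```python
from typing import Dict, List, Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one of the top-k results is relevant, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result; 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> Dict[str, float]:
    """Average each metric over all queries in `runs`."""
    totals = {"Hit@1": 0.0, "Hit@5": 0.0, "Recall@20": 0.0, "MRR": 0.0}
    for query_id, ranked in runs.items():
        relevant = qrels.get(query_id, set())
        totals["Hit@1"] += hit_at_k(ranked, relevant, 1)
        totals["Hit@5"] += hit_at_k(ranked, relevant, 5)
        totals["Recall@20"] += recall_at_k(ranked, relevant, 20)
        totals["MRR"] += reciprocal_rank(ranked, relevant)
    n = max(len(runs), 1)
    return {name: total / n for name, total in totals.items()}


if __name__ == "__main__":
    # Hypothetical toy data: two queries with made-up node ids.
    runs = {"q1": ["p3", "p1", "p7", "p2"], "q2": ["p9"]}
    qrels = {"q1": {"p1", "p2"}, "q2": {"p9"}}
    print(evaluate(runs, qrels))
```

Hit@k rewards getting any one correct answer near the top, while Recall@20 measures how many of the correct answers are found at all, which is why a reranker can raise the former without closing the gap on the latter.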
Datasets
- STARK-AMAZON
- STARK-MAG
- STARK-PRIME
- Amazon Product Reviews
- Amazon Q&A
- ogbn-mag / ogbn-papers100M
- PrimeKG
Benchmarks
- STARK

