STARK: a large benchmark testing LLM-based retrieval on semi-structured knowledge (text + graph)

Overview

Decision Snapshot: Ready For Pilot

The benchmark is well documented and evaluated across models and humans, and it provides strong evidence that mixed textual and relational queries are hard. Production use, however, requires latency and privacy trade-offs.

Citations: 4

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 55%

Production readiness: 60%

Novelty: 70%

Authors

Shirley Wu, Shiyu Zhao, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N. Ioannidis, Karthik Subbian, James Zou, Jure Leskovec

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Search and recommendation systems often need to reason over both product text and structured relationships; STARK shows many current retrievers miss important multi-hop or relational signals, so products relying on naive retrieval risk poor search quality or unsafe omissions.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

STARK is a new, large benchmark for retrieval on semi-structured knowledge bases (SKBs) that combine textual node documents with graph relations. The authors build three public SKBs (Amazon products, academic papers, biomedical PrimeKG), synthesize diverse multi-hop natural-language queries with an automatic pipeline, validate query quality with humans, add 263 human queries, and run a wide range of baselines. Results show that simple BM25 and multivector methods remain strong, while LLM-based rerankers raise top-rank accuracy but are still far from complete and add substantial latency. STARK exposes clear gaps in current retrievers for mixed textual and relational search.

Problem Statement

Real user queries often mix free-form text and graph relations (e.g., “product by brand X that matches feature Y” or “papers from institution A on topic B”). Existing benchmarks study text or graphs separately. We lack a large, realistic testbed to measure how well retrieval systems — especially LLM-driven ones — handle both textual and relational requirements on large SKBs.
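To make the idea of a query that mixes textual and relational requirements concrete, here is a minimal sketch of a toy SKB in Python. The node contents, the `has_brand` relation name, and the `retrieve` helper are all hypothetical and chosen only for illustration; they are not taken from STARK's data or code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_id: int
    node_type: str
    text: str  # free-form text attached to the node
    # Outgoing edges: relation name -> list of target node ids
    edges: Dict[str, List[int]] = field(default_factory=dict)


# Hypothetical toy SKB: two brands and three products.
skb = {
    0: Node(0, "brand", "Acme"),
    1: Node(1, "brand", "Zenith"),
    2: Node(2, "product", "Lightweight waterproof hiking boots", {"has_brand": [0]}),
    3: Node(3, "product", "Waterproof phone case", {"has_brand": [1]}),
    4: Node(4, "product", "Leather office shoes", {"has_brand": [0]}),
}


def retrieve(skb, relation, target_text, keywords):
    """Return ids of nodes that satisfy BOTH constraints:
    relational: the node has a `relation` edge to a node whose text is `target_text`
    textual:    every keyword occurs in the node's own text
    """
    hits = []
    for node in skb.values():
        targets = node.edges.get(relation, [])
        relational_ok = any(skb[t].text == target_text for t in targets)
        textual_ok = all(k.lower() in node.text.lower() for k in keywords)
        if relational_ok and textual_ok:
            hits.append(node.node_id)
    return hits


# Text-only matching ignores the brand constraint and returns an extra product.
text_only = [n.node_id for n in skb.values() if "waterproof" in n.text.lower()]
# "Waterproof product by brand Acme" needs the relation as well.
both = retrieve(skb, "has_brand", "Acme", ["waterproof"])

print(text_only)
print(both)
```

In this toy case the text-only match returns two products while the combined query returns one, which is the kind of gap a benchmark on mixed queries is meant to expose.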

Main Contribution

Three large semi-structured knowledge bases (SKBs), STARK-AMAZON, STARK-MAG, and STARK-PRIME, each combining node text and graph relations.

An automatic four-step pipeline to synthesize natural, role-specific multi-hop queries that entangle relational and textual constraints and filter ground-truth answers with LLM verification.

Key Findings

Classic sparse baseline (BM25) is still competitive and often outperforms small dense retrievers on STARK.

Numbers: STARK-AMAZON (synth): BM25 Hit@1 44.94 vs DPR Hit@1 15.29 (Table 6)

Practical Use: Start with BM25 or tuned sparse methods before investing in dense retriever training for large SKBs; dense models may need larger capacity or different training to beat sparse baselines.

Evidence Ref: Table 6

LLM rerankers (GPT-4 / Claude3) improve top-rank accuracy but still miss many relevant items.

Numbers: Synthesized STARK-AMAZON: GPT-4 Reranker Hit@1 44.79, R@20 55.35; STARK-PRIME R@20 34.05 (Table 6)

Practical Use: Use LLM reranking to lift precision at top results, but expect incomplete recall and budget for substantial compute and latency costs.

Evidence Ref: Table 6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BM25 Hit@1 (STARK-AMAZON, synthesized)	44.94	—	—	STARK-AMAZON (synth)	Table 6 full test	Table 6
DPR Hit@1 (STARK-AMAZON, synthesized)	15.29	BM25 44.94	-29.65 pp	STARK-AMAZON (synth)	Table 6 full test	Table 6
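
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how Hit@k, Recall@k, and reciprocal rank are commonly computed for one query, plus the percentage-point delta from the table. These are the standard textbook definitions; the function names and the example ranking are illustrative assumptions, and STARK's own evaluation code may differ in details.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking for a single query with three ground-truth answers.
ranked = ["p7", "p2", "p9", "p4"]
relevant = {"p2", "p4", "p5"}

print(hit_at_k(ranked, relevant, 1))
print(hit_at_k(ranked, relevant, 5))
print(recall_at_k(ranked, relevant, 20))
print(reciprocal_rank(ranked, relevant))

# Delta in percentage points between DPR and BM25 Hit@1, using the table values.
delta_pp = round(15.29 - 44.94, 2)
print(delta_pp)
```

Benchmark-level numbers such as those in the table are typically the mean of these per-query values over the test set, with MRR being the mean of the reciprocal ranks.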

What To Try In 7 Days

Run BM25 and a multivector retriever baseline on your SKB; compare to any dense retriever.

Add an LLM reranker for top-k results and measure latency/cost vs precision gains.

Inspect failure cases on your domain-specific SKB and add manual rules or filters for relational constraints.
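
As a starting point for the first two steps above, here is a minimal, dependency-free sketch of an Okapi BM25 ranker and a timing helper for measuring reranker latency. The documents, the whitespace tokenizer, the `k1` and `b` defaults, and `stub_reranker` are assumptions for illustration only; `stub_reranker` is a placeholder where a real LLM reranking call would go, and a production baseline would more likely use an existing search library.

```python
import math
import time
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer; real systems need proper text analysis.
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tf = [Counter(tokenize(d)) for d in docs]        # term counts per doc
        self.doc_len = [sum(c.values()) for c in self.tf]
        self.avgdl = sum(self.doc_len) / len(docs)
        df = Counter()
        for counts in self.tf:
            df.update(counts.keys())                           # document frequency
        n = len(docs)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, idx):
        total = 0.0
        for term in tokenize(query):
            f = self.tf[idx].get(term, 0)
            if f == 0:
                continue
            norm = 1 - self.b + self.b * self.doc_len[idx] / self.avgdl
            total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * norm)
        return total

    def rank(self, query, top_k=10):
        order = sorted(range(len(self.tf)), key=lambda i: self.score(query, i), reverse=True)
        return order[:top_k]


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def stub_reranker(query, candidates):
    # Placeholder: a real reranker would call an LLM and reorder the candidates.
    return list(candidates)


docs = [
    "lightweight waterproof hiking boots",
    "waterproof phone case",
    "leather office shoes",
]
bm25 = BM25(docs)
candidates = bm25.rank("waterproof hiking boots", top_k=3)
reranked, seconds = timed(stub_reranker, "waterproof hiking boots", candidates)
print(candidates, reranked, seconds)
```

Wrapping both the first-stage retriever and the reranker in `timed` gives a per-query latency figure to set against any gain in Hit@1 or Recall@20.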

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://stark.stanford.edu/
https://github.com/snap-stanford/STaRK

Data URLs

https://stark.stanford.edu/skb_explorer.html

Risks & Boundaries

Limitations

SKBs only cover textual and relational data; no images, audio, or other modalities.

Synthesized queries rely on LLMs for generation and filtering, so they may inherit model biases.

When Not To Use

If your retrieval problem is purely unstructured text without graph relations.

When ultra-low latency (<1s) is mandatory and you cannot afford reranker costs.

Failure Modes

Dense retrievers over- or under-emphasize repeated keywords and miss relational constraints.

LLM rerankers increase top precision but still have low recall and can be confidently wrong.

Core Entities

Models

BM25, DPR, ANCE, QAGNN, text-embedding-ada-002, voyage-l2-instruct, LLM2Vec, GritLM-7b, multi-ada-002, ColBERTv2, GPT-4 (gpt-4-1106-preview), Claude3 (claude-3-opus)

Metrics

Hit@1, Hit@5, Recall@20, MRR, Latency (s)

Datasets

STARK-AMAZON, STARK-MAG, STARK-PRIME, Amazon Product Reviews, Amazon Q&A, ogbn-mag / ogbn-papers100M, PrimeKG

Benchmarks

STARK

