STARK: a large benchmark testing LLM-based retrieval on semi-structured knowledge (text + graph)

April 19, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark is well documented and evaluated across models and humans, giving robust evidence that mixed textual and relational queries are hard. Production use, however, requires latency and privacy trade-offs.

Citations: 4

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 55%

Production readiness: 60%

Novelty: 70%

Authors

Shirley Wu, Shiyu Zhao, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N. Ioannidis, Karthik Subbian, James Zou, Jure Leskovec

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Search and recommendation systems often need to reason over both product text and structured relationships; STARK shows many current retrievers miss important multi-hop or relational signals, so products relying on naive retrieval risk poor search quality or unsafe omissions.

Who Should Care

Summary TLDR

STARK is a new, large benchmark for retrieval on semi-structured knowledge bases (SKBs) that combine textual node documents with graph relations. The authors build three public SKBs (Amazon products, academic papers, biomedical PrimeKG), synthesize diverse multi-hop natural-language queries with an automatic pipeline, validate query quality with humans, add 263 human queries, and run a wide set of baselines. Results show that simple BM25 and multivector methods remain strong, while LLM-based rerankers raise top-rank accuracy but still miss many relevant items and add substantial latency. STARK exposes clear gaps in current retrievers for mixed textual and relational search.

Problem Statement

Real user queries often mix free-form text and graph relations (e.g., “product by brand X that matches feature Y” or “papers from institution A on topic B”). Existing benchmarks study text or graphs separately. We lack a large, realistic testbed to measure how well retrieval systems — especially LLM-driven ones — handle both textual and relational requirements on large SKBs.
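As a rough illustration of what "textual plus relational" means here, the sketch below models a toy SKB as typed nodes with text documents and typed edges, then answers a "product by brand X that matches feature Y" query by requiring both constraints. The class names, the `has_brand` relation, and the sample products are all hypothetical and are not taken from the STARK code or data.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One SKB entity: a typed node with a free-text document."""
    node_id: int
    node_type: str
    text: str


@dataclass
class SKB:
    """Toy semi-structured knowledge base: node documents plus typed edges."""
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: set = field(default_factory=set)    # (source_id, relation, target_id)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, source_id, relation, target_id):
        self.edges.add((source_id, relation, target_id))

    def neighbors(self, source_id, relation):
        """Ids of nodes reachable from source_id over one edge of the given relation."""
        return {t for (s, r, t) in self.edges if s == source_id and r == relation}


def answer(skb, brand_name, keyword):
    """Products linked to the named brand AND mentioning the keyword in their text."""
    brand_ids = {
        n.node_id for n in skb.nodes.values()
        if n.node_type == "brand" and n.text == brand_name
    }
    hits = []
    for n in skb.nodes.values():
        if n.node_type != "product":
            continue
        relational_ok = bool(skb.neighbors(n.node_id, "has_brand") & brand_ids)
        textual_ok = keyword.lower() in n.text.lower()
        if relational_ok and textual_ok:
            hits.append(n.node_id)
    return sorted(hits)


# Hypothetical sample data
skb = SKB()
skb.add_node(Node(1, "brand", "Acme"))
skb.add_node(Node(2, "brand", "Globex"))
skb.add_node(Node(3, "product", "Waterproof hiking boots with ankle support"))
skb.add_node(Node(4, "product", "Waterproof rain jacket"))
skb.add_node(Node(5, "product", "Leather office shoes"))
skb.add_edge(3, "has_brand", 1)
skb.add_edge(4, "has_brand", 2)
skb.add_edge(5, "has_brand", 1)

print(answer(skb, "Acme", "waterproof"))  # [3]
```

A text-only retriever would also surface product 4 (it matches "waterproof"), and a graph-only lookup would also return product 5 (it is an Acme product); only the combined check isolates product 3.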

Main Contribution

Three large semi-structured knowledge bases (SKBs), STARK-AMAZON, STARK-MAG, and STARK-PRIME, that combine node text with graph relations.

An automatic four-step pipeline to synthesize natural, role-specific multi-hop queries that entangle relational and textual constraints and filter ground-truth answers with LLM verification.

Key Findings

Classic sparse baseline (BM25) is still competitive and often outperforms small dense retrievers on STARK.

Numbers: STARK-AMAZON (synth): BM25 Hit@1 44.94 vs DPR Hit@1 15.29 (Table 6)

Practical Use: Start with BM25 or tuned sparse methods before investing in dense retriever training for large SKBs; dense models may need larger capacity or different training to beat sparse baselines.

Evidence Ref: Table 6
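For readers who want a sparse baseline to start from, here is a minimal pure-Python sketch of Okapi BM25 scoring. It assumes whitespace tokenization, the common defaults k1 = 1.5 and b = 0.75, and a smoothed idf; none of these settings are confirmed to match the configuration used in the paper, and the sample documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real setups often do more)."""
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 over an in-memory list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.doc_lens = [len(tokens) for tokens in self.doc_tokens]
        self.avg_len = sum(self.doc_lens) / len(docs)
        n = len(docs)
        # Document frequency: number of documents containing each term
        df = Counter(term for tokens in self.doc_tokens for term in set(tokens))
        # Smoothed idf, which stays positive even for very common terms
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        freqs, length = self.doc_freqs[i], self.doc_lens[i]
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def rank(self, query, k=10):
        """Indices of up to k documents with a positive score, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for s, i in scored[:k] if s > 0]


# Hypothetical node documents
docs = [
    "waterproof hiking boots",
    "leather office shoes",
    "waterproof rain jacket with hood",
]
print(BM25(docs).rank("waterproof boots"))  # [0, 2]
```

On a real SKB each document would be the text attached to a node; how relational information gets serialized into that text is a separate design choice this sketch leaves open.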

LLM rerankers (GPT‑4 / Claude3) improve top-rank accuracy but still miss many relevant items.

Numbers: Synthesized STARK-AMAZON: GPT‑4 Reranker Hit@1 44.79, R@20 55.35; STARK-PRIME R@20 34.05 (Table 6)

Practical Use: Use LLM reranking to lift precision at top results, but expect incomplete recall and budget for substantial compute and latency costs.

Evidence Ref: Table 6
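The retrieve-then-rerank pattern behind this finding can be sketched as follows. The scorer is passed in as a plain callable so that no particular LLM API is assumed; `stub_llm_score` is a hypothetical word-overlap placeholder and does not reflect how the paper prompts GPT-4 or Claude3.

```python
def rerank(query, candidates, score_fn, top_k=20):
    """Re-order the first top_k candidates by score_fn; leave the tail untouched.

    candidates: list of (doc_id, text) pairs in first-stage order.
    score_fn:   callable (query, text) -> float, a stand-in for an LLM call.
    """
    head, tail = candidates[:top_k], candidates[top_k:]
    scored = [
        (score_fn(query, text), position, doc_id, text)
        for position, (doc_id, text) in enumerate(head)
    ]
    # Higher score first; ties keep the first-stage order
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, text) for _, _, doc_id, text in scored] + tail


def stub_llm_score(query, text):
    """Hypothetical placeholder: fraction of query words that appear in the text."""
    words = query.lower().split()
    return sum(w in text.lower() for w in words) / len(words)


# Hypothetical first-stage output, best-first according to the retriever
candidates = [
    ("p1", "rain jacket"),
    ("p2", "waterproof hiking boots by Acme"),
    ("p3", "Acme leather shoes"),
    ("p4", "waterproof Acme boots"),  # sits outside the rerank window below
]
reranked = rerank("waterproof Acme boots", candidates, stub_llm_score, top_k=3)
print([doc_id for doc_id, _ in reranked])  # ['p2', 'p3', 'p1', 'p4']
```

Item p4 matches the query fully but stays last because it falls outside the window of three. A reranker can only reorder what the first stage hands it, which is one reason recall stays incomplete, and every candidate in the window costs one scorer call.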

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
BM25 Hit@1 (STARK-AMAZON, synthesized) | 44.94 | n/a | n/a | STARK-AMAZON (synth) | Table 6 full test | Table 6
DPR Hit@1 (STARK-AMAZON, synthesized) | 15.29 | BM25 44.94 | -29.65 pp | STARK-AMAZON (synth) | Table 6 full test | Table 6
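The delta column is a difference in percentage points (pp), not a relative change. A small check of the DPR row, using only the two values reported above (the helper name is illustrative):

```python
def delta_pp(value, baseline):
    """Difference between two percentage metrics, in percentage points."""
    return round(value - baseline, 2)


bm25_hit1 = 44.94  # baseline, from the results above
dpr_hit1 = 15.29   # from the results above

print(delta_pp(dpr_hit1, bm25_hit1))  # -29.65
```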

What To Try In 7 Days

Run BM25 and a multivector retriever baseline on your SKB; compare to any dense retriever.

Add an LLM reranker for top-k results and measure latency/cost vs precision gains.

Inspect failure cases on your domain-specific SKB and add manual rules or filters for relational constraints.
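To put numbers on the latency side of that comparison, a small timing harness along these lines may help. It reports mean and nearest-rank 95th-percentile latency per query; the two pipelines are hypothetical stand-ins (the reranker is simulated with a short sleep) and should be replaced with your own retriever and reranker calls.

```python
import math
import statistics
import time


def time_pipeline(pipeline, queries):
    """Run pipeline(query) for each query; return per-query latencies in seconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        pipeline(query)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Mean and nearest-rank 95th percentile of a list of latencies."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"mean_s": statistics.mean(ordered), "p95_s": ordered[p95_index]}


def retrieve_only(query):
    """Hypothetical stand-in for a first-stage retriever."""
    return sorted(query.split())


def retrieve_then_rerank(query):
    """Hypothetical stand-in for retrieval followed by a slow reranker call."""
    candidates = retrieve_only(query)
    time.sleep(0.01)  # simulated reranker latency
    return candidates


queries = ["waterproof boots by Acme", "papers from institution A on topic B"]
for name, pipeline in [("retrieve", retrieve_only),
                       ("retrieve+rerank", retrieve_then_rerank)]:
    print(name, summarize(time_pipeline(pipeline, queries)))
```

Pairing these latency figures with Hit@1 and Recall@20 on the same query set shows whether the precision gain from reranking justifies its cost on your data.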

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

SKBs only cover textual and relational data; no images, audio, or other modalities.

Synthesized queries rely on LLMs for generation and filtering, so they may inherit model biases.

When Not To Use

If your retrieval problem is purely unstructured text without graph relations.

When ultra-low latency (<1s) is mandatory and you cannot afford reranker costs.

Failure Modes

Dense retrievers over- or under-emphasize repeated keywords and miss relational constraints.

LLM rerankers increase top precision but still have low recall and can be confidently wrong.

Core Entities

Models

BM25, DPR, ANCE, QAGNN, text-embedding-ada-002, voyage-l2-instruct, LLM2Vec, GritLM-7b, multi-ada-002, ColBERTv2, GPT-4 (gpt-4-1106-preview), Claude3 (claude-3-opus)

Metrics

Hit@1, Hit@5, Recall@20, MRR, Latency (s)
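The ranking metrics listed here have standard definitions, sketched below for a single query plus an average over queries for MRR. This follows the textbook formulas; the benchmark's own evaluation code may differ in details such as tie handling, and the sample rankings are made up.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if any of the top-k ranked ids is relevant, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant id, or 0.0 if none is ranked."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example: one query with three relevant answers, two of them retrieved
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}

print(hit_at_k(ranked, relevant, 1))      # 0.0
print(hit_at_k(ranked, relevant, 5))      # 1.0
print(recall_at_k(ranked, relevant, 20))  # 2 of 3 relevant ids found
print(reciprocal_rank(ranked, relevant))  # 0.5
print(mean_reciprocal_rank([(ranked, relevant), (["x", "y"], {"x"})]))  # 0.75
```

Hit@k rewards getting at least one correct answer near the top, while Recall@20 measures how many of the correct answers are found at all, so a system can score well on one and poorly on the other.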

Datasets

STARK-AMAZON, STARK-MAG, STARK-PRIME, Amazon Product Reviews, Amazon Q&A, ogbn-mag / ogbn-papers100M, PrimeKG

Benchmarks

STARK