STARK: a large benchmark testing LLM-based retrieval on semi-structured knowledge (text + graph)

April 19, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.55

Citation Count

4

Authors

Shirley Wu, Shiyu Zhao, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N. Ioannidis, Karthik Subbian, James Zou, Jure Leskovec

Links

Abstract / PDF

Why It Matters For Business

Search and recommendation systems often need to reason over both product text and structured relationships. STARK shows that many current retrievers miss important multi-hop or relational signals, so products that rely on naive retrieval risk poor search quality or unsafe omissions.

Summary TLDR

STARK is a new, large benchmark for retrieval on semi-structured knowledge bases (SKBs) that combine textual node documents with graph relations. The authors build three public SKBs (Amazon products, academic papers, biomedical PrimeKG), synthesize diverse multi-hop natural-language queries with an automatic pipeline, validate query quality with human evaluators, add 263 human-generated queries, and evaluate a broad set of baselines. Results show that simple BM25 and multivector methods remain strong. LLM-based rerankers raise top-rank accuracy but are still far from solving the task and add substantial latency. STARK exposes clear gaps in current retrievers for mixed textual and relational search.

Problem Statement

Real user queries often mix free-form text and graph relations (e.g., “product by brand X that matches feature Y” or “papers from institution A on topic B”). Existing benchmarks study text or graphs separately. We lack a large, realistic testbed to measure how well retrieval systems — especially LLM-driven ones — handle both textual and relational requirements on large SKBs.
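
To make the idea of a mixed query concrete, here is a minimal sketch of an SKB as nodes with free-form text plus typed edges, and a query that applies one relational constraint and one textual constraint together. The class names, the `has_brand` relation, and the toy data are all hypothetical illustrations, not STARK's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Node:
    node_id: int
    node_type: str  # hypothetical types, e.g. "product" or "brand"
    text: str       # free-form document attached to the node


@dataclass
class SKB:
    nodes: Dict[int, Node] = field(default_factory=dict)
    # edges[(source_id, relation)] -> set of target ids
    edges: Dict[Tuple[int, str], Set[int]] = field(default_factory=dict)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: int, relation: str, dst: int) -> None:
        self.edges.setdefault((src, relation), set()).add(dst)

    def neighbors(self, src: int, relation: str) -> Set[int]:
        return self.edges.get((src, relation), set())


def search(skb: SKB, node_type: str, relation: str, target_id: int,
           keywords: List[str]) -> List[int]:
    """Ids of nodes that satisfy BOTH the relational and the textual constraint."""
    hits = []
    for node in skb.nodes.values():
        if node.node_type != node_type:
            continue
        # Relational constraint: node must link to target_id via `relation`.
        if target_id not in skb.neighbors(node.node_id, relation):
            continue
        # Textual constraint: every keyword must appear in the node text.
        text = node.text.lower()
        if all(k.lower() in text for k in keywords):
            hits.append(node.node_id)
    return sorted(hits)


skb = SKB()
skb.add_node(Node(1, "brand", "Brand X"))
skb.add_node(Node(2, "product", "Waterproof hiking boots with ankle support"))
skb.add_node(Node(3, "product", "Waterproof phone case"))
skb.add_node(Node(4, "product", "Leather office shoes"))
skb.add_edge(2, "has_brand", 1)
skb.add_edge(4, "has_brand", 1)

# "Product by brand X that is waterproof"
result = search(skb, "product", "has_brand", 1, ["waterproof"])
print(result)
```

Node 3 matches the text but not the relation, and node 4 matches the relation but not the text. A retriever that scores only one of the two signals would rank at least one of them incorrectly.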

Main Contribution

Three large semi-structured knowledge bases (SKBs), STARK-AMAZON, STARK-MAG, and STARK-PRIME, each combining node text with graph relations.

An automatic four-step pipeline to synthesize natural, role-specific multi-hop queries that entangle relational and textual constraints and filter ground-truth answers with LLM verification.

A set of 263 human-generated queries, plus a human evaluation that rates the queries highly on naturalness, diversity, and practicality.

Comprehensive baseline evaluation across sparse, dense, multivector retrievers and LLM rerankers, plus latency measurements and analysis.

Key Findings

Classic sparse baseline (BM25) is still competitive and often outperforms small dense retrievers on STARK.

Numbers: STARK-AMAZON (synthesized): BM25 Hit@1 44.94 vs DPR Hit@1 15.29 (Table 6)

LLM rerankers (GPT‑4 / Claude3) improve top-rank accuracy but still miss many relevant items.

Numbers: Synthesized STARK-AMAZON: GPT‑4 Reranker Hit@1 44.79, R@20 55.35; STARK-PRIME R@20 34.05 (Table 6)

Best systems leave substantial gaps on recall and top accuracy, especially on biomedical SKB.

Numbers: Best R@20 across datasets < 60% (GPT‑4 R@20: 55% Amazon, 49% MAG, 34% PRIME; Hit@1 on PRIME ~18%)

LLM rerankers add large latency compared with compact retrievers.

Numbers: Average latencies: DPR 1.4 s, ada-002 2.83 s, GPT‑4 reranker ~25 s (Table 8)

Results

BM25 Hit@1 (STARK-AMAZON, synthesized)

Value: 44.94

DPR Hit@1 (STARK-AMAZON, synthesized)

Value: 15.29

Baseline: BM25 44.94

GPT‑4 Reranker Recall@20 (STARK-AMAZON, synthesized)

Value: 55.35

Baseline: ada-002 R@20 53.29

Claude3 Reranker Hit@1 (STARK-AMAZON, human-generated)

Value: 53.09

Baseline: BM25 27.16

Latency (avg) of GPT‑4/Claude3 rerankers

Value: 25.05–26.33 s

Baseline: DPR avg 1.4 s
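
As a rough worked ratio using only the latency figures reported above (no new measurements), the reranker overhead relative to DPR is about 18 to 19 times:

```latex
\frac{25.05\ \text{s}}{1.4\ \text{s}} \approx 17.9,
\qquad
\frac{26.33\ \text{s}}{1.4\ \text{s}} \approx 18.8
```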

Who Should Care

What To Try In 7 Days

Run BM25 and a multivector retriever baseline on your SKB; compare to any dense retriever.

Add an LLM reranker for top-k results and measure latency/cost vs precision gains.

Inspect failure cases on your domain-specific SKB and add manual rules or filters for relational constraints.
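
The first two steps above can be prototyped as a retrieve-then-rerank loop with per-stage timing. The sketch below uses a small pure-Python Okapi BM25 (with the common defaults k1 = 1.5 and b = 0.75, which are assumptions here, not settings taken from the paper) and a placeholder reranker that only counts term overlap. In a real trial, `placeholder_reranker` would be swapped for a call to your LLM of choice. All names and the toy corpus are hypothetical.

```python
import math
import time
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BM25:
    """Okapi BM25 over a small in-memory corpus."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(toks) for toks in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(docs)
        n = len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for toks in self.doc_tokens for term in set(toks))
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, idx):
        freqs, length = self.doc_freqs[idx], len(self.doc_tokens[idx])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by doc index
        return [i for _, i in scored[:k]]


def placeholder_reranker(query, docs, candidates):
    """Stand-in for an LLM reranker: orders candidates by query-term overlap."""
    q = set(tokenize(query))
    return sorted(candidates, key=lambda i: (-len(q & set(tokenize(docs[i]))), i))


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


docs = [
    "waterproof hiking boots by brand x",
    "leather office shoes by brand x",
    "waterproof phone case",
    "trail running shoes with waterproof membrane",
]
query = "waterproof boots brand x"

bm25 = BM25(docs)
candidates, retrieve_s = timed(bm25.top_k, query, 3)
reranked, rerank_s = timed(placeholder_reranker, query, docs, candidates)
print("first stage:", candidates, "reranked:", reranked)
print(f"retrieve {retrieve_s:.6f}s, rerank {rerank_s:.6f}s")
```

Logging the two timings separately keeps the latency cost of the reranking stage visible next to any precision gain, which is the trade-off the second step asks you to measure.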

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • SKBs only cover textual and relational data; no images, audio, or other modalities.
  • Synthesized queries rely on LLMs for generation and filtering which may inherit model biases.
  • Human-generated queries are limited (263 total) and may not cover all real-world linguistic diversity.

When Not To Use

  • If your retrieval problem is purely unstructured text without graph relations.
  • When ultra-low latency (<1s) is mandatory and you cannot afford reranker costs.
  • For domains where the data is private and cannot be shared: SKBs built from public sources, even anonymized ones, may not match private SKBs.

Failure Modes

  • Dense retrievers over- or under-emphasize repeated keywords and miss relational constraints.
  • LLM rerankers increase top precision but still have low recall and can be confidently wrong.
  • Synthesized queries can miss idiomatic or emerging language patterns that appear in a broad population of real user queries.

Core Entities

Models

  • BM25
  • DPR
  • ANCE
  • QAGNN
  • text-embedding-ada-002
  • voyage-l2-instruct
  • LLM2Vec
  • GritLM-7b
  • multi-ada-002
  • ColBERTv2
  • GPT-4 (gpt-4-1106-preview)
  • Claude3 (claude-3-opus)

Metrics

  • Hit@1
  • Hit@5
  • Recall@20
  • MRR
  • Latency (s)
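
For reference, the ranking metrics listed above can be computed as in the sketch below. It assumes the standard definitions (Hit@k is 1 if any ground-truth item appears in the top k, Recall@k is the fraction of ground-truth items in the top k, and MRR averages the reciprocal rank of the first ground-truth item); the paper's exact implementation may differ in details such as tie handling. Function names and the toy runs are hypothetical.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def evaluate(runs):
    """Average each metric over (ranked_list, relevant_set) pairs, one per query."""
    n = len(runs)
    return {
        "Hit@1": sum(hit_at_k(r, rel, 1) for r, rel in runs) / n,
        "Hit@5": sum(hit_at_k(r, rel, 5) for r, rel in runs) / n,
        "Recall@20": sum(recall_at_k(r, rel, 20) for r, rel in runs) / n,
        "MRR": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Two toy queries: the first finds one of two answers at rank 2,
# the second finds its only answer at rank 1.
runs = [
    (["a", "b", "c"], {"b", "z"}),
    (["x", "y"], {"x"}),
]
scores = evaluate(runs)
print(scores)
```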

Datasets

  • STARK-AMAZON
  • STARK-MAG
  • STARK-PRIME
  • Amazon Product Reviews
  • Amazon Q&A
  • ogbn-mag / ogbn-papers100M
  • PrimeKG

Benchmarks

  • STARK