A practical benchmark showing prompt design and graph encoding matter — LLMs can help on graph tasks but still trail graph models.

May 24, 2023 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.45

Cost Impact Score

0.35

Citation Count

14

Authors

Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, Shi Han

Why It Matters For Business

Feeding graph text and simple prompt strategies to LLMs is a cheap way to build KG question answering and automated query generation prototypes, but specialized graph models still give higher accuracy for production.

Summary TLDR

The authors build a framework and benchmark (GUC/GPT4Graph) to test whether LLMs can understand graph data. They convert graphs into text (a graph description language), experiment with manual and self-prompting, and evaluate InstructGPT (text‑davinci variants) on 10 tasks spanning structural (degree, diameter, edges) and semantic (KGQA, node/graph classification, query generation) categories. Key findings: adding the raw graph text to the prompt greatly improves some tasks (e.g., KGQA Wiki accuracy: 9.23 → 56.38), one-shot query generation can produce executable Cypher (99% on MetaQA-1hop), and neighborhood context helps node classification (one‑shot 2-hop 60% vs. zero‑shot self 48%). Overall, LLMs help for prototyping but do not yet match specialized graph models.

Problem Statement

Modern LLMs are strong on text, but graphs are relational and multi-dimensional. We need to know whether and how LLMs can reason over graphs, which input formats and prompts help, and how LLMs compare to graph-specialized methods across common graph tasks.

Main Contribution

A simple pipeline that converts graphs into a text format (graph description language) and feeds them to LLMs with a prompt handler (manual and self-prompting).

A public benchmark covering ten graph tasks (structure + semantic) to evaluate LLM graph understanding.

An empirical study showing which prompt/input designs help and where LLMs still fall short versus graph models; code and data release planned.
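
The graph-to-text step can be illustrated with a minimal sketch. The summary does not give the exact graph description language used in the paper, so the edge-list and adjacency formats below, along with the function names and the role sentence, are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def to_edge_list_text(nodes: List[str], edges: List[Edge]) -> str:
    """Serialize an undirected graph as a plain-text edge list."""
    lines = ["Nodes: " + ", ".join(nodes), "Edges:"]
    lines += [f"{u} -- {v}" for u, v in edges]
    return "\n".join(lines)


def to_adjacency_text(nodes: List[str], edges: List[Edge]) -> str:
    """Serialize the same graph as one 'node: neighbors' line per node."""
    neighbors: Dict[str, List[str]] = {n: [] for n in nodes}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    lines = []
    for n in nodes:
        listed = ", ".join(sorted(neighbors[n])) if neighbors[n] else "(none)"
        lines.append(f"{n}: {listed}")
    return "\n".join(lines)


def build_prompt(graph_text: str, question: str,
                 role: str = "You are a graph analyst.") -> str:
    """Assemble a manual prompt: role instruction, graph text, then the question."""
    return f"{role}\n\nGraph:\n{graph_text}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    nodes = ["A", "B", "C"]
    edges = [("A", "B"), ("B", "C")]
    print(build_prompt(to_edge_list_text(nodes, edges), "What is the degree of node B?"))
```

Keeping the serializer separate from the prompt builder makes it easy to swap formats or reorder prompt parts, which matters because the findings report that input order and role prompts change scores.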

Key Findings

Adding the graph text to LLM inputs dramatically improves KGQA on Wiki.

Numbers: zero-shot 9.23 → zero-shot+graph 56.38

A single example lets the LLM generate executable Cypher queries for 1‑hop QA.

Numbers: one-shot Cypher on MetaQA-1hop = 99.00

Local neighborhood text improves node classification accuracy.

Numbers: zero-shot self 48 → one-shot 2-hop 60 (accuracy)

Input design choices (order, role prompt, examples, CoT) change structural task scores substantially.

Numbers: size detection: 1-shot 35.50 → 1-shot-cot 44.00 (+8.5); change-order removal −21.5

Overall, LLMs usually trail specialized models on semantic graph benchmarks.

Numbers: SOTA KGQA Wiki 64.70 vs. best zero-shot+graph 56.38
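
The one-shot Cypher finding can be sketched as a prompt builder plus a conservative safety check on the model output. The schema string, the demonstration pair, and both function names are hypothetical and are not taken from the paper or from MetaQA; the read-only check is an extra precaution, not part of the reported method.

```python
import re

# Hypothetical schema and demonstration pair, used only for illustration.
SCHEMA = "(:Movie {title})-[:DIRECTED_BY]->(:Person {name})"
EXAMPLE_QUESTION = "Who directed Inception?"
EXAMPLE_CYPHER = (
    "MATCH (m:Movie {title: 'Inception'})-[:DIRECTED_BY]->(p:Person) "
    "RETURN p.name"
)


def build_one_shot_cypher_prompt(question: str) -> str:
    """Build a prompt with exactly one worked question-to-Cypher example."""
    return (
        "Translate the question into a Cypher query.\n"
        f"Schema: {SCHEMA}\n\n"
        f"Question: {EXAMPLE_QUESTION}\n"
        f"Cypher: {EXAMPLE_CYPHER}\n\n"
        f"Question: {question}\n"
        "Cypher:"
    )


_WRITE_KEYWORDS = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)


def looks_read_only(cypher: str) -> bool:
    """Conservative check: the query starts with MATCH and has no write keyword.

    This may reject safe queries (for example a title containing the word
    'set'); it is a guard to run before execution, not a Cypher parser.
    """
    text = cypher.strip()
    return text.upper().startswith("MATCH") and _WRITE_KEYWORDS.search(text) is None


if __name__ == "__main__":
    print(build_one_shot_cypher_prompt("Who directed Heat?"))
```

A generated query that passes `looks_read_only` still needs to be executed against the database to confirm it is valid and returns the expected answer.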

Results

Accuracy (KGQA, Wiki)

Value: zero-shot 9.23%; zero-shot+graph 56.38%

Baseline: SOTA 64.70%

Accuracy (Cypher query generation, MetaQA-1hop)

Value: one-shot 99.00%

Baseline: zero-shot Cypher 30.00%

Accuracy (node classification)

Value: one-shot, 2-hop context 60%

Baseline: zero-shot, self 48%

Structure task sensitivity (size detection)

Value: 1-shot 35.50% → 1-shot-cot 44.00%

Baseline: 1-shot 35.50%
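
For a quick comparison, the gains can be recomputed in percentage points from the figures reported above. The snippet below is only arithmetic on those numbers; the labels and function names are illustrative.

```python
# Accuracy figures (percent) copied from the results reported above.
RESULTS = [
    ("KGQA Wiki: zero-shot -> zero-shot+graph", 9.23, 56.38),
    ("Cypher on MetaQA-1hop: zero-shot -> one-shot", 30.00, 99.00),
    ("Node classification: zero-shot self -> one-shot 2-hop", 48.00, 60.00),
    ("Size detection: 1-shot -> 1-shot-cot", 35.50, 44.00),
]
SOTA_KGQA_WIKI = 64.70
BEST_LLM_KGQA_WIKI = 56.38


def gain_points(before: float, after: float) -> float:
    """Absolute improvement in percentage points, rounded to two decimals."""
    return round(after - before, 2)


def gap_to_sota(best: float, sota: float) -> float:
    """Remaining gap to the reported state of the art, in percentage points."""
    return round(sota - best, 2)


if __name__ == "__main__":
    for name, before, after in RESULTS:
        print(f"{name}: {gain_points(before, after):+.2f} points")
    gap = gap_to_sota(BEST_LLM_KGQA_WIKI, SOTA_KGQA_WIKI)
    print(f"KGQA Wiki gap to SOTA: {gap:.2f} points")
```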

Who Should Care

What To Try In 7 Days

Add a compact graph description (edge list or neighbor summaries) to LLM prompts for KGQA and re-run key queries.

Try one-shot Cypher examples to generate executable graph DB queries and validate them in Neo4j.

Use self-prompting to produce a format explanation and 1–2 hop neighborhood summaries before asking the LLM for predictions.
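
The neighborhood summaries mentioned above can be built deterministically before prompting. The following is a minimal sketch using breadth-first search; the adjacency-dict input, the label dict, and the output wording are assumptions, and the paper's self-prompting step (where the LLM writes its own summary) is not reproduced here.

```python
from collections import deque
from typing import Dict, List


def k_hop_neighbors(adj: Dict[str, List[str]], start: str, k: int) -> Dict[str, int]:
    """Return {node: hop distance} for nodes within k hops of start, excluding start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond k hops
        for nb in adj.get(node, []):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    del dist[start]
    return dist


def summarize_neighborhood(adj: Dict[str, List[str]], labels: Dict[str, str],
                           start: str, k: int = 2) -> str:
    """Render the k-hop neighborhood as text, one line per hop distance."""
    hops = k_hop_neighbors(adj, start, k)
    lines = [f"Target node: {start}"]
    for h in range(1, k + 1):
        members = sorted(n for n, d in hops.items() if d == h)
        described = ", ".join(f"{n} ({labels.get(n, 'unknown')})" for n in members)
        lines.append(f"{h}-hop neighbors: {described or '(none)'}")
    return "\n".join(lines)


if __name__ == "__main__":
    adj = {"p1": ["p2", "p3"], "p2": ["p1", "p4"], "p3": ["p1"],
           "p4": ["p2", "p5"], "p5": ["p4"]}
    labels = {"p2": "cs.LG", "p4": "cs.CL"}
    print(summarize_neighborhood(adj, labels, "p1", k=2))
```

Limiting the summary to 1 or 2 hops keeps the prompt short, which fits the small subgraphs (about 10 to 20 nodes) used in the evaluation.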

Reproducibility

License

  • CDLA-Permissive-2.0 (data), MIT (code)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations use only InstructGPT (text‑davinci variants); results may differ for other LLMs.
  • Subgraphs are small (≈10–20 nodes); behavior on large graphs is not measured.
  • The benchmark and code are pending full public release; some details rely on planned artifacts.

When Not To Use

  • Do not rely on these LLM prompt recipes for production KGQA that needs state‑of‑the‑art accuracy.
  • Avoid for very large graphs, where text serialization would be impractical or would be truncated by the context window.
  • Avoid when deterministic, provable graph algorithms (e.g., exact diameter) are required.
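
When exact structural answers are required, a classical algorithm is the safer choice and can also serve as ground truth for checking LLM outputs. Below is a minimal sketch of exact degree and diameter computation with breadth-first search, assuming an unweighted, undirected graph stored as an adjacency dict; the function names are illustrative.

```python
from collections import deque
from typing import Dict, List


def bfs_distances(adj: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Shortest hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def degree(adj: Dict[str, List[str]], node: str) -> int:
    """Number of neighbors of a node."""
    return len(adj[node])


def diameter(adj: Dict[str, List[str]]) -> int:
    """Exact diameter: the longest shortest path over all node pairs.

    Raises ValueError for an empty or disconnected graph.
    """
    if not adj:
        raise ValueError("empty graph")
    longest = 0
    for source in adj:
        dist = bfs_distances(adj, source)
        if len(dist) != len(adj):
            raise ValueError("graph is disconnected")
        longest = max(longest, max(dist.values()))
    return longest


if __name__ == "__main__":
    path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
    print(degree(path, "B"), diameter(path))
```

Running one BFS per node costs O(V·(V+E)) time, which is negligible for subgraphs of the size used in the benchmark.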

Failure Modes

  • Numeric and structural computations can be wrong or hallucinated without explicit graph encoding.
  • Performance is sensitive to prompt order, role instructions and example choice.
  • Multi‑hop reasoning degrades quickly unless the graph is provided and carefully summarized.

Core Entities

Models

  • InstructGPT (text-davinci-001)
  • InstructGPT (text-davinci-002)
  • InstructGPT (text-davinci-003)

Metrics

  • ACC

Datasets

  • OGBN-ARXIV
  • Aminer
  • OGBG-MOLHIV
  • OGBG-MOLPCBA
  • Wiki
  • MetaQA

Benchmarks

  • GUC (GPT4Graph benchmark)