A practical benchmark showing that prompt design and graph encoding matter: LLMs can help on graph tasks but still trail graph models.

May 24, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides systematic empirical evidence that prompt and input design strongly affect LLM graph reasoning; results support quick prototyping but show LLMs rarely beat graph‑native models across benchmarks.

Citations: 14

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

License: CDLA-Permissive-2.0 (data), MIT (code)

At A Glance

Cost impact: 35%

Production readiness: 30%

Novelty: 45%

Authors

Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, Shi Han

Why It Matters For Business

Feeding graph text and simple prompt strategies to LLMs is a cheap way to build KG question answering and automated query generation prototypes, but specialized graph models still give higher accuracy for production.

Summary (TL;DR)

The authors build a framework and benchmark (GUC/GPT4Graph) to test whether LLMs can understand graph data. They convert graphs into text using a graph description language, experiment with manual prompting and self-prompting, and evaluate InstructGPT (text‑davinci variants) on 10 tasks spanning structural understanding (degree, diameter, edges) and semantic understanding (KGQA, node/graph classification, query generation). Key findings: including the raw graph text greatly improves some tasks (e.g., KGQA on Wiki: 9.23 → 56.38); one-shot query generation can produce executable Cypher (99% on MetaQA-1hop); and neighborhood context helps node classification (one‑shot 2-hop 60% vs. zero‑shot self 48%). Overall, LLMs help for prototyping but do not yet match graph-specialized models.

Problem Statement

Modern LLMs are strong on text but graphs are relational and multi-dimensional. We need to know if and how LLMs can reason over graphs, what input formats and prompts help, and how LLMs compare to graph-specialized methods across common graph tasks.

Main Contribution

A simple pipeline that converts graphs into a text format (graph description language) and feeds them to LLMs with a prompt handler (manual and self-prompting).

A public benchmark covering ten graph tasks (structure + semantic) to evaluate LLM graph understanding.
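The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a plain edge list stands in for the graph description language, and the names `Edge`, `describe_graph`, and `build_prompt`, the role text, and the sample triples are all assumptions made for the example.

```python
from typing import Iterable

# Hypothetical triple format for illustration: (head, relation, tail).
Edge = tuple[str, str, str]


def describe_graph(edges: Iterable[Edge]) -> str:
    """Serialize labeled edges into a plain-text edge list, one edge per line."""
    lines = [f"{head} -[{relation}]-> {tail}" for head, relation, tail in edges]
    return "\n".join(lines)


def build_prompt(
    edges: Iterable[Edge],
    question: str,
    role: str = "You are a graph analyst.",  # assumed role instruction
) -> str:
    """Assemble a manual prompt: role instruction, graph text, then the question."""
    return (
        f"{role}\n\n"
        f"Graph (edge list):\n{describe_graph(edges)}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy knowledge-graph fragment (made up for the example).
edges = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "starred_actors", "Leonardo DiCaprio"),
]
prompt = build_prompt(edges, "Who directed Inception?")
print(prompt)
```

The resulting string would be sent to whichever LLM is under test. Dropping the graph block from the prompt gives the plain zero-shot variant for comparison.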

Key Findings

Adding the graph text to LLM inputs dramatically improves KGQA on Wiki.

Numbers: zero-shot 9.23 → zero-shot+graph 56.38

Practical Use: Always include the graph description when using an LLM for KGQA; it can transform near‑random answers into useful results for prototyping.

Evidence Ref: Table 2 (KGQA Wiki)

A single example lets the LLM generate executable Cypher queries for 1‑hop QA.

Numbers: one-shot Cypher on MetaQA-1hop = 99.00

Practical Use: Use one-shot examples to get high‑quality query generation from LLMs and run them in a graph DB for fast retrieval.

Evidence Ref: Table 2 (Cypher Generation)
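A one-shot Cypher prompt might be assembled along these lines. The schema string, the example question and query, and the helper names `one_shot_cypher_prompt` and `looks_like_read_only` are hypothetical; the paper's actual prompt wording is not reproduced here. The read-only check is a crude keyword heuristic for screening generated queries before execution, not a Cypher parser.

```python
import re

# Keywords that indicate a write operation; a rough filter, not a full grammar.
WRITE_KEYWORDS = re.compile(r"\b(CREATE|MERGE|DELETE|SET|REMOVE|DROP)\b", re.IGNORECASE)


def one_shot_cypher_prompt(schema: str, example_q: str, example_cypher: str, question: str) -> str:
    """Build a prompt with exactly one worked question-to-Cypher example."""
    return (
        "Translate the question into a Cypher query for this graph schema.\n"
        f"Schema: {schema}\n\n"
        f"Question: {example_q}\nCypher: {example_cypher}\n\n"
        f"Question: {question}\nCypher:"
    )


def looks_like_read_only(query: str) -> bool:
    """Heuristic screen: starts with MATCH, contains RETURN, has no write keywords."""
    text = query.strip()
    return (
        text.upper().startswith("MATCH")
        and re.search(r"\bRETURN\b", text, re.IGNORECASE) is not None
        and WRITE_KEYWORDS.search(text) is None
    )


# Hypothetical movie schema and example, chosen only for illustration.
schema = "(:Movie {title})-[:DIRECTED_BY]->(:Person {name})"
example_cypher = "MATCH (m:Movie {title: 'Inception'})-[:DIRECTED_BY]->(p:Person) RETURN p.name"
prompt = one_shot_cypher_prompt(
    schema,
    "Who directed Inception?",
    example_cypher,
    "Who directed Interstellar?",
)
print(prompt)
```

Because the heuristic matches keywords anywhere in the query, a string literal containing a word such as "Set" would be rejected. For a prototype that errs on the safe side; a production system would need proper query validation.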

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | zero-shot 9.23%; zero-shot+graph 56.38% | SOTA 64.70% | zero-shot+graph +47.15 pp vs zero-shot | Wiki | Table 2 shows large gain when graph text is added | Table 2 |
| Accuracy | one-shot 99.00% | zero-shot Cypher 30.00% | +69.00 pp | MetaQA-1hop | Table 2 one-shot Cypher reaches 99% executable queries | Table 2 |
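The deltas in the results can be rechecked with simple percentage-point arithmetic. The sketch below uses only the figures reported above; the helper name `pp_delta` is an assumption, and the gap to SOTA is derived here from the listed 64.70% rather than quoted from the paper.

```python
def pp_delta(value: float, baseline: float) -> float:
    """Percentage-point difference between two accuracy figures, rounded to 2 places."""
    return round(value - baseline, 2)


# KGQA on Wiki: effect of adding the graph text to a zero-shot prompt.
kgqa_gain = pp_delta(56.38, 9.23)

# KGQA on Wiki: remaining distance to the listed SOTA figure (negative means behind).
gap_to_sota = pp_delta(56.38, 64.70)

# Cypher generation on MetaQA-1hop: one-shot versus zero-shot.
cypher_gain = pp_delta(99.00, 30.00)

print(f"KGQA gain from graph text: {kgqa_gain:+.2f} pp")
print(f"KGQA gap to SOTA:          {gap_to_sota:+.2f} pp")
print(f"Cypher one-shot gain:      {cypher_gain:+.2f} pp")
```

The second figure makes the trade-off explicit: adding graph text closes most of the gap on Wiki, but the LLM still sits below the graph-specialized baseline.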

What To Try In 7 Days

Add a compact graph description (edge list or neighbor summaries) to LLM prompts for KGQA and re-run key queries.

Try one-shot Cypher examples to generate executable graph DB queries and validate them in Neo4j.

Use self-prompting to produce a format explanation and 1–2 hop neighborhood summaries before asking the LLM for predictions.
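For the neighborhood-summary step, one possible starting point is to compute the 1–2 hop context programmatically and paste it into the prompt. Note that this is a deterministic stand-in, not the self-prompting procedure itself (where the LLM writes the summary); the function names, the adjacency-dict layout, and the toy citation graph are assumptions.

```python
from collections import deque


def k_hop_neighbors(adjacency: dict[str, list[str]], start: str, k: int) -> dict[str, int]:
    """Breadth-first search returning {node: hop distance} for nodes within k hops of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]  # the target node itself is not a neighbor
    return distances


def summarize_neighborhood(
    adjacency: dict[str, list[str]], labels: dict[str, str], start: str, k: int = 2
) -> str:
    """Render the k-hop neighborhood as short text lines suitable for a prompt."""
    hops = k_hop_neighbors(adjacency, start, k)
    lines = [f"Target node: {start}"]
    for hop in range(1, k + 1):
        members = sorted(n for n, d in hops.items() if d == hop)
        described = ", ".join(f"{n} ({labels.get(n, 'unlabeled')})" for n in members) or "none"
        lines.append(f"{hop}-hop neighbors: {described}")
    return "\n".join(lines)


# Toy undirected citation graph (made up for the example).
adjacency = {
    "p1": ["p2", "p3"],
    "p2": ["p1", "p4"],
    "p3": ["p1"],
    "p4": ["p2", "p5"],
    "p5": ["p4"],
}
labels = {"p2": "cs.LG", "p3": "cs.CL", "p4": "cs.LG"}
summary = summarize_neighborhood(adjacency, labels, "p1", k=2)
print(summary)
```

Keeping `k` at 1 or 2 keeps the summary short, which matters given how quickly serialized neighborhoods grow.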

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: CDLA-Permissive-2.0 (data), MIT (code)

Risks & Boundaries

Limitations

Evaluations use only InstructGPT (text‑davinci variants); results may differ for other LLMs.

Subgraphs are small (≈10–20 nodes); behavior on large graphs is not measured.

When Not To Use

Do not rely on these LLM prompt recipes for production KGQA that needs state‑of‑the‑art accuracy.

Avoid for very large graphs, where text serialization would be impractical or would get truncated.

Failure Modes

Numeric and structural computations can be wrong or hallucinated without explicit graph encoding.

Performance is sensitive to prompt order, role instructions and example choice.
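Since structural answers can be hallucinated, a cheap safeguard is to compute the ground truth directly and compare. The sketch below assumes a small, connected, undirected graph stored as an adjacency dict; the helper names and the answer-parsing rule (take the first integer in the reply) are illustrative assumptions, not part of the benchmark.

```python
import re
from collections import deque


def degree(adjacency: dict[str, list[str]], node: str) -> int:
    """Number of distinct neighbors of a node."""
    return len(set(adjacency.get(node, [])))


def eccentricity(adjacency: dict[str, list[str]], start: str) -> int:
    """Longest shortest-path distance from start, found by breadth-first search."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return max(distances.values())


def diameter(adjacency: dict[str, list[str]]) -> int:
    """Diameter of a connected graph: the maximum eccentricity over all nodes."""
    return max(eccentricity(adjacency, node) for node in adjacency)


def check_answer(llm_answer: str, expected: int) -> bool:
    """Compare the first integer found in a free-text reply with the computed value."""
    match = re.search(r"-?\d+", llm_answer)
    return match is not None and int(match.group()) == expected


# Path graph a - b - c - d (toy example).
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(diameter(path), degree(path, "b"))
print(check_answer("The diameter is 3.", diameter(path)))
```

At the subgraph sizes used in the evaluation (roughly 10–20 nodes) this exact check is essentially free, so it can be run on every structural query during prototyping.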

Core Entities

Models

InstructGPT (text-davinci-001), InstructGPT (text-davinci-002), InstructGPT (text-davinci-003)

Metrics

ACC

Datasets

OGBN-ARXIV, Aminer, OGBG-MOLHIV, OGBG-MOLPCBA, Wiki, MetaQA

Benchmarks

GUC (GPT4Graph benchmark)