A practical benchmark showing prompt design and graph encoding matter — LLMs can help on graph tasks but still trail graph models.

Overview

Decision Snapshot: Needs Validation

The paper provides systematic empirical evidence that prompt and input design strongly affect LLM graph reasoning; results support quick prototyping but show LLMs rarely beat graph‑native models across benchmarks.

Citations: 14

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

License: CDLA-Permissive-2.0 (data), MIT (code)

At A Glance

Cost impact: 35%

Production readiness: 30%

Novelty: 45%

Authors

Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, Shi Han

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Feeding graph text and simple prompt strategies to LLMs is a cheap way to build KG question answering and automated query generation prototypes, but specialized graph models still give higher accuracy for production.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

The authors build a framework and benchmark (GUC/GPT4Graph) to test whether LLMs can understand graph data. They convert graphs into text (graph description language), experiment with manual and self-prompting, and evaluate InstructGPT (text‑davinci variants) on 10 tasks spanning structural understanding (degree, diameter, edges) and semantic understanding (KGQA, node/graph classification, query generation). Key findings: adding the raw graph text to the input greatly improves some tasks (e.g., KGQA on Wiki: 9.23 → 56.38), one-shot query generation can produce executable Cypher (99% on MetaQA-1hop), and neighborhood context helps node classification (one‑shot 2-hop 60% vs zero‑shot self 48%). Overall LLMs help for prototyping but do <

Problem Statement

Modern LLMs are strong on text but graphs are relational and multi-dimensional. We need to know if and how LLMs can reason over graphs, what input formats and prompts help, and how LLMs compare to graph-specialized methods across common graph tasks.

Main Contribution

A simple pipeline that converts graphs into a text format (graph description language) and feeds them to LLMs with a prompt handler (manual and self-prompting).

A public benchmark covering ten graph tasks (structure + semantic) to evaluate LLM graph understanding.
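The graph-to-text step can be illustrated with a minimal sketch. It assumes two plain-text encodings (an edge list and an adjacency list) as stand-ins for a graph description language; the function names and the exact line format are hypothetical and are not taken from the authors' code.

```python
def to_edge_list(edges):
    """Render directed edges as one 'source -> target' line each."""
    return "\n".join(f"{src} -> {dst}" for src, dst in edges)


def to_adjacency_list(edges):
    """Render the same graph as 'node: neighbor, neighbor' lines."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
        adjacency.setdefault(dst, [])  # keep nodes that have no outgoing edges
    return "\n".join(
        f"{node}: {', '.join(nbrs) if nbrs else '(none)'}"
        for node, nbrs in adjacency.items()
    )


# Toy graph used only for illustration.
edges = [("A", "B"), ("A", "C"), ("B", "C")]
print(to_edge_list(edges))
print(to_adjacency_list(edges))
```

Either string can then be placed in the prompt ahead of the task question; which encoding works better is an empirical question for the task at hand.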

Key Findings

Adding the graph text to LLM inputs dramatically improves KGQA on Wiki.

Numbers: zero-shot 9.23 → zero-shot+graph 56.38

Practical Use: Always include the graph description when using an LLM for KGQA; it can transform near‑random answers into useful results for prototyping.

Evidence Ref: Table 2 (KGQA Wiki)
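The difference between the two conditions compared in this finding can be sketched as a prompt builder that optionally prepends the graph. The triple format, the section labels, and the `build_kgqa_prompt` name are assumptions for illustration, not the benchmark's actual prompt template.

```python
def build_kgqa_prompt(question, triples=None):
    """Build a zero-shot prompt, or a zero-shot+graph prompt when triples are given."""
    parts = []
    if triples:
        facts = "\n".join(f"({s}, {r}, {o})" for s, r, o in triples)
        parts.append("Knowledge graph triples:\n" + facts)
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n\n".join(parts)


# Illustrative triples; a real run would use the subgraph retrieved for the question.
triples = [("Marie Curie", "born_in", "Warsaw"), ("Warsaw", "capital_of", "Poland")]
question = "In which country was Marie Curie born?"

print(build_kgqa_prompt(question))           # zero-shot
print(build_kgqa_prompt(question, triples))  # zero-shot + graph
```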

A single example lets the LLM generate executable Cypher queries for 1‑hop QA.

Numbers: one-shot Cypher on MetaQA-1hop = 99.00

Practical Use: Use one-shot examples to get high‑quality query generation from LLMs and run them in a graph DB for fast retrieval.

Evidence Ref: Table 2 (Cypher Generation)
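A one-shot Cypher prompt might be assembled as follows. The example question, the `Person`/`Movie` labels, and the `DIRECTED` relationship are a hypothetical movie schema chosen for illustration; the actual MetaQA graph schema and the authors' example may differ.

```python
# Hypothetical worked example shown to the model (assumed schema, not from the paper).
EXAMPLE_QUESTION = "Who directed Inception?"
EXAMPLE_CYPHER = (
    'MATCH (p:Person)-[:DIRECTED]->(m:Movie {title: "Inception"}) RETURN p.name'
)


def build_one_shot_cypher_prompt(question):
    """Return a prompt with one solved example followed by the new question."""
    return "\n".join([
        "Translate the question into a Cypher query.",
        "",
        f"Question: {EXAMPLE_QUESTION}",
        f"Cypher: {EXAMPLE_CYPHER}",
        "",
        f"Question: {question}",
        "Cypher:",
    ])


print(build_one_shot_cypher_prompt("Who directed The Matrix?"))
```

The generated query should still be executed against the target database before it is trusted, since a syntactically valid query can reference labels that do not exist in the graph.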

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	zero-shot 9.23%; zero-shot+graph 56.38%	SOTA 64.70%	zero-shot+graph +47.15 pp vs zero-shot	Wiki	Table 2 shows large gain when graph text is added	Table 2
Accuracy	one-shot 99.00%	zero-shot Cypher 30.00%	+69.00 pp	MetaQA-1hop	Table 2 one-shot Cypher reaches 99% executable queries	Table 2
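The deltas in the table are plain differences in percentage points. The short check below recomputes them from the reported accuracies and also derives the remaining gap to the SOTA baseline on Wiki; the `delta_pp` helper is a name introduced here for illustration.

```python
def delta_pp(value, baseline):
    """Difference between two accuracies, in percentage points."""
    return round(value - baseline, 2)


# Accuracies (%) as reported in the table above.
wiki_zero_shot, wiki_with_graph, wiki_sota = 9.23, 56.38, 64.70
cypher_zero_shot, cypher_one_shot = 30.00, 99.00

print("Wiki, graph vs zero-shot:", delta_pp(wiki_with_graph, wiki_zero_shot))
print("Wiki, graph vs SOTA:", delta_pp(wiki_with_graph, wiki_sota))
print("MetaQA-1hop, one-shot vs zero-shot:", delta_pp(cypher_one_shot, cypher_zero_shot))
```

The second line is negative: even with the graph text included, the LLM stays below the graph-specialized baseline on Wiki.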

What To Try In 7 Days

Add a compact graph description (edge list or neighbor summaries) to LLM prompts for KGQA and re-run key queries.

Try one-shot Cypher examples to generate executable graph DB queries and validate them in Neo4j.

Use self-prompting to produce a format explanation and 1–2 hop neighborhood summaries before asking the LLM for predictions.
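For the self-prompting step, one possible setup is to extract the 1–2 hop subgraph around a target node and ask the LLM to explain the format and summarize the neighborhood before the real question is posed. The sketch below covers only the extraction and the first-stage prompt; the LLM call is omitted, and the function names and prompt wording are assumptions.

```python
from collections import deque


def k_hop_neighbors(adjacency, start, k):
    """Breadth-first search returning {node: hop_distance} within k hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    distance.pop(start)
    return distance


def build_self_prompt(adjacency, target, k=2):
    """First-stage prompt: the k-hop subgraph plus a request to explain and summarize it."""
    nodes = {target, *k_hop_neighbors(adjacency, target, k)}
    lines = [
        f"{src} -> {dst}"
        for src in sorted(nodes)
        for dst in adjacency.get(src, [])
        if dst in nodes
    ]
    graph_text = "Graph (edge list, 'source -> target'):\n" + "\n".join(lines)
    request = (
        f"First explain the format above, then summarize the "
        f"1 to {k} hop neighborhood of node {target}."
    )
    return graph_text + "\n\n" + request


# Toy citation-style graph for illustration.
adjacency = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2", "p4"], "p4": ["p3"]}
print(build_self_prompt(adjacency, "p1", k=2))
```

The model's answer to this first prompt would then be prepended to the second prompt that asks for the actual prediction.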

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: CDLA-Permissive-2.0 (data), MIT (code)

Code URLs

https://anonymous.4open.science/r/GPT4Graph

Data URLs

https://anonymous.4open.science/r/GPT4Graph

Risks & Boundaries

Limitations

Evaluations use only InstructGPT (text‑davinci variants); results may differ for other LLMs.

Subgraphs are small (≈10–20 nodes); behavior on large graphs is not measured.

When Not To Use

Do not rely on these LLM prompt recipes for production KGQA that needs state‑of‑the‑art accuracy.

Avoid for very large graphs, where the text serialization would be impractically long or would get truncated.
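A quick way to see whether a graph is too large for this approach is to check how much of its serialization fits a fixed input budget. The sketch below uses character count as a rough proxy for the model's token limit, which is an assumption; a real check would use the tokenizer of the target model.

```python
def fit_edges_to_budget(edges, max_chars):
    """Keep edges in order until the serialized text would exceed max_chars.

    Returns the serialized text and the number of edges that were dropped.
    """
    kept, used = [], 0
    for src, dst in edges:
        line = f"{src} -> {dst}"
        cost = len(line) + 1  # +1 for the newline separator
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept), len(edges) - len(kept)


edges = [("a", "b"), ("b", "c"), ("c", "d")]
text, dropped = fit_edges_to_budget(edges, max_chars=14)
print(text)
print("edges dropped:", dropped)
```

A non-zero dropped count means the model would reason over an incomplete graph, so structural answers such as degree or diameter could no longer be trusted.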

Failure Modes

Numeric and structural computations can be wrong or hallucinated without explicit graph encoding.

Performance is sensitive to prompt order, role instructions and example choice.
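Sensitivity to prompt order can be probed by generating every ordering of the prompt components and scoring each one on the same questions. The sketch below only enumerates the variants; the component names and texts are placeholders, and the evaluation loop is left out.

```python
from itertools import permutations


def prompt_variants(parts):
    """Return (order, prompt) pairs for every ordering of the named prompt parts."""
    return [
        (order, "\n\n".join(parts[name] for name in order))
        for order in permutations(parts)
    ]


# Placeholder components for illustration.
parts = {
    "role": "You are a graph analyst.",
    "graph": "A -> B\nB -> C",
    "question": "What is the degree of node B?",
}

for order, prompt in prompt_variants(parts):
    print(order)
```

With three components this yields six prompts; comparing accuracy across them gives a simple estimate of how much ordering alone moves the result.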

Core Entities

Models

InstructGPT (text-davinci-001), InstructGPT (text-davinci-002), InstructGPT (text-davinci-003)

Metrics

ACC

Datasets

OGBN-ARXIV, Aminer, OGBG-MOLHIV, OGBG-MOLPCBA, Wiki, MetaQA

Benchmarks

GUC (GPT4Graph benchmark)

