Overview
The paper provides systematic empirical evidence that prompt and input design strongly affect LLM graph reasoning; results support quick prototyping but show LLMs rarely beat graph‑native models across benchmarks.
Citations: 14
Evidence Strength: 0.80
Confidence: 0.78
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
License: CDLA-Permissive-2.0 (data), MIT (code)
At A Glance
Cost impact: 35%
Production readiness: 30%
Novelty: 45%
Why It Matters For Business
Feeding graph text and simple prompt strategies to LLMs is a cheap way to build KG question answering and automated query generation prototypes, but specialized graph models still give higher accuracy for production.
Summary TLDR
The authors build a framework and benchmark (GUC/GPT4Graph) to test whether LLMs can understand graph data. They convert graphs into text (a graph description language), experiment with manual prompting and self-prompting, and evaluate InstructGPT (text‑davinci variants) on 10 tasks spanning structural tasks (degree, diameter, edges) and semantic tasks (KGQA, node/graph classification, query generation). Key findings: adding the raw graph text greatly improves some tasks (e.g., KGQA accuracy on Wiki: 9.23% → 56.38%), one-shot query generation can produce executable Cypher (99% on MetaQA-1hop), and neighborhood context helps node classification (one‑shot 2-hop 60% vs zero‑shot self 48%). Overall, LLMs help for prototyping but do …
Problem Statement
Modern LLMs are strong on text but graphs are relational and multi-dimensional. We need to know if and how LLMs can reason over graphs, what input formats and prompts help, and how LLMs compare to graph-specialized methods across common graph tasks.
Main Contribution
A simple pipeline that converts graphs into a text format (graph description language) and feeds them to LLMs with a prompt handler (manual and self-prompting).
A public benchmark covering ten graph tasks (structure + semantic) to evaluate LLM graph understanding.
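The graph-to-text step can be illustrated with a minimal sketch. The serialization format, the function names `graph_to_text` and `build_prompt`, and the prompt wording are all assumptions made for illustration; they are not the paper's graph description language or its prompt handler.

```python
def graph_to_text(edges, directed=False):
    """Serialize an edge list into plain text (hypothetical format, not the paper's)."""
    # Collect every node that appears in at least one edge.
    nodes = sorted({node for edge in edges for node in edge})
    connector = "->" if directed else "--"
    lines = [f"Nodes: {', '.join(nodes)}", "Edges:"]
    lines.extend(f"{u} {connector} {v}" for u, v in edges)
    return "\n".join(lines)


def build_prompt(graph_text, question):
    """Wrap the serialized graph and a question into one prompt string."""
    return (
        "Given the graph below, answer the question.\n\n"
        f"{graph_text}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


edges = [("A", "B"), ("B", "C"), ("A", "C")]
prompt = build_prompt(graph_to_text(edges), "What is the degree of node A?")
```

The resulting `prompt` string would then be sent to whichever LLM is under test; that call is left out here because it depends on the provider.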
Key Findings
Adding the graph text to LLM inputs dramatically improves KGQA on Wiki.
A single example lets the LLM generate executable Cypher queries for 1‑hop QA.
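A one-shot prompt of the kind described in the second finding might look like the sketch below. The example question, the `Person`/`Movie` labels, the `ACTED_IN` relationship, and the function name `build_cypher_prompt` are hypothetical; the actual schema and prompt text used in the paper are not given here.

```python
# Hypothetical demonstration pair; the schema (Person, Movie, ACTED_IN) is assumed.
ONE_SHOT_EXAMPLE = {
    "question": "Which movies did Tom Hanks act in?",
    "cypher": (
        "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) "
        "RETURN m.title"
    ),
}


def build_cypher_prompt(question, example=ONE_SHOT_EXAMPLE):
    """Build a one-shot prompt: one solved example, then the new question."""
    return (
        "Translate the question into a Cypher query.\n\n"
        f"Question: {example['question']}\n"
        f"Cypher: {example['cypher']}\n\n"
        f"Question: {question}\n"
        "Cypher:"
    )


cypher_prompt = build_cypher_prompt("Who directed Inception?")
```

Generated queries should still be executed against the target database before being trusted, since a query that parses is not necessarily one that returns the right answer.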
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | zero-shot 9.23%; zero-shot+graph 56.38% | SOTA 64.70% | zero-shot+graph +47.15 pp vs zero-shot | Wiki | Table 2 shows large gain when graph text is added | Table 2 |
| Accuracy | one-shot 99.00% | zero-shot Cypher 30.00% | +69.00 pp | MetaQA-1hop | Table 2 one-shot Cypher reaches 99% executable queries | Table 2 |
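The deltas in the table are percentage-point differences, and they can be recomputed directly from the reported values. The short check below does that and also derives the remaining gap to the SOTA figure on Wiki (8.32 pp), which the table implies but does not list; the helper name `delta_pp` is an assumption.

```python
def delta_pp(value, baseline):
    """Percentage-point difference between two accuracies given in percent."""
    return round(value - baseline, 2)


wiki_gain = delta_pp(56.38, 9.23)          # zero-shot+graph vs zero-shot
wiki_gap_to_sota = delta_pp(64.70, 56.38)  # SOTA vs zero-shot+graph
metaqa_gain = delta_pp(99.00, 30.00)       # one-shot vs zero-shot Cypher
```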
What To Try In 7 Days
Add a compact graph description (edge list or neighbor summaries) to LLM prompts for KGQA and re-run key queries.
Try one-shot Cypher examples to generate executable graph DB queries and validate them in Neo4j.
Use self-prompting to produce a format explanation and 1–2 hop neighborhood summaries before asking the LLM for predictions.
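For the neighborhood-summary suggestion, the 1–2 hop context can be collected with a breadth-first search before it is written into the prompt. This is a sketch under stated assumptions: the adjacency-dict input, the `labels` mapping, and the summary wording are illustrative choices, and in the self-prompting setup the LLM itself would produce the summary rather than a script.

```python
from collections import deque


def k_hop_neighbors(adjacency, start, k):
    """Return {node: hop distance} for nodes within k hops of start (start excluded)."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]
    return distances


def summarize_neighborhood(adjacency, labels, start, k=2):
    """Render the k-hop neighborhood of start as text for inclusion in a prompt."""
    hops = k_hop_neighbors(adjacency, start, k)
    lines = [f"Target node: {start}"]
    for hop in range(1, k + 1):
        members = sorted(n for n, d in hops.items() if d == hop)
        described = ", ".join(
            f"{n} (label: {labels.get(n, 'unknown')})" for n in members
        ) or "none"
        lines.append(f"{hop}-hop neighbors: {described}")
    return "\n".join(lines)


adjacency = {
    "p1": ["p2", "p3"],
    "p2": ["p1", "p4"],
    "p3": ["p1"],
    "p4": ["p2", "p5"],
    "p5": ["p4"],
}
labels = {"p2": "ML", "p3": "DB", "p4": "ML"}
summary = summarize_neighborhood(adjacency, labels, "p1", k=2)
```

Capping `k` at 2 keeps the serialized context short, which matters given the small subgraph sizes the evaluation used.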
Risks & Boundaries
Limitations
Evaluations use only InstructGPT (text‑davinci variants); results may differ for other LLMs.
Subgraphs are small (≈10–20 nodes); behavior on large graphs is not measured.
When Not To Use
Do not rely on these LLM prompt recipes for production KGQA that needs state‑of‑the‑art accuracy.
Avoid for very large graphs, where text serialization would be impractical or would be truncated by the model's context limit.
Failure Modes
Numeric and structural computations can be wrong or hallucinated without explicit graph encoding.
Performance is sensitive to prompt order, role instructions and example choice.
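Because structural answers can be wrong or hallucinated, one inexpensive safeguard is to recompute the quantity directly from the graph and compare it with the model's reply. The sketch below does this for node degree; the helper names and the rule of taking the first integer in the reply are assumptions, and the rule would misfire if the reply repeats a node name that contains digits.

```python
import re
from collections import Counter


def node_degrees(edges):
    """Count the degree of every node in an undirected edge list."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees


def check_degree_answer(edges, node, llm_answer):
    """Compare the first integer found in the LLM reply with the true degree."""
    match = re.search(r"\d+", llm_answer)
    if match is None:
        return False  # no number in the reply counts as a failed check
    return int(match.group()) == node_degrees(edges)[node]


sample_edges = [("A", "B"), ("B", "C"), ("A", "C")]
is_correct = check_degree_answer(sample_edges, "A", "The degree is 2.")
```

The same pattern extends to other structural tasks such as edge counts or diameter, as long as a reference implementation computes the ground truth.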

