Overview
Production Readiness
0.3
Novelty Score
0.45
Cost Impact Score
0.35
Citation Count
14
Why It Matters For Business
Serializing graphs as text and applying simple prompt strategies is a cheap way to prototype KG question answering and automated query generation with LLMs, but specialized graph models still give higher accuracy in production.
Summary TLDR
The authors build a framework and benchmark (GUC/GPT4Graph) to test whether LLMs can understand graph data. They convert graphs into text (a graph description language), experiment with manual prompting and self-prompting, and evaluate InstructGPT (text‑davinci variants) on 10 tasks spanning structural understanding (degree, diameter, edges) and semantic understanding (KGQA, node/graph classification, query generation). Key findings: adding the raw graph text greatly improves some tasks (e.g., KGQA on Wiki: 9.23 → 56.38), one-shot query generation can produce executable Cypher (99% on MetaQA-1hop), and neighborhood context helps node classification (one‑shot 2-hop 60% vs zero‑shot self 48%). Overall, LLMs help for prototyping but do not yet match specialized graph models.
Problem Statement
Modern LLMs are strong on text but graphs are relational and multi-dimensional. We need to know if and how LLMs can reason over graphs, what input formats and prompts help, and how LLMs compare to graph-specialized methods across common graph tasks.
Main Contribution
A simple pipeline that converts graphs into a text format (graph description language) and feeds them to LLMs with a prompt handler (manual and self-prompting).
A public benchmark covering ten graph tasks (structure + semantic) to evaluate LLM graph understanding.
An empirical study showing which prompt/input designs help and where LLMs still fall short versus graph models; code and data release planned.
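The graph-to-text step in the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the authors' graph description language: the edge-list layout, the `graph_to_edge_list_text` and `build_prompt` helpers, and the prompt wording are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Edge = Tuple[str, str]


def graph_to_edge_list_text(nodes: List[str], edges: List[Edge]) -> str:
    """Serialize an undirected graph as text: a node line, then one edge per line.

    The layout is an assumed stand-in for a graph description language.
    """
    lines = ["Nodes: " + ", ".join(nodes), "Edges:"]
    lines += [f"{u} -- {v}" for u, v in edges]
    return "\n".join(lines)


def build_prompt(graph_text: str, question: str, role: Optional[str] = None) -> str:
    """Assemble a manual prompt: optional role line, graph text, then the question."""
    parts = []
    if role:
        parts.append(role)
    parts.append("Graph:\n" + graph_text)
    parts.append("Question: " + question)
    parts.append("Answer:")
    return "\n\n".join(parts)


nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C")]
graph_text = graph_to_edge_list_text(nodes, edges)
prompt = build_prompt(graph_text, "What is the degree of node B?")
```

The resulting `prompt` string would then be sent to whichever LLM endpoint is in use; that call is left out because it depends on the provider.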
Key Findings
Adding the graph text to LLM inputs raises KGQA accuracy on Wiki from 9.23 to 56.38.
A single example lets the LLM generate executable Cypher queries for 1‑hop QA (99% on MetaQA-1hop).
Local neighborhood text improves node classification accuracy (one‑shot 2-hop 60% vs zero‑shot self 48%).
Input design choices (order, role prompt, examples, CoT) change structural task scores substantially.
Overall, LLMs usually trail specialized models on semantic graph benchmarks.
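The neighborhood-context finding can be made concrete with a small sketch that collects a node's k-hop neighbors and renders them as text for a node classification prompt. The adjacency-dict representation, the paper titles, and the output wording are hypothetical choices for illustration, not the benchmark's actual format.

```python
from collections import deque
from typing import Dict, List


def k_hop_neighbors(adj: Dict[str, List[str]], start: str, k: int) -> Dict[str, int]:
    """Return {node: hop distance} for nodes within k hops of start (start excluded)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond k hops
        for nb in adj.get(node, []):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    del dist[start]
    return dist


def neighborhood_text(adj: Dict[str, List[str]], titles: Dict[str, str],
                      start: str, k: int) -> str:
    """Render the target node and its neighbors, grouped by hop, as prompt text."""
    hops = k_hop_neighbors(adj, start, k)
    lines = [f"Target paper: {titles[start]}"]
    for hop in range(1, k + 1):
        names = sorted(titles[n] for n, d in hops.items() if d == hop)
        if names:
            lines.append(f"{hop}-hop neighbors: " + "; ".join(names))
    return "\n".join(lines)


# Hypothetical citation chain p1 - p2 - p3 - p4.
adj = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2", "p4"], "p4": ["p3"]}
titles = {"p1": "Graph Attention", "p2": "Message Passing",
          "p3": "Spectral Methods", "p4": "Random Walks"}
summary = neighborhood_text(adj, titles, "p1", 2)
```

Capping the expansion at 2 hops keeps the serialized context short, which matters because the prompt grows quickly with each additional hop.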
Results
[Figures: three accuracy charts; structure task sensitivity (size detection)]
Who Should Care
What To Try In 7 Days
Add a compact graph description (edge list or neighbor summaries) to LLM prompts for KGQA and re-run key queries.
Try one-shot Cypher examples to generate executable graph DB queries and validate them in Neo4j.
Use self-prompting to produce a format explanation and 1–2 hop neighborhood summaries before asking the LLM for predictions.
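For the one-shot Cypher suggestion, a prompt builder along these lines may be a reasonable starting point. The schema string, the `Person`/`Movie`/`ACTED_IN` labels, the example question, and the `looks_like_read_query` pre-check are all assumptions for illustration; the pre-check is a crude filter and does not replace running the query against a real Neo4j instance.

```python
import re

# Hypothetical one-shot demonstration; labels and properties are assumed.
ONE_SHOT_EXAMPLE = {
    "question": "Which movies did Tom Hanks act in?",
    "cypher": "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN m.title",
}


def build_cypher_prompt(schema: str, example: dict, question: str) -> str:
    """Build a one-shot prompt: schema, one worked example, then the new question."""
    return (
        "Translate the question into a Cypher query.\n"
        f"Schema: {schema}\n\n"
        f"Question: {example['question']}\n"
        f"Cypher: {example['cypher']}\n\n"
        f"Question: {question}\n"
        "Cypher:"
    )


def looks_like_read_query(cypher: str) -> bool:
    """Crude pre-check before execution: MATCH ... RETURN with no write clauses.

    It can misfire on string literals that contain a clause keyword.
    """
    upper = cypher.upper().strip()
    has_write = re.search(r"\b(CREATE|MERGE|DELETE|SET|REMOVE|DROP)\b", upper)
    return upper.startswith("MATCH") and "RETURN" in upper and not has_write


schema = "(:Person)-[:ACTED_IN]->(:Movie), (:Person)-[:DIRECTED]->(:Movie)"
cypher_prompt = build_cypher_prompt(schema, ONE_SHOT_EXAMPLE, "Who directed Inception?")
```

Generated queries that pass the pre-check would still need to be executed in a sandboxed database to confirm they run and return sensible rows.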
Reproducibility
License
- CDLA-Permissive-2.0 (data), MIT (code)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluations use only InstructGPT (text‑davinci variants); results may differ for other LLMs.
- Subgraphs are small (≈10–20 nodes); behavior on large graphs is not measured.
- The benchmark and code are pending full public release; some details rely on planned artifacts.
When Not To Use
- Do not rely on these LLM prompt recipes for production KGQA that needs state‑of‑the‑art accuracy.
- Avoid for very large graphs, where text serialization would be impractical or would be truncated by the context window.
- Avoid when deterministic, provable graph algorithms (e.g., exact diameter) are required.
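When exact structural answers are needed, a classical algorithm is the safer route and can also serve as ground truth for checking LLM outputs. The sketch below computes node degrees and the exact diameter of a small connected, undirected graph by running breadth-first search from every node; the adjacency-dict input (every node present as a key, edges listed in both directions) is an assumption of this example.

```python
from collections import deque
from typing import Dict, List


def eccentricity(adj: Dict[str, List[str]], source: str) -> int:
    """Longest shortest-path distance from source, found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    if len(dist) != len(adj):
        raise ValueError("graph is disconnected; diameter is not defined here")
    return max(dist.values())


def diameter(adj: Dict[str, List[str]]) -> int:
    """Exact diameter: the maximum eccentricity over all nodes."""
    return max(eccentricity(adj, node) for node in adj)


def degrees(adj: Dict[str, List[str]]) -> Dict[str, int]:
    """Degree of each node in an undirected graph stored as adjacency lists."""
    return {node: len(nbs) for node, nbs in adj.items()}


# Path graph A - B - C - D.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
exact_diameter = diameter(path)
node_degrees = degrees(path)
```

All-pairs BFS costs O(V·(V+E)), which is trivial for subgraphs of 10 to 20 nodes like those in the study.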
Failure Modes
- Numeric and structural computations can be wrong or hallucinated without explicit graph encoding.
- Performance is sensitive to prompt order, role instructions and example choice.
- Multi‑hop reasoning degrades quickly unless the graph is provided and carefully summarized.
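Prompt sensitivity of the kind listed above can be measured with a small harness that scores the same questions under several prompt variants. The sketch below varies edge order and the presence of a role line; the variant names, the `ask_llm` callable, and the `fake_llm` stand-in are hypothetical, and a real run would swap in an actual model call.

```python
import random
from itertools import product
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]
Example = Tuple[List[Edge], str, str]  # (edges, question, expected answer)


def make_variants(edges: List[Edge], question: str, role: str,
                  seed: int = 0) -> Dict[str, str]:
    """Build prompts that differ only in edge order and role-line presence."""
    shuffled = edges[:]
    random.Random(seed).shuffle(shuffled)
    variants = {}
    orders = [("original", edges), ("shuffled", shuffled)]
    for (order_name, edge_list), use_role in product(orders, [False, True]):
        graph_text = "\n".join(f"{u} -- {v}" for u, v in edge_list)
        parts = [f"Graph:\n{graph_text}", f"Question: {question}", "Answer:"]
        if use_role:
            parts.insert(0, role)
        variants[f"{order_name}|role={use_role}"] = "\n\n".join(parts)
    return variants


def variant_accuracy(examples: List[Example], ask_llm: Callable[[str], str],
                     role: str) -> Dict[str, float]:
    """Exact-match accuracy per prompt variant over all examples."""
    hits: Dict[str, int] = {}
    for edges, question, expected in examples:
        for name, prompt in make_variants(edges, question, role).items():
            hits[name] = hits.get(name, 0) + int(ask_llm(prompt).strip() == expected)
    return {name: count / len(examples) for name, count in hits.items()}


def fake_llm(prompt: str) -> str:
    """Stand-in model for the demo: answers correctly only when a role line leads."""
    return "2" if prompt.startswith("You are") else "3"


examples = [([("A", "B"), ("B", "C")], "What is the degree of node B?", "2")]
scores = variant_accuracy(examples, fake_llm, role="You are a graph analyst.")
```

A large spread between the best and worst variant in `scores` is a signal that reported accuracy depends on prompt design as much as on the model.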
Core Entities
Models
- InstructGPT (text-davinci-001)
- InstructGPT (text-davinci-002)
- InstructGPT (text-davinci-003)
Metrics
- ACC
Datasets
- OGBN-ARXIV
- Aminer
- OGBG-MOLHIV
- OGBG-MOLPCBA
- Wiki
- MetaQA
Benchmarks
- GUC (GPT4Graph benchmark)

