LLMs generate, explain, and iteratively fix Cypher queries so non-experts can query graph databases in plain English

Overview

Decision Snapshot: Needs Validation

The architecture is a practical prototype: code is available and experiments are statistically grounded, but datasets and question sets are small and domain shift remains a risk.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Larissa Pusch, Alexandre Courtiol, Tim Conrad


Why It Matters For Business

This pipeline lets non-programmers query curated Knowledge Graphs with auditable Cypher and guided edits, reducing developer bottlenecks and increasing trust for knowledge-intensive applications.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors build an interactive pipeline in which an LLM generates executable Cypher queries for a Knowledge Graph, explains each query in plain language, and accepts natural-language amendments to edit it. They evaluate explanation and fault-detection abilities on a 90-query synthetic movie KG and test query generation on two real KGs (MaRDI and a Hyena research KG). Some models exceed 70% one-sentence explanation accuracy, and a few combine high fault detection with low false-positive rates. Domain shift matters: many models do well on the MaRDI software KG, but fewer succeed consistently on the expert Hyena questions.

Problem Statement

LLMs are fluent but can hallucinate, rely on outdated knowledge, and hide their reasoning. Text-based RAG struggles with multi-hop queries. Knowledge Graphs are precise but require learning a query language (Cypher or SPARQL), which creates a usability gap for non-experts who need verifiable, multi-step answers.

Main Contribution

An end-to-end, modular pipeline where an LLM generates Cypher queries, explains them in plain language, executes them on Neo4j, and edits them via natural-language amendments.

A controlled 90-query benchmark (synthetic movie KG) to measure explanation quality and fault detection across multiple LLMs, plus two real-world case studies on MaRDI and a Hyena KG.
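The generate, explain, execute, and amend loop from the first contribution can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `llm` and `run_cypher` callables, the prompt wording, and the `QueryTurn` type are hypothetical stand-ins, chosen so the control flow is visible without a live model or Neo4j instance.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class QueryTurn:
    """One round of the loop: the query, its explanation, and its results."""
    cypher: str
    explanation: str
    rows: List[Any]


def run_turn(
    question: str,
    schema: str,
    llm: Callable[[str], str],               # hypothetical: prompt in, text out
    run_cypher: Callable[[str], List[Any]],  # hypothetical: query in, rows out
    previous: Optional[QueryTurn] = None,
) -> QueryTurn:
    """Generate (or amend) a Cypher query, explain it, then execute it."""
    if previous is None:
        prompt = f"Schema:\n{schema}\n\nWrite a Cypher query for: {question}"
    else:
        # An amendment edits the previous query instead of starting over.
        prompt = (
            f"Schema:\n{schema}\n\nCurrent query:\n{previous.cypher}\n\n"
            f"Change it as follows: {question}"
        )
    cypher = llm(prompt).strip()
    explanation = llm(
        f"Explain in one sentence what this query returns:\n{cypher}"
    ).strip()
    return QueryTurn(cypher, explanation, run_cypher(cypher))
```

Passing the returned `QueryTurn` back in as `previous` on the next call is what turns a one-shot generator into the iterative, human-in-the-loop workflow: the user reads the explanation, states a correction, and the model edits the existing query.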

Key Findings

Some LLMs produce correct and complete one-sentence explanations on the synthetic 90-query benchmark.

Numbers: o1-preview, deepseek-reasoner-api, and o3-mini each reach ≥70% accuracy (n=90).

Practical Use: Pick models like o1-preview or deepseek-reasoner-api for explanation-first workflows; expect roughly 70–77% correct short explanations on similar, controlled KG queries.

Evidence Ref: Table 2; Sec. 3.5.1

Certain models reliably detect injected query faults while avoiding false alarms.

Numbers: o1-preview and deepseek-reasoner-api flagged more than 85% of injected faults with false-positive rates below 10% (n=75 perturbed queries).

Practical Use: Use these models when you need automatic sanity checks on generated Cypher; they balance sensitivity and specificity better than the others tested.

Evidence Ref: Table 2; Sec. 3.5.2 & 3.5.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
One-sentence explanation accuracy	o1-preview 76.6%, deepseek-reasoner-api 73.3%, o3-mini 71.1%, deepseek-r1:70b 66.6%, claude 52.2% (n=90)	n/a	n/a	Synthetic movie KG (90 queries)	Table 2 shows per-model accuracies for explanation summaries	Table 2
Problem (perturbation) detection rate	o1-preview 88.0%, deepseek-reasoner-api 89.3%, claude 85.3%, o3-mini 77.3%, deepseek-r1:70b 68.0% (n=75)	n/a	n/a	Perturbed queries (synthetic benchmark)	Table 2 reports detection rates per model	Table 2
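Because these rates rest on 90 and 75 queries, it can help to attach an uncertainty range to each percentage before comparing models. The sketch below computes a Wilson score interval for a binomial proportion. The function name is hypothetical, and the count of 69 correct out of 90 is an assumption back-computed from the reported 76.6%, not a figure taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumed count: 69 of 90 approximates the reported 76.6% for o1-preview.
low, high = wilson_interval(69, 90)
print(f"{low:.3f} to {high:.3f}")
```

At this sample size the interval extends several percentage points on either side of the point estimate, so small gaps between neighbouring models in the table should be read with caution.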

What To Try In 7 Days

Wire a simple LangChain + Neo4j flow and feed an LLM the KG schema to generate Cypher for 10 representative questions.

Add an explanation prompt that outputs a one-sentence summary plus flagged issues for each generated query.

Run a small perturbation test (flip relation, wrong node type) to measure your model's fault-detection vs false-positive tradeoff.
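The perturbation test in the last step can be prototyped with very little code. The sketch below is an assumption-laden illustration, not the paper's benchmark tooling: `flip_first_relationship` is a hypothetical regex-based helper that reverses the first left-to-right relationship in a query, and `score_detection` expects you to supply `(is_perturbed, model_flagged)` pairs collected from your own model runs.

```python
import re
from typing import Iterable, Tuple

# Matches a left-to-right relationship such as -[:ACTED_IN]->
_FORWARD_REL = re.compile(r"-\[([^\]]*)\]->")


def flip_first_relationship(cypher: str) -> str:
    """Reverse the direction of the first forward relationship, if any."""
    return _FORWARD_REL.sub(r"<-[\1]-", cypher, count=1)


def score_detection(results: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (detection_rate, false_positive_rate).

    Each item is (is_perturbed, model_flagged). Detection rate is the share
    of perturbed queries that were flagged; false-positive rate is the share
    of clean queries that were flagged.
    """
    perturbed, clean = [], []
    for is_perturbed, flagged in results:
        (perturbed if is_perturbed else clean).append(flagged)
    detection = sum(perturbed) / len(perturbed) if perturbed else 0.0
    false_pos = sum(clean) / len(clean) if clean else 0.0
    return detection, false_pos


original = "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name"
print(flip_first_relationship(original))
```

Keeping clean and perturbed queries in the same run matters: a model that flags everything would score a perfect detection rate, and only the false-positive rate on the clean queries exposes that.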

Agent Features

Memory

No persistent long-term memory; uses schema and current query context

Tool Use

Cypher query generation, Neo4j execution, LangChain prompts

Frameworks

LangChain

Is Agentic

Yes

Architectures

LLM-centered modular pipeline

Collaboration

Human-in-the-loop (iterative amendment)

Optimization Features

Token Efficiency

Schema-aware prompting to reduce irrelevant tokens
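One way to picture schema-aware prompting is a compact, text-only rendering of the graph schema that is prepended to the user's question. The sketch below assumes a hypothetical in-memory schema description (a dict of node labels to property names plus a list of relationship triples); the paper's actual prompt format and schema source are not reproduced here.

```python
from typing import Dict, List, Tuple


def render_schema(
    nodes: Dict[str, List[str]],
    relationships: List[Tuple[str, str, str]],
) -> str:
    """Render node labels, properties, and relationship patterns compactly."""
    lines = ["Node labels and properties:"]
    for label, props in sorted(nodes.items()):
        lines.append(f"  (:{label} {{{', '.join(props)}}})")
    lines.append("Relationships:")
    for start, rel_type, end in relationships:
        lines.append(f"  (:{start})-[:{rel_type}]->(:{end})")
    return "\n".join(lines)


def build_prompt(question: str, schema_text: str) -> str:
    """Combine the schema and the question into a single generation prompt."""
    return (
        "Use only the labels, properties, and relationships listed below.\n\n"
        f"{schema_text}\n\n"
        f"Question: {question}\n"
        "Return a single Cypher query and nothing else."
    )


# Hypothetical movie schema for illustration.
schema = render_schema(
    {"Movie": ["title", "released"], "Person": ["name"]},
    [("Person", "ACTED_IN", "Movie")],
)
print(build_prompt("Which actors appeared in films released after 2000?", schema))
```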

System Optimization

Post-processing to remove LLM artifacts and ensure valid Cypher
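A typical LLM artifact is a query wrapped in a markdown code fence with surrounding chatter. The sketch below shows one plausible clean-up step under that assumption; the function names and the keyword list are hypothetical, and the keyword check is only a crude plausibility test, not a Cypher parser and not the authors' post-processing code.

```python
import re

FENCE = "`" * 3  # built programmatically to keep this example fence-safe

# Captures the body of the first fenced block, with an optional "cypher" tag.
_FENCED = re.compile(
    FENCE + r"(?:cypher)?\s*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)

# Clauses a read query commonly starts with (illustrative, not exhaustive).
_START_KEYWORDS = ("MATCH", "OPTIONAL", "WITH", "CALL", "UNWIND", "RETURN")


def extract_cypher(raw: str) -> str:
    """Pull the query out of a fenced block if present, then trim it."""
    match = _FENCED.search(raw)
    text = match.group(1) if match else raw
    return text.strip().rstrip(";").strip()


def looks_like_cypher(query: str) -> bool:
    """Crude check: does the text start with a common Cypher clause?"""
    parts = query.split(None, 1)
    return bool(parts) and parts[0].upper() in _START_KEYWORDS


raw_reply = (
    "Here is the query:\n"
    + FENCE + "cypher\nMATCH (m:Movie) RETURN m.title;\n" + FENCE
    + "\nLet me know if you need changes."
)
print(extract_cypher(raw_reply))
```

Running the extracted text through the database's own `EXPLAIN` before execution would be a stronger validity check than the keyword test, at the cost of a round trip.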

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://git.zib.de/lpusch/talk2cypher

Data URLs

MaRDI public SPARQL/REST endpoints (paper states MaRDI exposes public endpoints)

https://hyena-project.com (Hyena project reference; access controlled)

Risks & Boundaries

Limitations

Synthetic benchmark is small (90 queries) and may not capture real-world schema complexity.

MaRDI and Hyena experiments use small, curated question sets (9 and 5 questions), limiting generalizability.

When Not To Use

When you need fully automated workflows with no human intervention.

For extremely large or rapidly changing graphs without schema constraints.

Failure Modes

One-sentence explanations can drop numeric or time constraints (for example, omitting years), which harms faithfulness.

Flipped relationship directions and contradictory WHERE clauses can completely mislead the model.
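The first failure mode lends itself to a cheap automated guard: compare the numeric literals in the query with those mentioned in the explanation. The sketch below is a hypothetical heuristic, not something the paper describes. It will miss numbers written as words ("five") and will also flag literals such as `LIMIT` values that an explanation may reasonably omit.

```python
import re
from typing import List

# Standalone integers or decimals; digits inside identifiers like m1 are skipped.
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def missing_numbers(cypher: str, explanation: str) -> List[str]:
    """Return numeric literals present in the query but absent from the explanation."""
    in_query = set(_NUMBER.findall(cypher))
    in_text = set(_NUMBER.findall(explanation))
    return sorted(in_query - in_text)


query = "MATCH (m:Movie) WHERE m.released > 1999 RETURN m.title"
print(missing_numbers(query, "Returns the titles of all movies."))
print(missing_numbers(query, "Returns titles of movies released after 1999."))
```

A non-empty result is a prompt for human review or for regenerating the explanation, not proof that the explanation is wrong.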

Core Entities

Models

o1-preview, o3-mini, deepseek-reasoner-api, deepseek-r1:70b, claude-3.7-sonnet, gpt-5.2, o1, o3, o4-mini, gemma3, qwq, phi4, llama3.3, nemotron

Metrics

Accuracy, problem detection rate, false positive rate

Datasets

Synthetic Movie KG (90-query benchmark), MaRDI KG subgraph (software/publication slice), Hyena KG (Ngorongoro field data)

Benchmarks

90-query Cypher explanation benchmark (movie KG), MaRDI 9-question query set, Hyena 5-question expert set

