LLMs generate, explain, and iteratively fix Cypher queries so non-experts can query graph databases in plain English

February 5, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The architecture is a practical prototype: code is available and experiments are statistically grounded, but datasets and question sets are small and domain shift remains a risk.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Larissa Pusch, Alexandre Courtiol, Tim Conrad

Links

Abstract / PDF / Code / Data

Why It Matters For Business

This pipeline lets non-programmers query curated Knowledge Graphs with auditable Cypher and guided edits, reducing developer bottlenecks and increasing trust for knowledge-intensive applications.

Who Should Care

Summary TLDR

The authors build an interactive pipeline in which an LLM generates executable Cypher queries for a Knowledge Graph, explains each query in plain language, and accepts natural-language amendments to edit it. They evaluate explanation quality and fault detection on a 90-query synthetic movie KG and test query generation on two real KGs (MaRDI and a Hyena research KG). Three models (o1-preview, deepseek-reasoner-api, o3-mini) reach at least 70% one-sentence explanation accuracy, and two of them (o1-preview, deepseek-reasoner-api) combine fault detection above 85% with false-positive rates below 10%. Domain shift matters: many models do well on the MaRDI software KG, but fewer succeed consistently on the expert Hyena questions.

Problem Statement

LLMs are fluent but can hallucinate, be outdated, and hide reasoning. Text-based RAG struggles with multi-hop queries. Knowledge Graphs are precise but require learning query languages (Cypher/SPARQL), creating a usability gap for non-experts who need verifiable, multi-step answers.

Main Contribution

An end-to-end, modular pipeline where an LLM generates Cypher queries, explains them in plain language, executes them on Neo4j, and edits them via natural-language amendments.

A controlled 90-query benchmark (synthetic movie KG) to measure explanation quality and fault detection across multiple LLMs, plus two real-world case studies on MaRDI and a Hyena KG.
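The generate, explain, execute, and amend loop described above can be sketched as a small session object. This is a minimal illustration, not the authors' implementation: the `QuerySession` class, the prompt wording, and the `llm` and `run` callables are all hypothetical stand-ins (in practice `llm` would wrap a chat model and `run` would wrap a Neo4j session).

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interfaces: any function with these shapes can be plugged in.
LLM = Callable[[str], str]      # prompt -> completion text
Runner = Callable[[str], list]  # Cypher string -> result rows


@dataclass
class QuerySession:
    schema: str   # textual description of the KG schema
    llm: LLM
    run: Runner
    cypher: str = ""
    history: List[str] = field(default_factory=list)

    def generate(self, question: str) -> str:
        """Ask the model for a Cypher query, giving it the schema as context."""
        prompt = f"Schema:\n{self.schema}\n\nWrite one Cypher query answering: {question}"
        self.cypher = self.llm(prompt).strip()
        self.history.append(self.cypher)
        return self.cypher

    def explain(self) -> str:
        """Ask the model for a one-sentence plain-language summary of the current query."""
        prompt = f"Explain in one sentence what this Cypher query returns:\n{self.cypher}"
        return self.llm(prompt).strip()

    def amend(self, instruction: str) -> str:
        """Rewrite the current query according to a natural-language amendment."""
        prompt = (
            f"Schema:\n{self.schema}\n\nCurrent query:\n{self.cypher}\n\n"
            f"Rewrite the query so that: {instruction}"
        )
        self.cypher = self.llm(prompt).strip()
        self.history.append(self.cypher)
        return self.cypher

    def execute(self) -> list:
        """Run the current query through the injected runner."""
        return self.run(self.cypher)
```

Keeping every query version in `history` is one way to make the amendments auditable: the user can always see which query produced which result.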

Key Findings

Some LLMs produce correct and complete one-sentence explanations on the synthetic 90-query benchmark.

Numbers: o1-preview, deepseek-reasoner-api, o3-mini ≥70% accuracy (n=90).

Practical Use: Pick models like o1-preview or deepseek-reasoner-api for explanation-first workflows; expect roughly 70–77% correct short explanations on similar, controlled KG queries.

Evidence Ref: Table 2; Sec. 3.5.1

Certain models reliably detect injected query faults while avoiding false alarms.

Numbers: o1-preview and deepseek-reasoner-api flagged faults >85% and had false-positive rates <10% (n=75 perturbed queries).

Practical Use: Use these models when you need automatic sanity checks on generated Cypher; they balance sensitivity and specificity better than others.

Evidence Ref: Table 2; Sec. 3.5.2 & 3.5.3
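To make the two fault-detection numbers concrete: the detection rate is the share of perturbed queries the model flags, and the false-positive rate is the share of unperturbed queries it flags anyway. As plain arithmetic, 66 flagged out of 75 perturbed queries would correspond to an 88.0% detection rate. The sketch below computes both rates from labelled outcomes; the function name and the toy data are illustrative assumptions, not figures from the paper.

```python
from typing import Iterable, Tuple


def detection_metrics(results: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute (detection_rate, false_positive_rate).

    Each item is (is_perturbed, was_flagged) for one query.
    """
    results = list(results)
    perturbed = [flagged for is_perturbed, flagged in results if is_perturbed]
    clean = [flagged for is_perturbed, flagged in results if not is_perturbed]
    if not perturbed or not clean:
        raise ValueError("need at least one perturbed and one clean query")
    # True counts as 1, so sum() gives the number of flagged queries.
    return sum(perturbed) / len(perturbed), sum(clean) / len(clean)


# Illustrative toy data, NOT results from the paper:
# 10 perturbed queries (9 flagged) and 5 clean queries (1 flagged).
toy = [(True, True)] * 9 + [(True, False)] + [(False, True)] + [(False, False)] * 4
detection, false_pos = detection_metrics(toy)
print(f"detection rate: {detection:.1%}, false-positive rate: {false_pos:.1%}")
```

Reporting both numbers together matters because a model that flags every query scores a perfect detection rate while being useless as a sanity check.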

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | o1-preview 76.6%, deepseek-reasoner-api 73.3%, o3-mini 71.1%, deepseek-r1:70b 66.6%, claude 52.2% (n=90) | - | - | Synthetic movie KG (90 queries) | Table 2 shows per-model accuracies for explanation summaries | Table 2
Problem (perturbation) detection rate | o1-preview 88.0%, deepseek-reasoner-api 89.3%, claude 85.3%, o3-mini 77.3%, deepseek-r1:70b 68.0% (n=75) | - | - | Perturbed queries (synthetic benchmark) | Table 2 reports detection rates per model | Table 2

What To Try In 7 Days

Wire a simple LangChain + Neo4j flow and feed an LLM the KG schema to generate Cypher for 10 representative questions.

Add an explanation prompt that outputs a one-sentence summary plus flagged issues for each generated query.

Run a small perturbation test (flip relation, wrong node type) to measure your model's fault-detection vs false-positive tradeoff.
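For the perturbation test in the last step, two simple fault injectors are enough to get started. The sketch below is a regex-based approximation that only handles straightforward patterns; the function names are hypothetical, and the paper's own perturbation procedure may differ.

```python
import re


def flip_first_relationship(cypher: str) -> str:
    """Reverse the direction of the first `-[...]->` relationship pattern."""
    return re.sub(r"-(\[[^\]]*\])->", r"<-\1-", cypher, count=1)


def swap_label(cypher: str, old: str, new: str) -> str:
    """Replace every `:old` label with `:new`.

    Caveat: this also rewrites a relationship type that happens to share the name.
    """
    return re.sub(rf":{re.escape(old)}\b", f":{new}", cypher)


original = "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN m.title"
flipped = flip_first_relationship(original)
wrong_type = swap_label(original, "Movie", "Person")
print(flipped)
print(wrong_type)
```

Feed the original and each perturbed variant to the explanation prompt, then record whether the model flags an issue; the unperturbed originals give the false-positive side of the tradeoff.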

Agent Features

Memory
No persistent long-term memory; uses schema and current query context
Tool Use
Cypher query generation, Neo4j execution, LangChain prompts
Frameworks
LangChain
Is Agentic

Yes

Architectures
LLM-centered modular pipeline
Collaboration
Human-in-the-loop (iterative amendment)

Optimization Features

Token Efficiency
Schema-aware prompting to reduce irrelevant tokens
System Optimization
Post-processing to remove LLM artifacts and ensure valid Cypher
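
The post-processing step can be illustrated with a small cleanup function. This is a sketch under assumptions: it supposes the typical artifacts are Markdown code fences and surrounding chatter, and the helper names are hypothetical. The keyword check is only a cheap heuristic, not real validation; a stricter check could prefix the query with `EXPLAIN` and send it to the database, which plans the query without executing it.

```python
import re

TICKS = "`" * 3  # a Markdown code fence, built this way to keep the example fence-safe
FENCED = re.compile(TICKS + r"(?:cypher)?\s*(.*?)" + TICKS, re.DOTALL | re.IGNORECASE)
CLAUSE_START = re.compile(r"^(MATCH|OPTIONAL MATCH|CALL|WITH|UNWIND|RETURN)\b", re.IGNORECASE)


def extract_cypher(raw: str) -> str:
    """Pull the query out of a fenced block if present, else use the raw text."""
    match = FENCED.search(raw)
    text = match.group(1) if match else raw
    # Collapse whitespace and drop a trailing semicolon.
    # Caveat: this would also collapse repeated spaces inside string literals.
    return " ".join(text.split()).rstrip(";").strip()


def looks_like_cypher(query: str) -> bool:
    """Heuristic: does the text start with a common Cypher clause?"""
    return bool(CLAUSE_START.match(query))


raw_reply = (
    f"Here is the query:\n{TICKS}cypher\nMATCH (m:Movie)\nRETURN m.title;\n{TICKS}\n"
    "Hope that helps!"
)
query = extract_cypher(raw_reply)
print(query, looks_like_cypher(query))
```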

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

MaRDI public SPARQL/REST endpoints (paper states MaRDI exposes public endpoints)
https://hyena-project.com (Hyena project reference; access controlled)

Risks & Boundaries

Limitations

Synthetic benchmark is small (90 queries) and may not capture real-world schema complexity.

MaRDI and Hyena experiments use small, curated question sets (9 and 5 questions), limiting generalizability.

When Not To Use

When you need fully automated, zero-human intervention workflows.

For extremely large or rapidly changing graphs without schema constraints.

Failure Modes

One-sentence explanations drop numeric/time constraints (years omitted), harming faithfulness.

Flipped relationship directions and contradictory WHERE clauses can completely mislead the model.
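
The first failure mode lends itself to a cheap automated guard: compare the numeric literals in the query with those mentioned in the explanation. The check below is a hypothetical heuristic, not something the paper describes; it will miss numbers written out as words and will also report values such as a `LIMIT` that an explanation may reasonably omit.

```python
import re
from typing import Set

NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def missing_numbers(cypher: str, explanation: str) -> Set[str]:
    """Return numeric literals that appear in the query but not in the explanation."""
    return set(NUMBER.findall(cypher)) - set(NUMBER.findall(explanation))


cypher = "MATCH (m:Movie) WHERE m.year >= 1990 AND m.year < 2000 RETURN m.title"
explanation = "Returns the titles of movies released before 2000."
dropped = missing_numbers(cypher, explanation)
print(dropped)
```

A non-empty result is a signal to regenerate the explanation or to show the user the raw constraint alongside it.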

Core Entities

Models

o1-preview, o3-mini, deepseek-reasoner-api, deepseek-r1:70b, claude-3.7-sonnet, gpt-5.2, o1, o3, o4-mini, gemma3, qwq, phi4, llama3.3, nemotron

Metrics

Accuracy, problem detection rate, false positive rate

Datasets

Synthetic Movie KG (90-query benchmark)
MaRDI KG subgraph (software/publication slice)
Hyena KG (Ngorongoro field data)

Benchmarks

90-query Cypher explanation benchmark (movie KG)
MaRDI 9-question query set
Hyena 5-question expert set