LLMs generate, explain, and iteratively fix Cypher queries so non-experts can query graph databases in plain English

February 5, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Larissa Pusch, Alexandre Courtiol, Tim Conrad

Links

Abstract / PDF

Why It Matters For Business

This pipeline lets non-programmers query curated Knowledge Graphs with auditable Cypher and guided edits, reducing developer bottlenecks and increasing trust for knowledge-intensive applications.

Summary TLDR

The authors build an interactive pipeline in which an LLM generates executable Cypher queries for a Knowledge Graph, explains each query in plain language, and accepts natural-language amendments to edit it. They evaluate explanation and fault-detection abilities on a 90-query synthetic movie KG and test query generation on two real KGs (MaRDI and a Hyena research KG). Three models (o1-preview, deepseek-reasoner-api, o3-mini) exceed 70% one-sentence explanation accuracy, and o1-preview and deepseek-reasoner-api combine fault detection above 85% with false-positive rates below 10%. Domain shift matters: many models do well on the MaRDI software KG, but fewer succeed consistently on expert Hyena questions.

Problem Statement

LLMs are fluent but can hallucinate, be outdated, and hide reasoning. Text-based RAG struggles with multi-hop queries. Knowledge Graphs are precise but require learning query languages (Cypher/SPARQL), creating a usability gap for non-experts who need verifiable, multi-step answers.

Main Contribution

An end-to-end, modular pipeline where an LLM generates Cypher queries, explains them in plain language, executes them on Neo4j, and edits them via natural-language amendments.

A controlled 90-query benchmark (synthetic movie KG) to measure explanation quality and fault detection across multiple LLMs, plus two real-world case studies on MaRDI and a Hyena KG.

Empirical insights into common LLM failure modes for Cypher (dropping explicit years, flipped relationships, false positives) and the value of a human-in-the-loop amendment loop.
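The generate, explain, execute, and amend loop from the first contribution can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `run_pipeline`, `Turn`, the prompt wording, and the `llm` and `execute` callables are all hypothetical names chosen here, and in practice `execute` would wrap a Neo4j session.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical type aliases; not taken from the paper's code.
LLM = Callable[[str], str]          # prompt -> completion
Executor = Callable[[str], list]    # Cypher text -> result rows


@dataclass
class Turn:
    cypher: str
    explanation: str
    rows: list


def run_pipeline(question: str, schema: str, llm: LLM, execute: Executor,
                 amendments: Optional[List[str]] = None) -> List[Turn]:
    """Generate a query, explain and run it, then apply each amendment in order."""
    history: List[Turn] = []
    cypher = llm(f"Schema:\n{schema}\n\nWrite a Cypher query for: {question}")
    # The first pass (amendment is None) handles the initial query.
    for amendment in [None, *(amendments or [])]:
        if amendment is not None:
            cypher = llm(f"Schema:\n{schema}\n\nCurrent query:\n{cypher}\n\n"
                         f"Revise the query so that: {amendment}")
        explanation = llm(
            f"Explain in one sentence what this Cypher query returns:\n{cypher}")
        history.append(Turn(cypher, explanation, execute(cypher)))
    return history
```

Passing the model and the database as plain callables keeps the sketch testable with stubs and leaves the choice of LLM client and driver open.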

Key Findings

Some LLMs produce correct and complete one-sentence explanations on the synthetic 90-query benchmark.

Numbers: o1-preview, deepseek-reasoner-api, and o3-mini reached ≥70% accuracy (n=90).

Certain models reliably detect injected query faults while avoiding false alarms.

Numbers: o1-preview and deepseek-reasoner-api flagged >85% of injected faults (n=75 perturbed queries) and had false-positive rates <10% (n=15 correct queries).

A common recurrent error is omission of explicit years from one-sentence summaries.

Numbers: Year omissions accounted for >50% of errors for o1-preview and o3-mini, and >80% for deepseek-reasoner-api, among failed cases.

On a software-focused slice of the MaRDI KG, many LLMs produced correct Cypher queries with few or no amendments.

Numbers: o3-mini and GPT-5.2 solved 9/9 tasks; several models reached 7–8/9 within at most three attempts (n=9 questions, 14 models).

Domain shift reduces success: expert hyena questions were harder for many models.

Numbers: Only o3 and deepseek-reasoner-api achieved 5/5 on the Hyena KG; several models scored 0/5 (n=5 expert questions).

Results

Accuracy

Value: o1-preview 76.6%, deepseek-reasoner-api 73.3%, o3-mini 71.1%, deepseek-r1:70b 66.6%, claude 52.2% (n=90)

Problem (perturbation) detection rate

Value: o1-preview 88.0%, deepseek-reasoner-api 89.3%, claude 85.3%, o3-mini 77.3%, deepseek-r1:70b 68.0% (n=75)

False positive avoidance on correct queries

Value: deepseek-reasoner-api 100%, o3-mini 100%, o1-preview 93.3%, claude 66.6%, deepseek-r1:70b 53.3% (n=15)

Accuracy

Value: Top models: gpt-5.2 88.9% (8/9), o3-mini 88.9% (8/9); several models 77.8% (7/9) or 66.7% (6/9)

Accuracy

Value: o3-mini and GPT-5.2 100% (9/9); many models ≥77.8% within 3 tries

Accuracy

Value: o3 and deepseek-reasoner-api 100% (5/5); several models 0–80% (n=5)

Who Should Care

What To Try In 7 Days

Wire a simple LangChain + Neo4j flow and feed an LLM the KG schema to generate Cypher for 10 representative questions.

Add an explanation prompt that outputs a one-sentence summary plus flagged issues for each generated query.

Run a small perturbation test (flip relation, wrong node type) to measure your model's fault-detection vs false-positive tradeoff.
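For the perturbation test, one possible starting point is a helper that flips the direction of a relationship in a known-good query, plus a scorer that turns the model's yes/no fault flags into the two rates discussed in the findings. This is a rough sketch under assumptions: the function names are hypothetical, the regex only handles simple `-[...]->` and `<-[...]-` patterns, and the paper's own perturbation procedure may differ.

```python
import re


def flip_first_relationship(cypher: str) -> str:
    """Reverse the first directed relationship, e.g. -[:R]-> becomes <-[:R]-."""
    forward = re.search(r"-\[([^\]]*)\]->", cypher)
    if forward:
        return (cypher[:forward.start()] + f"<-[{forward.group(1)}]-"
                + cypher[forward.end():])
    backward = re.search(r"<-\[([^\]]*)\]-", cypher)
    if backward:
        return (cypher[:backward.start()] + f"-[{backward.group(1)}]->"
                + cypher[backward.end():])
    return cypher  # no directed relationship found; query is returned unchanged


def score_detector(flags_on_perturbed, flags_on_correct):
    """Return (detection rate, false-positive rate) from lists of booleans.

    Each boolean records whether the model flagged a problem for that query.
    """
    detection = sum(flags_on_perturbed) / len(flags_on_perturbed)
    false_positive = sum(flags_on_correct) / len(flags_on_correct)
    return detection, false_positive
```

Keeping a set of unperturbed queries alongside the perturbed ones matters, because a model that flags everything scores a perfect detection rate while being useless in practice.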

Agent Features

Memory

  • No persistent long-term memory; uses schema and current query context

Tool Use

  • Cypher query generation
  • Neo4j execution
  • LangChain prompts

Frameworks

  • LangChain

Is Agentic

true

Architectures

  • LLM-centered modular pipeline

Collaboration

  • Human-in-the-loop (iterative amendment)

Optimization Features

Token Efficiency

  • Schema-aware prompting to reduce irrelevant tokens
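As an illustration of schema-aware prompting, the snippet below renders a graph schema into a compact text block and places it ahead of the user's question. The schema representation (a dict of labels and a list of relationship triples), the function names, and the prompt wording are assumptions made for this sketch, not the prompt used in the paper.

```python
def render_schema(nodes: dict, relationships: list) -> str:
    """Render labels, properties, and relationship types as compact text."""
    lines = ["Node labels and properties:"]
    for label, props in sorted(nodes.items()):
        lines.append(f"  (:{label} {{{', '.join(props)}}})")
    lines.append("Relationships:")
    for start, rel, end in relationships:
        lines.append(f"  (:{start})-[:{rel}]->(:{end})")
    return "\n".join(lines)


def build_prompt(question: str, nodes: dict, relationships: list) -> str:
    """Combine a short instruction, the schema, and the question."""
    return (
        "You translate questions into Cypher. Use only the labels, properties "
        "and relationship types listed in the schema.\n\n"
        f"{render_schema(nodes, relationships)}\n\n"
        f"Question: {question}\nCypher:"
    )
```

For a large graph, passing only the labels and relationships relevant to the question would shrink the prompt further; how to select that subset is left open here.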

System Optimization

  • Post-processing to remove LLM artifacts and ensure valid Cypher
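A post-processing step of this kind might look like the sketch below, which strips Markdown code fences and chatty preamble from a model response and keeps the text starting at the first read-only Cypher keyword. The function name, the keyword list, and the cleanup rules are assumptions for illustration; the check is purely textual, so real validation would still need the database (for example by running the query under `EXPLAIN`).

```python
import re

FENCE_MARK = "`" * 3  # Markdown code-fence delimiter
FENCE = re.compile(
    re.escape(FENCE_MARK) + r"(?:cypher)?\s*(.*?)" + re.escape(FENCE_MARK),
    re.DOTALL | re.IGNORECASE,
)
# Read-only clauses a generated query is assumed to start with.
KEYWORDS = ("MATCH", "OPTIONAL MATCH", "WITH", "UNWIND", "CALL", "RETURN")


def extract_cypher(raw: str) -> str:
    """Return the Cypher statement embedded in a raw LLM response."""
    fenced = FENCE.search(raw)
    text = fenced.group(1) if fenced else raw
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    for i, ln in enumerate(lines):
        if ln.upper().startswith(KEYWORDS):
            # Join from the first keyword line on and drop a trailing semicolon.
            return " ".join(lines[i:]).rstrip(";")
    raise ValueError("no Cypher statement found in response")
```

When the response is not fenced, any prose after the query is kept, so this sketch works best when the prompt asks the model to return the query inside a code fence.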

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Synthetic benchmark is small (90 queries) and may not capture real-world schema complexity.
  • MaRDI and Hyena experiments use small, curated question sets (9 and 5 questions), limiting generalizability.
  • Human scoring of explanation completeness introduces subjectivity.
  • LLM performance is a snapshot and may change as models evolve.

When Not To Use

  • When you need fully automated workflows with no human intervention.
  • For extremely large or rapidly changing graphs without schema constraints.
  • If strict data privacy prevents sending schema or queries to hosted LLM APIs (use local/private models).

Failure Modes

  • One-sentence explanations drop numeric/time constraints (years omitted), harming faithfulness.
  • Flipped relationship directions and contradictory WHERE clauses can completely mislead the model.
  • Some models produce many false alarms, making human triage costly.
  • Performance can collapse under domain shift; a model that works on one KG may fail on another.
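The year-omission failure mode lends itself to a cheap automatic check: compare the four-digit years in the query with those in its explanation. The helper below is a hypothetical sketch, not part of the paper's pipeline, and its regex assumes years between 1800 and 2099 written as plain integers.

```python
import re

# Matches four-digit years from 1800 to 2099 (an assumption of this sketch).
YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def missing_years(cypher: str, explanation: str) -> set:
    """Return years that appear in the query but not in its explanation."""
    return set(YEAR.findall(cypher)) - set(YEAR.findall(explanation))
```

A non-empty result is a signal to regenerate the explanation or to show the reviewer which constraint was dropped.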

Core Entities

Models

  • o1-preview
  • o3-mini
  • deepseek-reasoner-api
  • deepseek-r1:70b
  • claude-3.7-sonnet
  • gpt-5.2
  • o1
  • o3
  • o4-mini
  • gemma3
  • qwq
  • phi4
  • llama3.3
  • nemotron

Metrics

  • Accuracy
  • Problem detection rate
  • False positive rate

Datasets

  • Synthetic Movie KG (90-query benchmark)
  • MaRDI KG subgraph (software/publication slice)
  • Hyena KG (Ngorongoro field data)

Benchmarks

  • 90-query Cypher explanation benchmark (movie KG)
  • MaRDI 9-question query set
  • Hyena 5-question expert set