Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
This pipeline lets non-programmers query curated Knowledge Graphs with auditable Cypher and guided edits, reducing developer bottlenecks and increasing trust for knowledge-intensive applications.
Summary TLDR
The authors build an interactive pipeline in which an LLM generates executable Cypher queries for a Knowledge Graph, explains each query in plain language, and accepts natural-language amendments to edit it. They evaluate explanation and fault-detection abilities on a 90-query synthetic movie KG and test query generation on two real KGs (MaRDI and a Hyena research KG). Some models exceed 70% one-sentence explanation accuracy, and a few combine high fault detection with low false positives. Domain shift matters: many models do well on the MaRDI software KG, but fewer succeed consistently on expert hyena questions.
Problem Statement
LLMs are fluent but can hallucinate, be outdated, and hide reasoning. Text-based RAG struggles with multi-hop queries. Knowledge Graphs are precise but require learning query languages (Cypher/SPARQL), creating a usability gap for non-experts who need verifiable, multi-step answers.
Main Contribution
An end-to-end, modular pipeline where an LLM generates Cypher queries, explains them in plain language, executes them on Neo4j, and edits them via natural-language amendments.
A controlled 90-query benchmark (synthetic movie KG) to measure explanation quality and fault detection across multiple LLMs, plus two real-world case studies on MaRDI and a Hyena KG.
Empirical insights into common LLM failure modes for Cypher (dropping explicit years, flipped relationships, false positives) and the value of a human-in-the-loop amendment loop.
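The generate, explain, amend, and execute steps listed above can be sketched as a small loop. This is a minimal illustration rather than the authors' implementation: the `Session` class, the prompt wording, and the `llm` and `run_query` callables are all hypothetical stand-ins (in practice `llm` would wrap a model call and `run_query` a Neo4j session).

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

# Hypothetical interfaces (not from the paper): `LLM` maps a prompt to a
# completion string, `Runner` executes Cypher and returns result rows.
LLM = Callable[[str], str]
Runner = Callable[[str], List[dict]]


@dataclass
class Session:
    schema: str
    question: str
    cypher: str = ""
    explanation: str = ""
    history: List[str] = field(default_factory=list)  # every query version


def generate(llm: LLM, s: Session) -> None:
    """Ask the model for an initial query, given the schema and question."""
    s.cypher = llm(f"Schema:\n{s.schema}\nWrite Cypher for: {s.question}").strip()
    s.history.append(s.cypher)


def explain(llm: LLM, s: Session) -> None:
    """Ask the model for a one-sentence plain-language summary of the query."""
    s.explanation = llm(f"Explain in one sentence:\n{s.cypher}").strip()


def amend(llm: LLM, s: Session, request: str) -> None:
    """Ask the model to edit the current query according to a user request."""
    prompt = f"Schema:\n{s.schema}\nQuery:\n{s.cypher}\nChange: {request}"
    s.cypher = llm(prompt).strip()
    s.history.append(s.cypher)


def run_pipeline(
    llm: LLM,
    run_query: Runner,
    schema: str,
    question: str,
    amendments: Iterable[str] = (),
) -> Tuple[Session, List[dict]]:
    """Generate, explain, apply each amendment, then execute the final query."""
    s = Session(schema, question)
    generate(llm, s)
    explain(llm, s)
    for request in amendments:
        amend(llm, s, request)
        explain(llm, s)  # re-explain so the user can audit the edited query
    return s, run_query(s.cypher)
```

Keeping every query version in `history` is one way to make the amendment loop auditable; the executor is only called once, on the query the user has finished reviewing.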
Key Findings
Some LLMs exceed 70% accuracy on correct and complete one-sentence explanations for the synthetic 90-query benchmark.
Certain models reliably detect injected query faults while avoiding false alarms.
A common recurrent error is omission of explicit years from one-sentence summaries.
On a software-focused slice of the MaRDI KG, many LLMs produced correct Cypher queries with few or no amendments.
Domain shift reduces success: expert hyena questions were harder for many models.
Results
Accuracy
Problem (perturbation) detection rate
False positive avoidance on correct queries
Who Should Care
What To Try In 7 Days
Wire a simple LangChain + Neo4j flow and feed an LLM the KG schema to generate Cypher for 10 representative questions.
Add an explanation prompt that outputs a one-sentence summary plus flagged issues for each generated query.
Run a small perturbation test (flip relation, wrong node type) to measure your model's fault-detection vs false-positive tradeoff.
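For the perturbation test in the last step, the two suggested faults (flipped relation, wrong node type) can be injected with simple string rewrites. The sketch below is an assumption about how one might do this, not the paper's tooling; the helper names are hypothetical, and the regexes only handle simple patterns such as `(a:Person)-[:ACTED_IN]->(m:Movie)`.

```python
import re

# Matches the join between two nodes in a simple left-to-right pattern,
# e.g. the ")-[:ACTED_IN]->(" part of (p:Person)-[:ACTED_IN]->(m:Movie).
_FORWARD = re.compile(r"\)-\[([^\]]*)\]->\(")


def flip_first_relationship(cypher: str) -> str:
    """Reverse the direction of the first `-[...]->` relationship found."""
    return _FORWARD.sub(lambda m: f")<-[{m.group(1)}]-(", cypher, count=1)


def swap_label(cypher: str, old: str, new: str) -> str:
    """Replace every `:old` label with `:new` (whole-word match only).

    Caveat: a relationship type with the same name as `old` would also be
    rewritten, so pick labels that do not collide with relationship types.
    """
    return re.sub(rf":{re.escape(old)}\b", f":{new}", cypher)
```

Each perturbed query, together with its unmodified original, can then be sent to the explanation prompt to see whether the model flags the fault or raises a false alarm.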
Agent Features
Memory
- No persistent long-term memory; uses schema and current query context
Tool Use
- Cypher query generation
- Neo4j execution
- LangChain prompts
Frameworks
- LangChain
Is Agentic
true
Architectures
- LLM-centered modular pipeline
Collaboration
- Human-in-the-loop (iterative amendment)
Optimization Features
Token Efficiency
- Schema-aware prompting to reduce irrelevant tokens
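One way to realize schema-aware prompting is to render a compact text description of the graph schema and place it ahead of the question. The function names, the schema representation, and the prompt wording below are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List, Tuple


def render_schema(
    nodes: Dict[str, List[str]],
    relationships: List[Tuple[str, str, str]],
) -> str:
    """Render node labels, properties, and relationship types as short text."""
    lines = ["Node labels and properties:"]
    for label, props in sorted(nodes.items()):
        lines.append(f"  (:{label} {{{', '.join(props)}}})")
    lines.append("Relationships:")
    for start, rel, end in relationships:
        lines.append(f"  (:{start})-[:{rel}]->(:{end})")
    return "\n".join(lines)


def build_prompt(schema_text: str, question: str) -> str:
    """Combine the schema and the user question into one generation prompt."""
    return (
        "Use only the labels, properties, and relationships below.\n"
        f"{schema_text}\n"
        f"Question: {question}\n"
        "Return a single Cypher query and nothing else."
    )
```

Passing only the slice of the schema that is relevant to the question keeps the prompt short, which is the token saving this item refers to.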
System Optimization
- Post-processing to remove LLM artifacts and ensure valid Cypher
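A typical post-processing step strips markdown fences and conversational text that models often wrap around a query. The following sketch assumes read-only queries that begin with one of a few common clause keywords; the function name and the keyword list are assumptions, and a real deployment would still validate the result against the database.

```python
import re

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable

# Captures the body of the first fenced block, with an optional "cypher" tag.
_FENCED = re.compile(
    re.escape(FENCE) + r"(?:cypher)?\s*\n(.*?)" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)
# Assumed set of clauses a read-only query may start with.
_START = re.compile(
    r"^\s*(MATCH|OPTIONAL MATCH|WITH|CALL|UNWIND|RETURN)\b", re.IGNORECASE
)


def clean_cypher(raw: str) -> str:
    """Extract a bare Cypher query from raw LLM output."""
    m = _FENCED.search(raw)
    text = m.group(1) if m else raw
    lines = text.strip().splitlines()
    # Drop leading chatter until a line starts with a Cypher clause keyword.
    while lines and not _START.match(lines[0]):
        lines.pop(0)
    if not lines:
        raise ValueError("no Cypher clause found in model output")
    return "\n".join(lines).strip().rstrip(";")
```

Raising an error when no clause is found lets the pipeline re-prompt the model instead of sending unusable text to the database.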
Reproducibility
Data Urls
- MaRDI public SPARQL/REST endpoints (paper states MaRDI exposes public endpoints)
- https://hyena-project.com (Hyena project reference; access controlled)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Synthetic benchmark is small (90 queries) and may not capture real-world schema complexity.
- MaRDI and Hyena experiments use small, curated question sets (9 and 5 questions), limiting generalizability.
- Human scoring of explanation completeness introduces subjectivity.
- LLM performance is a snapshot and may change as models evolve.
When Not To Use
- When you need fully automated workflows with zero human intervention.
- For extremely large or rapidly changing graphs without schema constraints.
- If strict data privacy prevents sending schema or queries to hosted LLM APIs (use local/private models).
Failure Modes
- One-sentence explanations drop numeric/time constraints (years omitted), harming faithfulness.
- Flipped relationship directions and contradictory WHERE clauses can completely mislead the model.
- Some models produce many false alarms, making human triage costly.
- Performance can collapse under domain shift; a model that works on one KG may fail on another.
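The first failure mode above (years dropped from one-sentence explanations) lends itself to a cheap automatic check: compare the four-digit years in the query with those in the explanation. This is a heuristic sketch with a hypothetical function name, not a method from the paper, and it only recognizes years from 1800 to 2099 written as literals.

```python
import re
from typing import List

# Four-digit years between 1800 and 2099, as standalone tokens.
_YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def missing_years(cypher: str, explanation: str) -> List[str]:
    """Return years that appear in the query but not in its explanation."""
    in_query = set(_YEAR.findall(cypher))
    in_text = set(_YEAR.findall(explanation))
    return sorted(in_query - in_text)
```

A non-empty result is a signal to regenerate the explanation or to show the user the omitted constraint explicitly.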
Core Entities
Models
- o1-preview
- o3-mini
- deepseek-reasoner-api
- deepseek-r1:70b
- claude-3.7-sonnet
- gpt-5.2
- o1
- o3
- o4-mini
- gemma3
- qwq
- phi4
- llama3.3
- nemotron
Metrics
- Accuracy
- Problem detection rate
- False positive rate
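As a worked illustration of the two fault-detection metrics, the sketch below computes them from a list of trials. The `Trial` record and function names are hypothetical; the definitions used are the standard ones (flagged perturbed queries over all perturbed queries, and flagged correct queries over all correct queries), which is an assumption about how the paper scores them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    perturbed: bool  # True if a fault was injected into the query
    flagged: bool    # True if the model reported a problem


def detection_rate(trials: List[Trial]) -> float:
    """Share of perturbed queries that the model flagged."""
    faulty = [t for t in trials if t.perturbed]
    if not faulty:
        return float("nan")
    return sum(t.flagged for t in faulty) / len(faulty)


def false_positive_rate(trials: List[Trial]) -> float:
    """Share of correct (unperturbed) queries that the model flagged."""
    clean = [t for t in trials if not t.perturbed]
    if not clean:
        return float("nan")
    return sum(t.flagged for t in clean) / len(clean)
```

False positive avoidance, as listed under Results, would then be one minus the false positive rate.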
Datasets
- Synthetic Movie KG (90-query benchmark)
- MaRDI KG subgraph (software/publication slice)
- Hyena KG (Ngorongoro field data)
Benchmarks
- 90-query Cypher explanation benchmark (movie KG)
- MaRDI 9-question query set
- Hyena 5-question expert set

