Overview
Production Readiness
0.3
Novelty Score
0.45
Cost Impact Score
0.5
Citation Count
7
Why It Matters For Business
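Adding a knowledge graph layer (ontology + mappings) substantially improved GPT-4's answer accuracy on enterprise SQL data in this benchmark, from 16.7% to 54.2% overall. The benefit matters most for normalized schemas and KPI-style questions, where SQL-only prompting failed entirely once a question touched more than four tables.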
Summary TLDR
The paper builds a small enterprise Text-to-SQL benchmark (insurance domain, 13-table subset of the OMG P&C model, 43 questions, ontology + mappings) and tests GPT-4 with simple zero-shot prompts. Prompting GPT-4 to write SQL directly against the relational schema gave 16.7% average accuracy. Prompting GPT-4 to write SPARQL over a knowledge-graph view of the same data gave 54.2% accuracy (a 37.5 percentage-point improvement). SQL degrades fastest as schema complexity grows: it dropped to 0% for high-schema questions, while SPARQL still answered roughly 36–39% of them correctly. Caveats: single model (GPT-4), zero-shot only, small synthetic dataset, and a subset of the full schema.
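A zero-shot comparison of this kind can be sketched as two prompt builders, one carrying the SQL DDL and one carrying the ontology. The template wording, the function names, and the toy schema and ontology below are illustrative assumptions, not the prompts or artifacts used in the paper.

```python
# Hypothetical zero-shot prompt builders. The wording is an assumption,
# not the prompt text used in the paper.


def build_sql_prompt(ddl: str, question: str) -> str:
    """Zero-shot SQL prompt: schema DDL plus the question, no examples."""
    return (
        "Given the database described by the following DDL:\n"
        f"{ddl.strip()}\n\n"
        "Write a SQL query that answers the question below. "
        "Return only the query.\n"
        f"Question: {question.strip()}"
    )


def build_sparql_prompt(ontology_ttl: str, question: str) -> str:
    """Zero-shot SPARQL prompt: OWL ontology (Turtle) plus the question."""
    return (
        "Given the OWL ontology below, in Turtle syntax:\n"
        f"{ontology_ttl.strip()}\n\n"
        "Write a SPARQL query that answers the question below. "
        "Return only the query.\n"
        f"Question: {question.strip()}"
    )


# Toy inputs, invented for illustration only.
ddl = "CREATE TABLE policy (policy_id INTEGER PRIMARY KEY, policy_number TEXT);"
ontology = """
@prefix ex: <http://example.com/insurance#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:Policy a owl:Class .
ex:policyNumber a owl:DatatypeProperty ; rdfs:domain ex:Policy .
"""
question = "How many policies do we have?"

sql_prompt = build_sql_prompt(ddl, question)
sparql_prompt = build_sparql_prompt(ontology, question)
```

Keeping the question identical across both prompts means the only variable is the context the model sees: table definitions in one case, business concepts and relationships in the other.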
Problem Statement
Existing Text-to-SQL benchmarks do not match enterprise realities: large, normalized schemas, business metrics/KPIs, and a separate business context layer (ontology, mappings) are missing. The paper asks how well LLMs answer enterprise questions on SQL and whether adding a knowledge graph (KG) context helps.
Main Contribution
A reproducible enterprise-style benchmark: subset of OMG Property & Casualty schema, 43 natural-language questions, and a context layer (OWL ontology + R2RML mappings).
A controlled experiment comparing GPT-4 zero-shot SQL generation vs. SPARQL over a KG, with execution-based accuracy scoring.
Open artifacts and code published to reproduce and extend the benchmark (DDL, CSVs, OWL, R2RML, reference SQL/SPARQL, GitHub).
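Execution-based scoring of the kind listed above can be sketched as follows: run the generated query and the reference query, then compare the returned rows. This is a minimal sketch using SQLite and an invented toy table; the paper's exact matching rules (for example, how column order or partial answers are treated) may differ, and all names here are hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the rows as a multiset, or None if the query fails to run."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, generated_sql, reference_sql):
    """True when both queries run and return the same rows (order ignored)."""
    expected = run_query(conn, reference_sql)
    if expected is None:
        raise ValueError("reference query failed to execute")
    actual = run_query(conn, generated_sql)
    return actual is not None and actual == expected


def execution_accuracy(conn, pairs):
    """Fraction of (generated, reference) pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, gen, ref) for gen, ref in pairs)
    return hits / len(pairs)


# Toy in-memory table, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE policy (policy_id INTEGER PRIMARY KEY, policy_number TEXT);
    INSERT INTO policy VALUES (1, 'P-100'), (2, 'P-200');
    """
)

pairs = [
    # Same rows in a different order: counts as a match.
    ("SELECT policy_number FROM policy",
     "SELECT policy_number FROM policy ORDER BY policy_id DESC"),
    # Non-existent column: the generated query fails, so no match.
    ("SELECT policy_no FROM policy",
     "SELECT policy_number FROM policy"),
]
score = execution_accuracy(conn, pairs)
```

Comparing rows as a multiset ignores ordering but still distinguishes duplicates, which matters for aggregate-free reporting queries.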
Key Findings
Knowledge-graph context raised GPT-4 execution accuracy from 16.7% to 54.2%.
For questions with low question complexity and low schema complexity, SPARQL reached 71.1% accuracy versus 25.5% for SQL.
When questions touch many tables (>4), SQL accuracy dropped to 0% while SPARQL still reached ~36–39%.
Generated answers often showed partial correctness (subset of columns) and syntactic errors on date arithmetic.
SQL failures were dominated by hallucinations (columns, values, joins); SPARQL failures were path/direction errors in the ontology.
Results
Accuracy (AOEA) by quadrant: Low Question / Low Schema; High Question / Low Schema; Low Question / High Schema; High Question / High Schema
Who Should Care
What To Try In 7 Days
Run the provided benchmark repo against a small slice of your schema to measure current LLM accuracy.
Create an OWL ontology and simple R2RML mappings for your top 5 business concepts and test SPARQL vs SQL prompts.
Add post-checks for missing columns and identifier-to-label mapping to catch partial answers.
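The post-check in the last step can be sketched as a small function that compares the columns a query returned against the columns the question calls for, and flags identifier columns that came back without their human-readable label. The function name, the column names, and the `id_to_label` mapping are hypothetical and would need to come from your own schema.

```python
def check_answer_columns(returned_columns, expected_columns, id_to_label):
    """Flag missing columns and identifiers returned in place of labels.

    id_to_label maps an identifier column to the label column a reader
    would expect alongside or instead of it (user-supplied, hypothetical).
    """
    returned = {col.lower() for col in returned_columns}
    issues = []

    # Partial answers: an expected column is absent from the result.
    for col in expected_columns:
        if col.lower() not in returned:
            issues.append(f"missing column: {col}")

    # Identifier returned without the label that makes it readable.
    for id_col, label_col in id_to_label.items():
        if id_col.lower() in returned and label_col.lower() not in returned:
            issues.append(
                f"identifier without label: {id_col} (expected {label_col})"
            )
    return issues


# Toy example: the query returned an agent ID where a name was wanted.
issues = check_answer_columns(
    returned_columns=["agent_id", "total_premium"],
    expected_columns=["agent_name", "total_premium"],
    id_to_label={"agent_id": "agent_name"},
)
```

A check like this only inspects result headers, so it catches the shape of a partial answer, not whether the values themselves are right.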
Reproducibility
Data Urls
- https://www.omg.org/cgi-bin/doc?dtc/13-04-15.ddl
- data.world (benchmark workspace with CSVs, OWL, R2RML, reference queries)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single model (GPT-4) and single prompting style (zero-shot) tested
- Subset of full OMG P&C schema (13 of 199 tables) and small synthetic data per table
- 43 questions only; lacks broad coverage of filters, time windows, ambiguous phrasings
- Date functions and some operations are not fully tested; mapping patterns are mostly limited to one-to-one
When Not To Use
- When you need guaranteed, auditable correctness without human verification
- When your schema and business semantics are not modeled or mapped to a KG
- For real-time low-latency use if virtualization adds unacceptable overhead
Failure Modes
- Column name hallucinations (SQL produced non-existent columns)
- Value hallucinations used as filters
- Wrong or invented joins in SQL
- Incorrect path or direction in SPARQL property traversal
- Partially correct results that return identifiers instead of labels
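Some of these failures can be caught before a query is ever run against real data. The sketch below compiles a generated SQL query against an empty copy of the schema using SQLite's `EXPLAIN`, which surfaces non-existent columns and tables without executing anything. It assumes the DDL and the query are valid in SQLite's dialect; the function name and toy schema are hypothetical. It does not detect hallucinated filter values or joins that are syntactically valid but semantically wrong.

```python
import sqlite3


def find_schema_errors(ddl, generated_sql):
    """Compile a query against an empty copy of the schema.

    Returns None if the query compiles, otherwise SQLite's error
    message (for example "no such column: ...").
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(ddl)  # errors in the DDL itself propagate
        try:
            # EXPLAIN prepares the statement without reading table data.
            conn.execute("EXPLAIN " + generated_sql)
        except sqlite3.Error as exc:
            return str(exc)
        return None
    finally:
        conn.close()


# Toy schema, invented for illustration.
ddl = """
CREATE TABLE policy (policy_id INTEGER PRIMARY KEY, policy_number TEXT);
CREATE TABLE claim (claim_id INTEGER PRIMARY KEY, policy_id INTEGER);
"""

ok = find_schema_errors(
    ddl,
    "SELECT p.policy_number FROM policy p JOIN claim c ON c.policy_id = p.policy_id",
)
bad_column = find_schema_errors(ddl, "SELECT policy_no FROM policy")
bad_table = find_schema_errors(ddl, "SELECT * FROM claims")
```

Because the check runs on an empty in-memory database, it is cheap enough to apply to every generated query before execution.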
Core Entities
Models
- GPT-4
Metrics
- Accuracy
Datasets
- OMG Property & Casualty Data Model (subset, 13 tables)
- Generated sample CSV data (benchmark instantiation)
Benchmarks
- This paper's enterprise Text-to-SQL benchmark
Context Entities
Datasets
- OWL ontology (insurance)
- R2RML mappings

