Knowledge graph triples GPT-4 accuracy for enterprise QA (16.7% → 54.2%)

November 13, 2023 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.45

Cost Impact Score

0.5

Citation Count

7

Authors

Juan Sequeda, Dean Allemang, Bryon Jacob

Links

Abstract / PDF

Why It Matters For Business

Adding a knowledge graph layer (ontology + mappings) substantially improves LLM answer accuracy over enterprise SQL databases (16.7% → 54.2% in this benchmark); expect the largest benefit on normalized schemas and KPI-style questions.

Summary TLDR

The paper builds a small enterprise Text-to-SQL benchmark (insurance domain, 13-table subset of the OMG P&C model, 43 questions, ontology + mappings) and tests GPT-4 with simple zero-shot prompts. Prompting GPT-4 to write SQL directly against the database schema gave 16.7% average accuracy. Prompting it to generate SPARQL over a knowledge-graph view of the same data gave 54.2% (a 37.5 percentage-point improvement). The contrast is starkest as schema complexity grows: SQL dropped to 0% for high-schema questions, while SPARQL still reached 35.7%–38.7%. Caveats: single model (GPT-4), zero-shot only, small synthetic dataset, and a subset of the full schema.
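The headline figures can be read two ways: as an absolute gain in percentage points or as a ratio. The short calculation below is only a sanity check on the numbers quoted in this summary; the variable names are illustrative.

```python
# Figures quoted in the summary above (GPT-4, zero-shot).
sql_accuracy = 16.7      # percent, SQL generated directly against the schema
sparql_accuracy = 54.2   # percent, SPARQL over the knowledge-graph view

# Absolute gain in percentage points.
gain_points = round(sparql_accuracy - sql_accuracy, 1)

# Relative gain: how many times more accurate the knowledge-graph route was.
ratio = round(sparql_accuracy / sql_accuracy, 2)

print(f"gain: {gain_points} percentage points, ratio: {ratio}x")
```

The ratio of a little over three is what the title's "triples" refers to.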

Problem Statement

Existing Text-to-SQL benchmarks do not match enterprise realities: large, normalized schemas, business metrics/KPIs, and a separate business context layer (ontology, mappings) are missing. The paper asks how well LLMs answer enterprise questions on SQL and whether adding a knowledge graph (KG) context helps.

Main Contribution

A reproducible enterprise-style benchmark: subset of OMG Property & Casualty schema, 43 natural-language questions, and a context layer (OWL ontology + R2RML mappings).

A controlled experiment comparing GPT-4 zero-shot SQL generation vs. SPARQL over a KG, with execution-based accuracy scoring.

Open artifacts and code published to reproduce and extend the benchmark (DDL, CSVs, OWL, R2RML, reference SQL/SPARQL, GitHub).
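To make "execution-based accuracy scoring" concrete, here is a minimal sketch that runs a generated query and a reference query and compares the returned rows. It is an assumption about how such scoring can work, not the paper's exact rules; the toy two-table schema, the query strings, and the function name `execution_match` are all hypothetical and are not taken from the benchmark's DDL.

```python
import sqlite3
from collections import Counter

def execution_match(conn, generated_sql, reference_sql):
    """Return True if both queries run and return the same rows (order ignored)."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    expected = conn.execute(reference_sql).fetchall()
    return Counter(got) == Counter(expected)

# Hypothetical toy schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE policy (policy_id INTEGER PRIMARY KEY, policy_number TEXT);
    CREATE TABLE claim (claim_id INTEGER PRIMARY KEY, policy_id INTEGER, amount REAL);
    INSERT INTO policy VALUES (1, 'P-100'), (2, 'P-200');
    INSERT INTO claim VALUES (10, 1, 500.0), (11, 1, 250.0), (12, 2, 900.0);
""")

reference = """
    SELECT p.policy_number, SUM(c.amount)
    FROM policy p JOIN claim c ON c.policy_id = p.policy_id
    GROUP BY p.policy_number
"""
# Same answer, written differently and returned in a different order.
good = """
    SELECT policy_number, SUM(amount)
    FROM claim JOIN policy USING (policy_id)
    GROUP BY policy_number ORDER BY policy_number DESC
"""
# References a column that does not exist in the schema.
hallucinated = "SELECT policy_number, SUM(claim_total) FROM policy GROUP BY policy_number"

results = [execution_match(conn, q, reference) for q in (good, hallucinated)]
accuracy = sum(results) / len(results)
print(results, accuracy)
```

Comparing rows as a multiset ignores ordering but is strict about columns, so an answer that returns only a subset of the expected columns would be scored as wrong under this sketch.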

Key Findings

Knowledge-graph context raised GPT-4 execution accuracy from 16.7% to 54.2%.

Numbers: SQL 16.7% → SPARQL 54.2% (Table 1)

For questions with low question complexity and low schema complexity, SPARQL hit 71.1% vs SQL 25.5%.

Numbers: LQ/LS: SQL 25.5% → SPARQL 71.1% (Table 1)

When questions touch many tables (>4), SQL accuracy dropped to 0% while SPARQL still reached ~36–39%.

Numbers: High-schema quadrants: SQL 0%; SPARQL 35.7%–38.7% (Table 1)

Generated answers often showed partial correctness (subset of columns) and syntactic errors on date arithmetic.

SQL failures were dominated by hallucinations (columns, values, joins); SPARQL failures were path/direction errors in the ontology.
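Column and table hallucinations are the easiest of these failures to detect mechanically, because the database itself rejects them. The sketch below asks SQLite to compile a generated query with `EXPLAIN` (without running it) and classifies the error message. The schema, the function name `classify_sql_error`, and the category labels are hypothetical; value hallucinations and wrong joins still compile and would not be caught this way.

```python
import sqlite3

def classify_sql_error(conn, sql):
    """Compile a query without running it and report schema hallucinations."""
    try:
        conn.execute("EXPLAIN " + sql)
    except sqlite3.OperationalError as exc:
        message = str(exc)
        if message.startswith("no such column"):
            return "hallucinated column"
        if message.startswith("no such table"):
            return "hallucinated table"
        return "other error"
    return "compiles"

# Hypothetical one-table schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claim (claim_id INTEGER PRIMARY KEY, amount REAL)")

checks = {
    "valid": classify_sql_error(conn, "SELECT SUM(amount) FROM claim"),
    "bad_column": classify_sql_error(conn, "SELECT SUM(claim_total) FROM claim"),
    "bad_table": classify_sql_error(conn, "SELECT * FROM claims_history"),
}
print(checks)
```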

Results

Accuracy (SQL)

Value: 16.7%

Accuracy (SPARQL)

Value: 54.2%

Baseline: SQL 16.7%

AOEA Low Question / Low Schema

Value: SQL 25.5% | SPARQL 71.1%

Baseline: SQL 25.5%

AOEA High Question / Low Schema

Value: SQL 37.4% | SPARQL 66.9%

Baseline: SQL 37.4%

AOEA Low Question / High Schema

Value: SQL 0% | SPARQL 35.7%

Baseline: SQL 0%

AOEA High Question / High Schema

Value: SQL 0% | SPARQL 38.7%

Baseline: SQL 0%

Who Should Care

What To Try In 7 Days

Run the provided benchmark repo against a small slice of your schema to measure current LLM accuracy.

Create an OWL ontology and simple R2RML mappings for your top 5 business concepts and test SPARQL vs SQL prompts.

Add post-checks for missing columns and identifier-to-label mapping to catch partial answers.
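The post-check in the last step could look roughly like the sketch below. It assumes you know which columns a correct answer should contain, and it assumes a naming convention in which an identifier column such as `agent_id` has a human-readable counterpart called `agent_name`. Both the convention and the function name `check_partial_answer` are hypothetical; adapt them to your own schema.

```python
def check_partial_answer(result_columns, expected_columns, id_suffix="_id"):
    """Flag expected columns that are absent and identifier columns returned without a label."""
    returned = {c.lower() for c in result_columns}
    missing = [c for c in expected_columns if c.lower() not in returned]
    # Assumed convention: 'agent_id' should be accompanied by 'agent_name'.
    unlabeled_ids = [
        c for c in result_columns
        if c.lower().endswith(id_suffix)
        and c.lower()[: -len(id_suffix)] + "_name" not in returned
    ]
    return {"missing": missing, "unlabeled_ids": unlabeled_ids}

report = check_partial_answer(
    result_columns=["policy_number", "agent_id"],
    expected_columns=["policy_number", "agent_name", "total_premium"],
)
print(report)
```

A non-empty `missing` or `unlabeled_ids` list marks the answer as partial, so it can be routed for review instead of being shown as complete.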

Reproducibility

Data URLs

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single model (GPT-4) and single prompting style (zero-shot) tested
  • Subset of full OMG P&C schema (13 of 199 tables) and small synthetic data per table
  • 43 questions only; lacks broad coverage of filters, time windows, ambiguous phrasings
  • Date functions and some operations not fully tested; mapping patterns limited to mostly 1-1

When Not To Use

  • When you need guaranteed, auditable correctness without human verification
  • When your schema and business semantics are not modeled or mapped to a KG
  • For real-time low-latency use if virtualization adds unacceptable overhead

Failure Modes

  • Column name hallucinations (SQL produced non-existent columns)
  • Value hallucinations used as filters
  • Wrong or invented joins in SQL
  • Incorrect path or direction in SPARQL property traversal
  • Partially correct results that return identifiers instead of labels
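The path/direction failure is easy to miss because the query stays syntactically valid and simply returns nothing. The toy example below imitates a single property traversal over a handful of triples in plain Python. The triples and the property names (`againstPolicy`, `soldByAgent`) are hypothetical and are not taken from the benchmark ontology.

```python
# Hypothetical triples of the form (subject, predicate, object).
TRIPLES = {
    ("claim:1", "againstPolicy", "policy:7"),
    ("claim:2", "againstPolicy", "policy:7"),
    ("policy:7", "soldByAgent", "agent:3"),
}

def step(nodes, predicate, inverse=False):
    """Follow one property from a set of nodes, forward or inverse."""
    if inverse:
        return {s for (s, p, o) in TRIPLES if p == predicate and o in nodes}
    return {o for (s, p, o) in TRIPLES if p == predicate and s in nodes}

# Question: which claims belong to policy 7?
# againstPolicy points from claim to policy, so it must be walked inversely.
correct = step({"policy:7"}, "againstPolicy", inverse=True)

# Walking it forward from the policy is valid but matches no triples.
wrong_direction = step({"policy:7"}, "againstPolicy")

print(sorted(correct), sorted(wrong_direction))
```

An empty result for a question that should have answers is therefore a useful signal to re-check property directions against the ontology.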

Core Entities

Models

  • GPT-4

Metrics

  • Accuracy

Datasets

  • OMG Property & Casualty Data Model (subset, 13 tables)
  • Generated sample CSV data (benchmark instantiation)

Benchmarks

  • This paper's enterprise Text-to-SQL benchmark

Context Entities

Datasets

  • OWL ontology (insurance)
  • R2RML mappings