E-KELL: a KG-backed LLM system that guides decisions with standards-based evidence to cut hallucinations

November 15, 2023 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

4

Authors

Minze Chen, Zhenxiang Tao, Weitong Tang, Tingxin Qin, Rui Yang, Chunli Zhu

Links

Abstract / PDF

Why It Matters For Business

For safety-critical operations, an E-KELL-style KG+LLM system reduces hallucination and makes answers traceable to the underlying standards. That lowers legal and operational risk and makes guidance faster to produce and easier to audit.

Summary TLDR

E-KELL is a prototype emergency decision support system that stores Chinese emergency standards as a structured knowledge graph (2,264 triples) and uses a prompt-chain to make an LLM reason over the relevant KG segments. In a hazardous-chemical leakage case study with 10 representative queries, E-KELL complied with the standards on every query, made no factual errors, and averaged about 9/10 in ratings from 19 domain experts on comprehensibility, accuracy, conciseness, and instructiveness. The approach reduces LLM hallucination and yields auditable answers. However, building and updating the KG required semi-automatic extraction plus manual curation, and the system currently covers only a limited set of documents.

Problem Statement

Emergency decisions must follow laws and technical standards, but raw LLM outputs can hallucinate and miss logical links that are spread across fragmented documents (tables, diagrams). Emergency decision support systems (EDSS) need fast, auditable, standards-compliant guidance; current LLMs alone lack reliable referencing and structured reasoning over heterogeneous regulatory texts.

Main Contribution

A practical EDSS framework (E-KELL) that stores emergency standards in a knowledge graph (KG) and guides an LLM to reason over KG segments via a prompt-chain.

A semi-automatic pipeline to extract triples from Chinese emergency documents and a curated KG (2,264 triples) used as the authoritative knowledge base.

A prototype and case study (hazardous chemical leakage) showing improved factual correctness and expert-rated usability versus baseline LLMs.
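The KG-plus-prompt-chain idea can be illustrated with a minimal sketch. It assumes a toy in-memory triple list and a naive entity-mention lookup in place of the prototype's actual retrieval; the `Triple` class, the function names, and the triples themselves are hypothetical and not taken from the paper or from any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str


# Hypothetical triples for illustration only; not content from the real KG.
KG = [
    Triple("chlorine leak", "requires", "evacuate upwind"),
    Triple("chlorine leak", "requires", "wear breathing apparatus"),
    Triple("ammonia leak", "requires", "dilute with water spray"),
]


def retrieve_segment(query: str, kg: list[Triple]) -> list[Triple]:
    """Return the triples whose head entity is mentioned in the query."""
    q = query.lower()
    return [t for t in kg if t.head.lower() in q]


def build_prompt(query: str, segment: list[Triple]) -> str:
    """Assemble one prompt-chain step: retrieved facts first, then the question."""
    facts = "\n".join(f"- ({t.head}, {t.relation}, {t.tail})" for t in segment)
    return (
        "Answer using only the facts below.\n"
        f"Facts:\n{facts}\n"
        f"Question: {query}"
    )


query = "What should responders do during a chlorine leak?"
segment = retrieve_segment(query, KG)
prompt = build_prompt(query, segment)
print(prompt)
```

Only the two triples about the mentioned entity reach the prompt, so the model is steered toward the stored standard rather than its own recall. A real system would replace the substring match with entity linking or vector retrieval.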

Key Findings

E-KELL produced factually correct and standards-compliant answers on the 10 evaluated queries.

Numbers: Factually correct 10/10; In compliance with standards 10/10 (Table 1)

Domain experts rated E-KELL higher on usability metrics than the baselines.

Numbers: Comprehensibility 9.06; Accuracy 9.09; Conciseness 9.03; Instructiveness 9.06 (Table 2)

The knowledge graph used in the prototype contains 2,264 curated triples built from official Chinese documents.

Numbers: Knowledge graph size = 2,264 triples (Section 4)

The prototype relies on semi-automatic extraction plus manual fusion and lacks wide document coverage and real-time data.

Results

Grammatically correct (10 queries)

Value: E-KELL 10/10; ChatGLM-6B 10/10; GPT-3.5 10/10

Factually correct (10 queries)

Value: E-KELL 10/10; ChatGLM-6B 8/10; GPT-3.5 9/10

In compliance with standards/regulations (10 queries)

Value: E-KELL 10/10; ChatGLM-6B 4/10; GPT-3.5 7/10

Expert subjective scores (average)

Value: E-KELL (Comprehensibility 9.06; Accuracy 9.09; Conciseness 9.03; Instructiveness 9.06)

Baseline: ChatGLM-6B and GPT-3.5 scores are reported in Table 2
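As a quick check on the figures above, the per-system counts can be turned into rates and the four E-KELL expert scores averaged. This is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
# Counts out of 10 queries, as listed in the Results above.
N_QUERIES = 10
factual = {"E-KELL": 10, "ChatGLM-6B": 8, "GPT-3.5": 9}
compliance = {"E-KELL": 10, "ChatGLM-6B": 4, "GPT-3.5": 7}

for name in factual:
    f_rate = factual[name] / N_QUERIES
    c_rate = compliance[name] / N_QUERIES
    print(f"{name}: factual {f_rate:.0%}, compliant {c_rate:.0%}")

# E-KELL expert scores: comprehensibility, accuracy, conciseness, instructiveness.
expert_scores = [9.06, 9.09, 9.03, 9.06]
mean_score = sum(expert_scores) / len(expert_scores)
print(f"E-KELL mean expert score: {mean_score:.2f}")
```

The largest gap is in standards compliance (10 versus 4 and 7 out of 10), and the mean expert score of 9.06 is consistent with the "about 9/10" summary. With only 10 queries, these rates are indicative rather than statistically robust.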

Who Should Care

What To Try In 7 Days

Extract 1–2 critical local standards and build a tiny KG (10–100 triples) for a frequent emergency scenario.

Connect that KG to an LLM via a retrieval index (e.g., LlamaIndex), then run 10 representative queries against both the KG-backed setup and the plain LLM to compare factual compliance.

Publish a prompt template that forces the model to cite its source triples, and iterate on the template based on user feedback.
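One possible shape for a citation-forcing template, together with a check that the cited triple IDs actually exist, is sketched below. The template wording, the `T1`-style ID scheme, the helper names, and the sample triples are all assumptions made for illustration; the paper's own templates are not reproduced here.

```python
import re

# Hypothetical triple IDs and contents, for illustration only.
TRIPLES = {
    "T1": ("leak site", "requires", "cordon off area"),
    "T2": ("responders", "must wear", "protective equipment"),
}

PROMPT_TEMPLATE = (
    "You are an emergency decision assistant.\n"
    "Use only the numbered triples below. After every claim, cite the "
    "supporting triple ID in square brackets, e.g. [T1].\n"
    "If the triples do not cover the question, reply 'Not covered'.\n\n"
    "Triples:\n{triples}\n\nQuestion: {question}"
)


def render_prompt(question, triples):
    """Fill the template with one '[ID] head | relation | tail' line per triple."""
    lines = "\n".join(
        f"[{tid}] {h} | {r} | {t}" for tid, (h, r, t) in triples.items()
    )
    return PROMPT_TEMPLATE.format(triples=lines, question=question)


def check_citations(answer, triples):
    """Split the IDs cited in an answer into known and unknown ones."""
    cited = set(re.findall(r"\[(T\d+)\]", answer))
    known = set(triples)
    return cited & known, cited - known


prompt = render_prompt("How should the leak site be secured?", TRIPLES)

# A made-up model reply, used only to exercise the checker.
sample_answer = "Cordon off the area [T1]. Notify the regional office [T9]."
valid, unknown = check_citations(sample_answer, TRIPLES)
print("valid:", valid, "unknown:", unknown)
```

An answer that cites an unknown ID, or cites nothing at all, is a cheap signal that the model drifted away from the KG and is worth flagging during the 10-query comparison.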

Agent Features

Tool Use

  • LlamaIndex (vector retrieval)
  • OCR for document ingestion
  • Mixed Reality UI for frontline

Frameworks

  • LLM + Knowledge Graph prompt-chain

Optimization Features

Infra Optimization

  • Local deployment on NVIDIA A100

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Limited document coverage: prototype built on a small set of Chinese standards and quick references.
  • KG construction required substantial manual curation; automatic fusion was insufficient.
  • Prompt templates and logical decomposition need further testing across wider query types.
  • No real-time sensor integration or multimodal inputs in current prototype.

When Not To Use

  • As the sole or authoritative decision-maker without human review.
  • For emergencies outside the documents and standards loaded into the KG.
  • Where real-time sensor data or image/video evidence is the primary decision input (not yet integrated).

Failure Modes

  • Incomplete or outdated KG leads to incorrect or non-compliant advice.
  • Retrieval misses relevant triples, causing the LLM to hallucinate from its base weights.
  • Poor prompt decomposition yields wrong logical queries over the KG.
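The retrieval-miss failure mode suggests a simple guard: if no triple matches the query well enough, abstain rather than let the model answer from its own weights. The sketch below is an assumption-laden illustration, not the prototype's behavior; word overlap stands in for vector similarity, and the threshold, the function names, and the `llm` callable are hypothetical.

```python
def overlap_score(query, triple):
    """Fraction of query words found in the triple (crude stand-in for vector similarity)."""
    q_words = set(query.lower().split())
    t_words = set(" ".join(triple).lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


def answer_or_abstain(query, kg, llm, min_score=0.3):
    """Call the LLM only when at least one triple clears the score threshold."""
    hits = [t for t in kg if overlap_score(query, t) >= min_score]
    if not hits:
        # No supporting triples: refuse instead of answering from base weights.
        return "No matching standard in the knowledge base; escalate to a human expert."
    return llm(query, hits)


# Hypothetical triple and a fake LLM callable, for illustration only.
kg = [("chlorine leak", "requires", "upwind evacuation")]


def fake_llm(query, hits):
    return f"answered with {len(hits)} triple(s)"


covered = answer_or_abstain("chlorine leak response", kg, fake_llm)
uncovered = answer_or_abstain("earthquake shelter capacity", kg, fake_llm)
print(covered)
print(uncovered)
```

The same pattern also addresses the "emergencies outside the loaded documents" boundary: an out-of-scope query produces an explicit escalation message instead of a plausible but unsupported answer.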

Core Entities

Models

  • ChatGLM-6B
  • GPT-3.5

Metrics

  • Accuracy
  • Objective attribute scores (grammatical, factual, compliance)