Inject knowledge-graph vectors and correlation matrices into transformer layers to improve GLUE tasks.

Overview

Decision Snapshot: Needs Validation

The method is straightforward and shows consistent GLUE gains across several transformers, but experiments are limited to GLUE, single-GPU runs, and summed embeddings from two KGs; real-world integration and robustness require more testing.

Citations: 3

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 50%

Authors

Kaushik Roy, Yuxin Zi, Vignesh Narayanan, Manas Gaur, Amit Sheth

Links

Abstract / PDF / Data

Why It Matters For Business

Injecting knowledge-graph embeddings into transformer internals can raise NLU accuracy and cut labeling needs; this helps deliver stronger models faster where domain or commonsense context matters.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist, Engineering Lead

Summary TLDR

The paper defines a clear, modular way to add knowledge-graph information to transformer models. It compresses ConceptNet and WordNet into node vectors and a correlation matrix, then injects the vectors into latent token representations and the correlation matrix into the self-attention matrices. The authors compare three strategies: shallow (vectors at the first block), semi-deep (vectors + attention at the first block), and deep (vectors + attention across all blocks). They show that deep infusion gives the largest gains on GLUE tasks and reduces the amount of labeled data needed. They also propose two new diagnostics: CGKA (checks link-prediction + GLUE performance) and DE@k (performance with k% of the training data).
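The two injection points described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes both injections are simple additions (node vectors added to token representations, the correlation matrix added to the raw attention scores), uses identity query/key/value projections for brevity, and all function names and toy values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def infuse_vectors(tokens, kg_vectors):
    """Add each token's KG node vector to its representation (assumed additive)."""
    return [[t + k for t, k in zip(tok, kg)] for tok, kg in zip(tokens, kg_vectors)]


def infused_attention(tokens, kg_corr=None):
    """Single-head self-attention with identity Q/K/V projections.

    If kg_corr (an n x n matrix) is given, it is added to the scaled
    dot-product scores before the softmax (assumed additive bias).
    """
    n, d = len(tokens), len(tokens[0])
    scores = [
        [sum(a * b for a, b in zip(tokens[i], tokens[j])) / math.sqrt(d) for j in range(n)]
        for i in range(n)
    ]
    if kg_corr is not None:
        scores = [[s + c for s, c in zip(s_row, c_row)] for s_row, c_row in zip(scores, kg_corr)]
    weights = [softmax(row) for row in scores]
    outputs = [
        [sum(weights[i][j] * tokens[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]
    return outputs, weights


# Toy input: three tokens with 2-dimensional representations (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kg_vectors = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]
# Made-up correlation matrix saying tokens 0 and 1 are related in the KG.
kg_corr = [[0.0, 2.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

hidden = infuse_vectors(tokens, kg_vectors)
_, plain_weights = infused_attention(hidden)
_, biased_weights = infused_attention(hidden, kg_corr)
```

In this toy setup the biased attention weight from token 0 to token 1 is larger than the unbiased one, which is the intended effect of using the correlation matrix as an inductive bias.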

Problem Statement

Transformers can overlook context that is implicit or absent from the input, and as a result hallucinate or give unsafe or unhelpful outputs. Existing knowledge-augmentation methods are ad hoc: they add knowledge at different places without a systematic view, and they lack diagnostics that tell whether performance gains are real or merely exploit dataset artifacts.

Main Contribution

A modular taxonomy of where to infuse knowledge in transformers: latent representations (vectors) vs inductive biases (self-attention matrices).

Three concrete infusion strategies: shallow (vectors at first block), semi-deep (vectors+attention at first block), deep (vectors+attention across all blocks).
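One way to read the three strategies is as a per-block schedule of what gets infused where. The sketch below encodes that reading; the function name, the dictionary keys, and the 0-based block indexing are illustrative assumptions, not names from the paper.

```python
def infusion_plan(strategy, num_blocks):
    """Return, for each transformer block, which knowledge inputs are infused.

    strategy: "shallow", "semi-deep", or "deep" (as listed above).
    Block 0 is the first block.
    """
    if strategy not in {"shallow", "semi-deep", "deep"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    plan = []
    for block in range(num_blocks):
        is_first = block == 0
        deep = strategy == "deep"
        plan.append({
            # Node vectors: first block for every strategy, all blocks for deep.
            "vectors": is_first or deep,
            # Correlation matrix in attention: first block for semi-deep, all blocks for deep.
            "attention": deep or (is_first and strategy == "semi-deep"),
        })
    return plan


for name in ("shallow", "semi-deep", "deep"):
    print(name, infusion_plan(name, num_blocks=3))
```

Expressing the strategies as data makes it easy to run the same model code under all three settings and compare them.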

Key Findings

Deep infusion (vectors + attention across all blocks) raises GLUE task scores over baseline XLNet.

Numbers: XLNet MNLI: baseline 72.3% -> deep 88.53% (+16.23pp) (Table 1)

Practical Use: If you can modify a model's attention and representations, infusing knowledge across layers can give substantial accuracy gains on NLU tasks; test deep infusion first.

Evidence Ref: Table 1

Deep infusion improves multiple GLUE tasks, not just one.

Numbers: XLNet QNLI 84.17% -> 92.3% (+8.13pp); QQP 74.79% -> 80.9% (+6.11pp) (Table 1)

Practical Use: Knowledge infusion helps both entailment and similarity tasks; use it when your workload spans varied NLU problems.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	88.53%	72.3%	+16.23pp	MNLI (GLUE)	Table 1 shows XLNet baseline vs deep-infusion	Table 1
Accuracy	92.3%	84.17%	+8.13pp	QNLI (GLUE)	Table 1 shows XLNet baseline vs deep-infusion	Table 1
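The deltas are percentage-point differences (infused accuracy minus baseline accuracy), not relative gains. A short check of the arithmetic, using only the XLNet figures quoted in this summary (the helper name is illustrative):

```python
# (task, baseline accuracy %, deep-infusion accuracy %) as quoted in this summary.
reported = [
    ("MNLI", 72.3, 88.53),
    ("QNLI", 84.17, 92.3),
    ("QQP", 74.79, 80.9),
]


def delta_pp(baseline, infused):
    """Percentage-point difference, rounded to two decimals."""
    return round(infused - baseline, 2)


deltas = {task: delta_pp(baseline, infused) for task, baseline, infused in reported}
for task, value in deltas.items():
    print(f"{task}: +{value:.2f}pp")
# Matches the reported +16.23pp, +8.13pp and +6.11pp.
```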

What To Try In 7 Days

Run a shallow test: add per-token ConceptNet/WordNet vectors to the first transformer block and measure GLUE or your task.

Try semi-deep: add the node-embedding correlation matrix to the first block's attention and compare to shallow.

If you can, run deep infusion across all blocks on a small model and measure accuracy and DE@50 to check data savings.
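For the DE@50 check, the summary only says that DE@k is performance with k% of the training data. A minimal sketch under that reading follows; the random (unstratified) subsampling, the fixed seed, and the `train_and_eval` callback are assumptions made for illustration, not details from the paper.

```python
import random


def subsample_for_de_at_k(examples, k_percent, seed=0):
    """Return a reproducible random subset holding k% of the training examples."""
    if not 0 < k_percent <= 100:
        raise ValueError("k_percent must be in (0, 100]")
    rng = random.Random(seed)
    size = max(1, round(len(examples) * k_percent / 100))
    return rng.sample(examples, size)


def de_at_k(train_examples, k_percent, train_and_eval, seed=0):
    """Train on k% of the data and return the resulting evaluation score.

    train_and_eval is a hypothetical callback: it takes a list of training
    examples and returns a score such as dev-set accuracy.
    """
    subset = subsample_for_de_at_k(train_examples, k_percent, seed)
    return train_and_eval(subset)


# Toy usage with a stand-in callback that reports the fraction of data seen.
toy_train = list(range(10))
score = de_at_k(toy_train, 50, train_and_eval=lambda subset: len(subset) / len(toy_train))
```

Running the same call for the baseline and for the infused model, at the same k and seed, gives a like-for-like view of how much labeled data the infusion saves. For imbalanced tasks a stratified sample may be a better choice than the plain random sample used here.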

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://conceptnet.io
https://wordnet.princeton.edu

Risks & Boundaries

Limitations

Only evaluated on GLUE tasks; no downstream production case studies.

Knowledge compression uses summed ConceptNet and WordNet embeddings, which may blur distinct senses.
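The blurring concern can be seen in a small sketch of the compression step. The summing of ConceptNet and WordNet vectors comes from the summary above; treating the correlation matrix as pairwise cosine similarity is an assumption, as are the function names and toy vectors.

```python
import math


def sum_embeddings(conceptnet_vec, wordnet_vec):
    """Combine the two KG embeddings of one node by element-wise summation."""
    return [c + w for c, w in zip(conceptnet_vec, wordnet_vec)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def correlation_matrix(node_vectors):
    """Pairwise similarity between node vectors (assumed definition)."""
    return [[cosine(u, v) for v in node_vectors] for u in node_vectors]


# Two nodes whose per-source embeddings differ but whose sums coincide.
node_a = sum_embeddings([1.0, 0.0], [0.0, 1.0])
node_b = sum_embeddings([0.0, 1.0], [1.0, 0.0])
node_c = sum_embeddings([1.0, 0.0], [1.0, 0.0])

corr = correlation_matrix([node_a, node_b, node_c])
```

Here `node_a` and `node_b` end up identical after summation even though the two knowledge graphs described them differently, so everything downstream, including the correlation matrix, can no longer tell them apart.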

When Not To Use

When no reliable knowledge graph exists for your domain.

When inference latency or memory constraints forbid modifying attention matrices at runtime.

Failure Modes

Model may exploit spurious correlations between KG features and labels (overfitting to artifacts).

Errors or biases in the knowledge graphs will be injected into model predictions.

Core Entities

Models

XLNet, BERT, RoBERTa, ELECTRA, Longformer, KSAT (Knowledge-Infused Self-Attention Transformer)

Metrics

Accuracy, F1, CGKA, DE@k

Datasets

GLUE, MNLI, QNLI, WNLI, RTE, QQP

Benchmarks

GLUE
