Inject knowledge-graph vectors and correlation matrices into transformer layers to improve GLUE tasks.

June 23, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is straightforward and shows consistent GLUE gains across several transformers, but experiments are limited to GLUE, single-GPU runs, and summed embeddings from two KGs; real-world integration and robustness require more testing.

Citations: 3

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 50%

Authors

Kaushik Roy, Yuxin Zi, Vignesh Narayanan, Manas Gaur, Amit Sheth

Links

Abstract / PDF / Data

Why It Matters For Business

Injecting knowledge-graph embeddings into transformer internals can raise NLU accuracy and cut labeling needs; this helps deliver stronger models faster where domain or commonsense context matters.

Who Should Care

Summary TLDR

The paper defines a clear, modular way to add knowledge-graph information to transformer models. It compresses ConceptNet and WordNet into node vectors and a correlation matrix, then injects the vectors into latent token representations and the correlation matrix into self-attention matrices. The authors compare three strategies: shallow (vectors at the first block), semi-deep (vectors and attention at the first block), and deep (vectors and attention across all blocks). Deep infusion gives the largest gains on GLUE tasks and helps reduce labeled-data needs. They also propose two new diagnostics: CGKA, which checks link-prediction performance alongside GLUE performance, and DE@k, which measures performance with k% of the training data.
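The compression step can be pictured with a small sketch. This is an illustration only, not the authors' code: the toy embedding tables, the helper names (`node_vector`, `compress`), and the use of cosine similarity for the correlation matrix are all assumptions. The summary states only that ConceptNet and WordNet embeddings are summed into node vectors and that a correlation matrix is derived from them.

```python
import math

# Toy, made-up 3-d embeddings standing in for ConceptNet and WordNet vectors.
CONCEPTNET = {"bank": [0.9, 0.1, 0.0], "river": [0.2, 0.8, 0.1], "money": [0.8, 0.0, 0.3]}
WORDNET = {"bank": [0.1, 0.3, 0.2], "river": [0.0, 0.7, 0.1], "money": [0.6, 0.1, 0.1]}


def node_vector(token, dim=3):
    """Sum the two KG embeddings for a token; unknown tokens get zeros."""
    zero = [0.0] * dim
    a = CONCEPTNET.get(token, zero)
    b = WORDNET.get(token, zero)
    return [x + y for x, y in zip(a, b)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def compress(tokens):
    """Return (node vectors, token-by-token correlation matrix)."""
    vectors = [node_vector(t) for t in tokens]
    # Cosine similarity is an assumed choice of "correlation" for this sketch.
    corr = [[cosine(u, v) for v in vectors] for u in vectors]
    return vectors, corr


vectors, corr = compress(["bank", "river", "money", "the"])
```

Tokens with no entry in either graph (here "the") get a zero vector and zero correlation with every other token, so they contribute nothing when infused.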

Problem Statement

Transformers can overlook implicit context or lack it altogether, and as a result hallucinate or give unsafe or unhelpful outputs. Existing knowledge-augmentation methods are ad hoc: they add knowledge in different places without a systematic view, and without diagnostics to tell whether performance gains are real or come from exploiting dataset artifacts.

Main Contribution

A modular taxonomy of where to infuse knowledge in transformers: latent representations (vectors) vs inductive biases (self-attention matrices).

Three concrete infusion strategies: shallow (vectors at first block), semi-deep (vectors+attention at first block), deep (vectors+attention across all blocks).
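The difference between the three strategies can be sketched in a toy setting. The code below is a minimal illustration under stated assumptions, not the paper's implementation: it uses a single attention head with identity projections, adds the KG vectors to token states, and adds the correlation matrix to the attention scores before the softmax. The additive form and all function names (`add_vectors`, `attention`, `run_blocks`) are hypothetical choices made for this sketch.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def add_vectors(hidden, kg_vectors):
    """Latent-representation infusion: add a KG vector to each token state."""
    return [[h + k for h, k in zip(hrow, krow)] for hrow, krow in zip(hidden, kg_vectors)]


def attention(hidden, kg_corr=None):
    """Single-head self-attention with identity projections (a simplification).

    If kg_corr is given, it is added to the scaled dot-product scores
    before the softmax (an assumed form of attention infusion).
    """
    dim = len(hidden[0])
    out = []
    for i, q in enumerate(hidden):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in hidden]
        if kg_corr is not None:
            scores = [s + c for s, c in zip(scores, kg_corr[i])]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, hidden)) for d in range(dim)])
    return out


def run_blocks(hidden, kg_vectors, kg_corr, strategy, num_blocks=3):
    """Apply toy blocks under 'none', 'shallow', 'semi-deep' or 'deep' infusion."""
    if strategy not in {"none", "shallow", "semi-deep", "deep"}:
        raise ValueError(f"unknown strategy: {strategy}")
    for block in range(num_blocks):
        first = block == 0
        infuse_vectors = strategy == "deep" or (first and strategy in {"shallow", "semi-deep"})
        infuse_attention = strategy == "deep" or (first and strategy == "semi-deep")
        if infuse_vectors:
            hidden = add_vectors(hidden, kg_vectors)
        hidden = attention(hidden, kg_corr if infuse_attention else None)
    return hidden


# Two tokens with 2-d states; all values are made up for illustration.
hidden = [[1.0, 0.0], [0.0, 1.0]]
kg_vectors = [[0.5, 0.0], [0.0, 0.5]]
kg_corr = [[1.0, 0.0], [0.0, 1.0]]
outputs = {
    s: run_blocks(hidden, kg_vectors, kg_corr, s)
    for s in ("none", "shallow", "semi-deep", "deep")
}
```

The only thing that changes between strategies is where the two infusion steps fire: the first block only, or every block. In a real model the same switch would sit inside each transformer layer rather than in a loop over toy blocks.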

Key Findings

Deep infusion (vectors + attention across all blocks) raises GLUE task scores over baseline XLNet.

Numbers: XLNet on MNLI, baseline 72.3% -> deep 88.53% (+16.23pp) (Table 1)

Practical Use: If you can modify a model's attention and representations, infusing knowledge across layers can give substantial accuracy gains on NLU tasks; test deep infusion first.

Evidence Ref: Table 1

Deep infusion improves multiple GLUE tasks, not just one.

Numbers: XLNet on QNLI, 84.17% -> 92.3% (+8.13pp); on QQP, 74.79% -> 80.9% (+6.11pp) (Table 1)

Practical Use: Knowledge infusion helps both entailment and similarity tasks; use it when your workload spans varied NLU problems.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 88.53% | 72.3% | +16.23pp | MNLI (GLUE) | Table 1 shows XLNet baseline vs deep infusion | Table 1
Accuracy | 92.3% | 84.17% | +8.13pp | QNLI (GLUE) | Table 1 shows XLNet baseline vs deep infusion | Table 1
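The deltas are reported in percentage points (pp), which is the absolute difference between two accuracies rather than a relative change. A short sketch makes the distinction concrete; the `deltas` helper is a hypothetical name, and the input figures are the XLNet numbers quoted in this summary.

```python
# (task, baseline accuracy %, deep-infusion accuracy %) as quoted in this summary.
REPORTED = [
    ("MNLI", 72.3, 88.53),
    ("QNLI", 84.17, 92.3),
    ("QQP", 74.79, 80.9),
]


def deltas(baseline, infused):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = infused - baseline
    relative_pct = 100.0 * absolute_pp / baseline
    return round(absolute_pp, 2), round(relative_pct, 2)


for task, baseline, infused in REPORTED:
    pp, rel = deltas(baseline, infused)
    print(f"{task}: +{pp}pp absolute, +{rel}% relative to baseline")
```

The relative figure is always larger than the pp figure when the baseline is below 100%, so it is worth checking which one a reported gain refers to.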

What To Try In 7 Days

Run a shallow test: add per-token ConceptNet/WordNet vectors to the first transformer block and measure GLUE or your task.

Try semi-deep: add the node-embedding correlation matrix to the first block's attention and compare to shallow.

If you can, run deep infusion across all blocks on a small model and measure accuracy and DE@50 to check data savings.
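A DE@k check like the one in the last step can be set up with a small harness. This is a sketch under assumptions: the summary describes DE@k only as performance with k% of the training data, so the subsampling scheme, the function names (`subsample`, `de_at_k`), and the toy majority-class model below are hypothetical stand-ins for your own training and evaluation code.

```python
import random


def subsample(train_set, k_percent, seed=0):
    """Return a reproducible random subset with k_percent of the examples."""
    if not 0 < k_percent <= 100:
        raise ValueError("k_percent must be in (0, 100]")
    size = max(1, round(len(train_set) * k_percent / 100))
    return random.Random(seed).sample(train_set, size)


def de_at_k(train_set, test_set, train_fn, eval_fn, k_percent, seed=0):
    """Train on k_percent of the data and return the evaluation score."""
    model = train_fn(subsample(train_set, k_percent, seed))
    return eval_fn(model, test_set)


# Toy stand-ins: a majority-class "model" over (text, label) pairs.
def train_majority(examples):
    labels = [label for _, label in examples]
    return max(set(labels), key=labels.count)


def accuracy(model, examples):
    return sum(label == model for _, label in examples) / len(examples)


train = [(f"example {i}", 1) for i in range(80)] + [(f"example {i}", 0) for i in range(80, 100)]
test = [("a", 1), ("b", 1), ("c", 0), ("d", 1)]
score_full = de_at_k(train, test, train_majority, accuracy, k_percent=100)
score_half = de_at_k(train, test, train_majority, accuracy, k_percent=50)
```

To compare strategies, run the same harness with and without infusion at the same `k_percent` and seed, so that any gap reflects the model change rather than a different data sample.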

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only evaluated on GLUE tasks; no downstream production case studies.

Knowledge compression uses summed ConceptNet and WordNet embeddings, which may blur distinct senses.

When Not To Use

When no reliable knowledge graph exists for your domain.

When inference latency or memory constraints forbid modifying attention matrices at runtime.

Failure Modes

Model may exploit spurious correlations between KG features and labels (overfitting to artifacts).

Errors or biases in the knowledge graphs will be injected into model predictions.

Core Entities

Models

XLNet, BERT, RoBERTa, ELECTRA, Longformer, KSAT (Knowledge-Infused Self-Attention Transformer)

Metrics

Accuracy, F1, CGKA, DE@k

Datasets

GLUE, MNLI, QNLI, WNLI, RTE, QQP

Benchmarks

GLUE