Inject knowledge-graph vectors and correlation matrices into transformer layers to improve GLUE tasks.

June 23, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.3

Citation Count

3

Authors

Kaushik Roy, Yuxin Zi, Vignesh Narayanan, Manas Gaur, Amit Sheth

Links

Abstract / PDF

Why It Matters For Business

Injecting knowledge-graph embeddings into transformer internals can raise NLU accuracy and cut labeling needs; this helps deliver stronger models faster where domain or commonsense context matters.

Summary TLDR

The paper defines a modular way to add knowledge-graph information to transformer models. It compresses ConceptNet and WordNet into per-token node vectors and a node-embedding correlation matrix, then injects the vectors into latent token representations and the correlation matrix into self-attention matrices. The authors compare three strategies: shallow (vectors at the first block), semi-deep (vectors plus attention at the first block), and deep (vectors plus attention across all blocks). Deep infusion gives the largest gains on GLUE tasks and reduces labeled-data needs. They also propose two diagnostics: CGKA, which combines graph-encoder link-prediction performance with GLUE performance, and DE@k, which measures performance when training on k% of the data.
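The three strategies differ only in which blocks receive which kind of knowledge, so they can be written down as a small lookup. The following is a minimal sketch; the names `InfusionPlan` and `make_plan` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfusionPlan:
    """Which transformer blocks (0-indexed) receive each kind of knowledge."""
    vector_blocks: frozenset     # blocks whose token states get KG vectors
    attention_blocks: frozenset  # blocks whose attention gets the KG matrix


def make_plan(strategy: str, num_blocks: int) -> InfusionPlan:
    """Map a strategy name from the summary onto block indices."""
    first = frozenset({0})
    every = frozenset(range(num_blocks))
    none = frozenset()
    if strategy == "shallow":      # vectors at the first block only
        return InfusionPlan(first, none)
    if strategy == "semi-deep":    # vectors + attention at the first block
        return InfusionPlan(first, first)
    if strategy == "deep":         # vectors + attention at every block
        return InfusionPlan(every, every)
    raise ValueError(f"unknown strategy: {strategy!r}")


for name in ("shallow", "semi-deep", "deep"):
    plan = make_plan(name, num_blocks=4)
    print(name, sorted(plan.vector_blocks), sorted(plan.attention_blocks))
```

Keeping the plan separate from the model code makes it easy to compare the three strategies with one training script.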

Problem Statement

Transformers can fail to account for implicit or unstated context, and can therefore hallucinate or give unsafe or unhelpful outputs. Existing knowledge-augmentation methods are ad hoc: they add knowledge at different points in the model without a systematic framework, and without diagnostics that show whether performance gains are real or come from exploiting dataset artifacts.

Main Contribution

A modular taxonomy of where to infuse knowledge in transformers: latent representations (vectors) vs inductive biases (self-attention matrices).

Three concrete infusion strategies: shallow (vectors at first block), semi-deep (vectors+attention at first block), deep (vectors+attention across all blocks).

A simple compression pipeline: create node embeddings from ConceptNet and WordNet, sum per-token, and form a node-embedding correlation matrix for attention infusion.

Two evaluation diagnostics: CGKA (combines graph-encoder link-prediction and GLUE accuracy) and DE@k (data-efficiency when training with k% data).

Empirical comparison on multiple transformer families (XLNet, BERT, RoBERTa, ELECTRA, Longformer) on GLUE showing consistent gains, with deep infusion best.
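The compression step listed above (sum node embeddings per token, then build a correlation matrix) could look roughly like the sketch below. It rests on several assumptions not confirmed by this summary: the embeddings are random toy values rather than outputs of a trained graph encoder, tokens are matched to nodes by exact string lookup, and the "correlation" is taken to be pairwise cosine similarity. The helper names are hypothetical.

```python
import numpy as np

# Hypothetical toy node embeddings; real ones would come from a graph
# encoder trained on ConceptNet and WordNet.
rng = np.random.default_rng(0)
DIM = 8
conceptnet = {"bank": rng.normal(size=DIM), "river": rng.normal(size=DIM)}
wordnet = {"bank": rng.normal(size=DIM), "money": rng.normal(size=DIM)}


def token_kg_vectors(tokens, sources, dim):
    """Sum each token's embeddings across all KG sources (zeros if unmatched)."""
    out = np.zeros((len(tokens), dim))
    for i, tok in enumerate(tokens):
        for table in sources:
            if tok in table:
                out[i] += table[tok]
    return out


def correlation_matrix(vectors, eps=1e-8):
    """Pairwise cosine similarity between token KG vectors.

    Assumption: cosine similarity stands in for the paper's correlation
    matrix, whose exact definition is not given in this summary.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, eps)
    return unit @ unit.T


tokens = ["the", "river", "bank"]
K = token_kg_vectors(tokens, [conceptnet, wordnet], DIM)  # (tokens, DIM)
C = correlation_matrix(K)                                 # (tokens, tokens)
print(K.shape, C.shape)
```

Tokens with no match in either graph (here "the") get a zero vector and a zero row in the matrix, so they contribute no knowledge signal.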

Key Findings

Deep infusion (vectors + attention across all blocks) raises GLUE task scores over baseline XLNet.

Numbers: XLNet MNLI: baseline 72.3% -> deep 88.53% (+16.23 pp) (Table 1)

Deep infusion improves multiple GLUE tasks, not just one.

Numbers: XLNet QNLI 84.17% -> 92.3% (+8.13 pp); QQP 74.79% -> 80.9% (+6.11 pp) (Table 1)

Knowledge infusion reduces labeled-data needs in practice.

Numbers: Average GLUE accuracy stays in the 70–80% range with 50% training data when infusion is used (DE@50, Table 2)
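A DE@k-style measurement can be approximated by training on a random k% subset and scoring on the usual evaluation set. The sketch below is an assumption about how such a check could be run, not the authors' protocol; `subsample`, `de_at_k`, and the majority-class stand-in model are hypothetical and exist only so the example executes.

```python
import random


def subsample(train_set, k, seed=0):
    """Return a random k% subset of the training examples."""
    if not 0 < k <= 100:
        raise ValueError("k must be in (0, 100]")
    n = max(1, round(len(train_set) * k / 100))
    return random.Random(seed).sample(train_set, n)


def de_at_k(train_set, eval_set, k, train_fn, eval_fn, seed=0):
    """Train on k% of the data and return the evaluation score."""
    model = train_fn(subsample(train_set, k, seed))
    return eval_fn(model, eval_set)


# Toy stand-ins so the sketch runs: a majority-class "model".
def train_majority(examples):
    labels = [y for _, y in examples]
    return max(set(labels), key=labels.count)


def accuracy(model, examples):
    return sum(model == y for _, y in examples) / len(examples)


train = [(i, 1) for i in range(80)] + [(i, 0) for i in range(20)]
dev = [(i, 1) for i in range(7)] + [(i, 0) for i in range(3)]
print(de_at_k(train, dev, 50, train_majority, accuracy))  # DE@50 on toy data
```

In practice `train_fn` would fine-tune the baseline or the knowledge-infused model, and comparing the two scores at the same k shows whether infusion actually saves labels.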

Results

Accuracy (XLNet, MNLI)

Value: 88.53%

Baseline: 72.3%

Accuracy (XLNet, QNLI)

Value: 92.3%

Baseline: 84.17%

Accuracy

Value: ≈80% (XLNet averaged)

Baseline: Full-data baseline varies; infusion yields accuracy in the 70s to 80s at 50% data

Who Should Care

What To Try In 7 Days

Run a shallow test: add per-token ConceptNet/WordNet vectors to the first transformer block and measure performance on GLUE or your own task.

Try semi-deep: add the node-embedding correlation matrix to the first block's attention and compare to shallow.

If you can, run deep infusion across all blocks on a small model and measure accuracy and DE@50 to check data savings.
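The three experiments above can be prototyped on a toy single-head transformer before touching a real model. The sketch below assumes both infusions are additive (KG vectors added to token states, the correlation matrix added to attention scores before the softmax) and that KG vectors already match the hidden size; the summary does not confirm either detail, and all names here are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def block(h, Wq, Wk, Wv, kg_vectors=None, kg_corr=None):
    """One single-head self-attention block with optional knowledge infusion."""
    if kg_vectors is not None:
        h = h + kg_vectors                  # vector infusion (assumed additive)
    scores = (h @ Wq) @ (h @ Wk).T / np.sqrt(h.shape[-1])
    if kg_corr is not None:
        scores = scores + kg_corr           # attention infusion (assumed additive bias)
    return h + softmax(scores) @ (h @ Wv)   # residual connection


def forward(h, weights, kg_vectors, kg_corr, strategy):
    """Run all blocks; any strategy other than the three named ones infuses nothing."""
    for i, (Wq, Wk, Wv) in enumerate(weights):
        use_vec = strategy == "deep" or (strategy in ("shallow", "semi-deep") and i == 0)
        use_att = strategy == "deep" or (strategy == "semi-deep" and i == 0)
        h = block(h, Wq, Wk, Wv,
                  kg_vectors if use_vec else None,
                  kg_corr if use_att else None)
    return h


rng = np.random.default_rng(0)
T, D, L = 3, 8, 2  # tokens, hidden size, number of blocks
h0 = rng.normal(size=(T, D))
weights = [tuple(rng.normal(size=(D, D)) * 0.1 for _ in range(3)) for _ in range(L)]
kg_vectors = rng.normal(size=(T, D))  # toy per-token KG vectors
kg_corr = np.eye(T)                   # toy correlation matrix
for s in ("baseline", "shallow", "semi-deep", "deep"):
    print(s, forward(h0, weights, kg_vectors, kg_corr, s).shape)
```

With a real model, the same switch would sit inside each layer's attention module, which is why the runtime cost of deep infusion grows with the number of blocks.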

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only evaluated on GLUE tasks; no downstream production case studies.
  • Knowledge compression uses summed ConceptNet and WordNet embeddings, which may blur distinct senses.
  • Experiments were run on a single A100 with standard model sizes; scalability costs are not measured.

When Not To Use

  • When no reliable knowledge graph exists for your domain.
  • When inference latency or memory constraints forbid modifying attention matrices at runtime.
  • When the task is unrelated to commonsense or factual context.

Failure Modes

  • Model may exploit spurious correlations between KG features and labels (overfitting to artifacts).
  • Errors or biases in the knowledge graphs will be injected into model predictions.
  • Summing multiple KG embeddings per token can blur meanings and hurt tasks needing fine-grained senses.

Core Entities

Models

  • XLNet
  • BERT
  • RoBERTa
  • ELECTRA
  • Longformer
  • KSAT (Knowledge-Infused Self-Attention Transformer)

Metrics

  • Accuracy
  • F1
  • CGKA
  • DE@k

Datasets

  • GLUE
  • MNLI
  • QNLI
  • WNLI
  • RTE
  • QQP

Benchmarks

  • GLUE