Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.3
Citation Count
3
Why It Matters For Business
Injecting knowledge-graph embeddings into transformer internals can raise NLU accuracy and reduce the amount of labeled data needed, which helps teams deliver stronger models faster on tasks where domain or commonsense context matters.
Summary TLDR
The paper defines a clear, modular way to add knowledge-graph information to transformer models. It compresses ConceptNet and WordNet into node vectors and a correlation matrix, then injects the vectors into latent token representations and the correlation matrix into self-attention matrices. The authors compare three strategies: shallow (vectors at the first block), semi-deep (vectors and attention at the first block), and deep (vectors and attention across all blocks). They show that deep infusion gives the largest gains on GLUE tasks and helps reduce labeled-data needs. They also propose two new diagnostics: CGKA (checks link-prediction and GLUE performance together) and DE@k (performance when training on k% of the training data).
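The two injection points described above can be illustrated with a minimal single-head sketch. It is not the authors' implementation: the summary does not say which operator combines knowledge with the model's internals, so addition is assumed here for both the vectors and the correlation matrix, and the function name `infused_self_attention` and all parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def infused_self_attention(tokens, kg_vectors, kg_corr, w_q, w_k, w_v):
    """Single-head self-attention with two optional knowledge inputs.

    tokens:     (n, d) latent token representations
    kg_vectors: (n, d) per-token knowledge vectors, or None
    kg_corr:    (n, n) token-by-token knowledge correlation, or None
    w_q/w_k/w_v: (d, d) projection matrices
    """
    # Latent-representation infusion (ASSUMPTION: element-wise addition).
    h = tokens if kg_vectors is None else tokens + kg_vectors

    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # Attention infusion (ASSUMPTION: added to the scores before softmax).
    if kg_corr is not None:
        scores = scores + kg_corr

    weights = softmax(scores, axis=-1)
    return weights @ v, weights
```

Passing `None` for both knowledge inputs recovers plain self-attention, which makes it easy to compare the baseline against vector-only and vector-plus-attention variants on the same weights.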
Problem Statement
Transformers can miss implicit or missing context and therefore hallucinate or give unsafe or unhelpful outputs. Existing knowledge-augmentation methods are ad hoc: they add knowledge in different places without a systematic view, and without diagnostics to tell whether performance gains are real or come from exploiting dataset artifacts.
Main Contribution
A modular taxonomy of where to infuse knowledge in transformers: latent representations (vectors) vs inductive biases (self-attention matrices).
Three concrete infusion strategies: shallow (vectors at first block), semi-deep (vectors+attention at first block), deep (vectors+attention across all blocks).
A simple compression pipeline: create node embeddings from ConceptNet and WordNet, sum per-token, and form a node-embedding correlation matrix for attention infusion.
Two evaluation diagnostics: CGKA (combines graph-encoder link-prediction and GLUE accuracy) and DE@k (data efficiency when training on k% of the training data).
Empirical comparison on multiple transformer families (XLNet, BERT, RoBERTa, ELECTRA, Longformer) on GLUE showing consistent gains, with deep infusion best.
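The compression pipeline listed above (node embeddings, summed per token, then a correlation matrix) can be sketched as follows. This is a toy illustration under stated assumptions: the embedding tables are plain dictionaries standing in for real ConceptNet and WordNet node embeddings, tokens with no entry get a zero vector, and "correlation" is taken to be cosine similarity, which the summary does not confirm. The function names are hypothetical.

```python
import numpy as np


def token_knowledge_vectors(tokens, conceptnet, wordnet, dim):
    """Sum the ConceptNet and WordNet node vectors found for each token.

    conceptnet / wordnet: dict mapping token -> (dim,) array (toy stand-ins).
    Tokens missing from both tables get a zero vector (ASSUMPTION).
    """
    rows = []
    for tok in tokens:
        vec = np.zeros(dim)
        for table in (conceptnet, wordnet):
            if tok in table:
                vec = vec + table[tok]
        rows.append(vec)
    return np.stack(rows)


def correlation_matrix(vectors, eps=1e-8):
    """Pairwise cosine similarity of token knowledge vectors.

    ASSUMPTION: cosine similarity is used as the correlation measure.
    Zero vectors produce all-zero rows and columns.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, eps)
    return unit @ unit.T
```

The resulting matrix has one row and column per token, so it has the same shape as a self-attention score matrix for the same sequence.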
Key Findings
Deep infusion (vectors + attention across all blocks) raises GLUE task scores over baseline XLNet.
Deep infusion improves multiple GLUE tasks, not just one.
Knowledge infusion reduces labeled-data needs in practice.
Results
Accuracy
Who Should Care
What To Try In 7 Days
Run a shallow test: add per-token ConceptNet/WordNet vectors to the first transformer block and measure performance on GLUE or your own task.
Try semi-deep: add the node-embedding correlation matrix to the first block's attention and compare to shallow.
If you can, run deep infusion across all blocks on a small model and measure accuracy and DE@50 to check data savings.
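The three experiments above can be organized with two small helpers, sketched here with only the standard library. `infusion_plan` encodes which blocks receive vectors and attention infusion under each strategy as described in this summary. `de_at_k` reads DE@k as "the task metric after training on a random k% of the training set"; the sampling scheme, the fixed seed, and the user-supplied `train_and_eval` callable are assumptions, not the paper's protocol.

```python
import random


def infusion_plan(strategy, num_blocks):
    """Return one dict per transformer block saying what is infused there."""
    on = {"vectors": True, "attention": True}
    off = {"vectors": False, "attention": False}
    if strategy == "deep":
        return [dict(on) for _ in range(num_blocks)]
    if strategy == "semi-deep":
        first = dict(on)
    elif strategy == "shallow":
        first = {"vectors": True, "attention": False}
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Shallow and semi-deep only touch the first block.
    return [first] + [dict(off) for _ in range(num_blocks - 1)]


def de_at_k(train_examples, k, train_and_eval, seed=0):
    """Score a model trained on a random k% subset of the training data.

    train_and_eval: hypothetical callable that takes a list of examples,
    trains a model, and returns its evaluation metric.
    """
    if not 0 < k <= 100:
        raise ValueError("k must be in (0, 100]")
    size = max(1, len(train_examples) * k // 100)
    subset = random.Random(seed).sample(list(train_examples), size)
    return train_and_eval(subset)
```

Running `de_at_k(..., k=50, ...)` for both the baseline and an infused model, with the same seed, gives a like-for-like DE@50 comparison.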
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only evaluated on GLUE tasks; no downstream production case studies.
- Knowledge compression uses summed ConceptNet and WordNet embeddings, which may blur distinct senses.
- Experiments were run on a single A100 with standard model sizes; scaling costs were not measured.
When Not To Use
- When no reliable knowledge graph exists for your domain.
- When inference latency or memory constraints forbid modifying attention matrices at runtime.
- When the task is unrelated to commonsense or factual context.
Failure Modes
- Model may exploit spurious correlations between KG features and labels (overfitting to artifacts).
- Errors or biases in the knowledge graphs will be injected into model predictions.
- Summing multiple KG embeddings per token can blur meanings and hurt tasks needing fine-grained senses.
Core Entities
Models
- XLNet
- BERT
- RoBERTa
- ELECTRA
- Longformer
- KSAT (Knowledge-Infused Self-Attention Transformer)
Metrics
- Accuracy
- F1
- CGKA
- DE@k
Datasets
- GLUE
- MNLI
- QNLI
- WNLI
- RTE
- QQP
Benchmarks
- GLUE

