Graphusion: zero-shot LLM pipeline that builds and fuses scientific concept graphs for NLP tutoring

July 15, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

3

Authors

Rui Yang, Boming Yang, Sixun Ouyang, Tianwei She, Aosong Feng, Yuang Jiang, Freddy Lecue, Jinghui Lu, Irene Li

Why It Matters For Business

Graphusion cuts expert labeling by using LLMs plus a fusion step to build domain concept graphs, which can immediately improve tutoring and QA services without large supervised datasets.

Summary TLDR

Graphusion is a zero-shot pipeline that uses large language models to extract concept triplets from free text, then fuses them into a single knowledge graph through entity merging, conflict resolution, and new triplet inference. Applied to NLP papers and lecture materials, Graphusion with retrieval-augmented generation (RAG) improves link-prediction accuracy over supervised baselines by up to about 10 percentage points (0.8117 vs. 0.7088 for BERT). Experts rate the extracted concepts highly (2.92/3), though relation quality is weaker (2.37/3). The authors also release TutorQA, a 1,200-item, expert-verified QA benchmark for graph-backed tutoring, and show that KG augmentation markedly improves tutoring tasks (e.g., Task 1 accuracy rises from 69.2% to 92%).

Problem Statement

Automatic knowledge graph construction from free text usually extracts triplets locally, one sentence at a time, and depends on expert labeling. This leaves scientific concept graphs incomplete or inconsistent. The paper asks: can LLMs perform zero-shot extraction plus a global fusion step to build scientific KGs that are usable for educational QA?

Main Contribution

Graphusion: a zero-shot pipeline that extracts candidate triplets with LLMs and fuses them via entity merging, conflict resolution, and novel triplet inference.

TutorQA: a new expert-verified, NLP-focused tutoring benchmark with 1,200 QA pairs across six tasks for concept-graph reasoning.

Evaluation and ablations showing LLMs (especially GPT-4/4o) + RAG recover concept graphs better than several supervised baselines and that fusion improves relation quality.
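To make the extraction step in the first contribution concrete, here is a minimal sketch of turning raw LLM text into candidate triplets before fusion. The one-triplet-per-line `(head, relation, tail)` format, the `Triplet` type, the `parse_triplets` helper, and the relation labels in the sample are all assumptions for illustration; they are not the paper's prompt or output format.

```python
import re
from typing import List, NamedTuple


class Triplet(NamedTuple):
    head: str
    relation: str
    tail: str


# Assumed output format: one "(head, relation, tail)" per line.
_TRIPLET_RE = re.compile(r"^\s*\(([^,()]+),([^,()]+),([^,()]+)\)\s*$")


def parse_triplets(llm_output: str) -> List[Triplet]:
    """Parse well-formed lines into triplets and skip everything else."""
    triplets = []
    for line in llm_output.splitlines():
        match = _TRIPLET_RE.match(line)
        if match:
            parts = [part.strip() for part in match.groups()]
            if all(parts):  # ignore triplets with an empty field
                triplets.append(Triplet(*parts))
    return triplets


# Invented sample text standing in for an LLM response.
sample_output = """(word embeddings, used for, text classification)
not a triplet
(BERT, is a, language model)"""
parsed = parse_triplets(sample_output)
```

Dropping malformed lines instead of raising keeps a single bad generation from blocking the rest of a batch; whether that trade-off is right depends on how much output you can afford to lose.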

Key Findings

LLM zero-shot link prediction with retrieval outperforms supervised baselines on LectureBankCD (NLP).

Numbers: GPT-4o (RAG) accuracy 0.8117 vs. BERT 0.7088 (+0.1029)

Experts rate extracted concept entities high but relations lower.

Numbers: Entity rating 2.92/3, relation rating 2.37/3 (GPT-4o)

KG augmentation substantially improves tutoring QA across several tasks.

Numbers: Task 1 accuracy 69.2% -> 92%; Task 2 similarity 64.42 -> 80.29

The graph fusion module materially improves relation quality versus extraction alone.
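The first finding, zero-shot link prediction with retrieval, can be pictured with a small sketch of how retrieved snippets might be folded into a prompt. Everything here is hypothetical: the keyword-overlap `retrieve` function stands in for a real retriever, the prompt wording in `build_lp_prompt` is invented and is not the paper's LP prompt, and the sketch assumes the relation of interest is a prerequisite link between two concepts. The LLM call itself is left out.

```python
from typing import List


def retrieve(query_terms: List[str], corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by how many query terms they contain."""
    def score(doc: str) -> int:
        text = doc.lower()
        return sum(term.lower() in text for term in query_terms)

    ranked = sorted(corpus, key=score, reverse=True)
    # Drop documents with no overlap so they do not add noise to the prompt.
    return [doc for doc in ranked[:k] if score(doc) > 0]


def build_lp_prompt(concept_a: str, concept_b: str, snippets: List[str]) -> str:
    """Assemble a zero-shot prompt asking whether A is a prerequisite of B."""
    context = "\n".join(f"- {s}" for s in snippets) or "- (no context retrieved)"
    return (
        "Context:\n"
        f"{context}\n\n"
        f"Question: Is '{concept_a}' a prerequisite for learning '{concept_b}'?\n"
        "Answer with Yes or No."
    )


# Invented three-document corpus for illustration.
corpus = [
    "Word embeddings map tokens to vectors.",
    "Parsing builds syntax trees.",
    "Text classification often uses word embeddings as features.",
]
snippets = retrieve(["word embeddings", "text classification"], corpus)
prompt = build_lp_prompt("word embeddings", "text classification", snippets)
```

Filtering out zero-overlap documents is a crude version of the retrieval filtering that matters when long or noisy documents are in the corpus.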

Results

Accuracy (link prediction, LectureBankCD)

Value: GPT-4o (RAG) 0.8117

Baseline: BERT 0.7088

Accuracy (TutorQA Task 1, %)

Value: 92.0

Baseline: Zero-shot 69.2

Human expert ratings (concepts / relations)

Value: Concepts 2.92 / Relations 2.37 (out of 3)
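The two accuracy results above are reported as absolute differences, which is easy to confuse with relative improvement. The short calculation below separates the two, using only the values and baselines listed in this section; the `gains` helper is a name introduced here for illustration.

```python
from typing import Tuple


def gains(baseline: float, value: float) -> Tuple[float, float]:
    """Return (absolute difference, relative improvement over the baseline)."""
    absolute = value - baseline
    relative = absolute / baseline
    return absolute, relative


# Link prediction: GPT-4o (RAG) vs. BERT, accuracy on a 0-1 scale.
lp_abs, lp_rel = gains(0.7088, 0.8117)

# TutorQA Task 1: KG-augmented vs. zero-shot, accuracy in percent.
task1_abs, task1_rel = gains(69.2, 92.0)

print(f"Link prediction: +{lp_abs:.4f} absolute, {lp_rel:.1%} relative")
print(f"Task 1: +{task1_abs:.1f} points absolute, {task1_rel:.1%} relative")
```

So the "about 10" figure for link prediction is roughly 10 percentage points absolute, which corresponds to a relative gain of around 14.5% over the BERT baseline.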

What To Try In 7 Days

Run BERTopic on your domain corpus to create seed concepts and sample a few abstracts.

Use GPT-4 or GPT-4o with the provided LP and Extraction prompts to generate candidate triplets.

Apply a simple fusion step (merge synonyms, resolve conflicts) and inspect top 100 triplets for entity correctness and relation errors with a subject-matter expert (SME).
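For the third step, a rule-based stand-in for fusion could look like the sketch below. Note the swap: Graphusion performs fusion with an LLM, whereas this sketch uses a hand-written alias table for synonym merging and a majority vote for conflicting relations. The `aliases` table, the `fuse` function, and the sample triplets are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def canonical(name: str, aliases: Dict[str, str]) -> str:
    """Map a surface form to its canonical concept name (case-insensitive)."""
    key = name.strip().lower()
    return aliases.get(key, key)


def fuse(triplets: List[Triplet], aliases: Dict[str, str]) -> List[Tuple[Triplet, int]]:
    """Merge synonyms, keep the majority relation per concept pair, rank by support."""
    votes: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for head, relation, tail in triplets:
        h, t = canonical(head, aliases), canonical(tail, aliases)
        if h != t:  # drop self-loops created by merging
            votes[(h, t)][relation] += 1

    fused = []
    for (h, t), counter in votes.items():
        relation, support = counter.most_common(1)[0]
        fused.append(((h, relation, t), support))
    # Most-supported triplets first, so review starts with the strongest claims.
    return sorted(fused, key=lambda item: item[1], reverse=True)


# Hand-written alias table and invented candidate triplets.
aliases = {"nn": "neural network", "neural networks": "neural network"}
candidates = [
    ("NN", "used for", "text classification"),
    ("neural networks", "used for", "text classification"),
    ("Neural Network", "part of", "text classification"),
    ("tokenization", "used for", "parsing"),
]
fused = fuse(candidates, aliases)
top_for_review = fused[:100]  # hand these to the SME
```

This sketch treats (A, B) and (B, A) as different pairs, so direction conflicts are not resolved, and an overly aggressive alias table will over-merge concepts. Both are worth checking during the SME pass.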

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Quality and scale of the input corpus strongly affect graph quality; authors used abstracts only.
  • Relation extraction is less reliable than entity extraction and benefits from expert review.
  • Evaluation relies heavily on human experts; automatic metrics struggle with novel LLM outputs (for example, the similarity score ignores order).
  • RAG with long noisy documents can hurt performance; careful retrieval filtering is needed.
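The point about the similarity score ignoring order can be illustrated with a toy example. The sketch assumes the score compares pooled embeddings of a concept list, which is one way such a metric can lose order information; the exact computation used in the paper is not reproduced here, and the 2-D vectors are invented stand-ins for BERT embeddings.

```python
import math
from typing import Dict, List

# Toy 2-D "embeddings"; a real score would use BERT vectors.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "tokenization": [1.0, 0.0],
    "embeddings": [0.6, 0.8],
    "transformers": [0.0, 1.0],
}


def mean_pool(concepts: List[str]) -> List[float]:
    """Average the vectors of all concepts, discarding their order."""
    vectors = [TOY_EMBEDDINGS[c] for c in concepts]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


reference_path = ["tokenization", "embeddings", "transformers"]
reversed_path = list(reversed(reference_path))

# The reversed list gets an essentially perfect score despite the wrong order.
score = cosine(mean_pool(reference_path), mean_pool(reversed_path))
```

If the order of an answer matters for your task, a pooled-similarity metric of this kind needs to be paired with an order-aware check or with human review.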

When Not To Use

  • When you need provably correct relations without human verification.
  • On very small or noisy corpora where retrieval will add noise.
  • In high-stakes domains that require audited provenance for every relation.

Failure Modes

  • Hallucinated or incorrect relations created by the LLM.
  • Merging non-equivalent concepts (over-merging) or failing to merge synonyms.
  • Noise from long or low-quality retrieved documents hurting link prediction.
  • Incorrect concept granularity (too broad or too specific concepts).

Core Entities

Models

  • LLaMA2-70b
  • LLaMA3-70b
  • GPT-3.5
  • GPT-4
  • GPT-4o
  • BERT

Metrics

  • Accuracy
  • F1
  • Similarity score (cosine similarity of BERT embeddings)
  • Hit rate
  • Human rating (1-3 for KGC)
  • Human rating (1-5 for Task6 criteria)
  • Kappa (inter-annotator agreement)
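As a reference for the last metric in the list, the sketch below computes Cohen's kappa for two annotators. This assumes the two-rater Cohen variant; the source only says "Kappa", so a different variant (for example, one for more than two raters) may have been used. The ratings are invented toy data on the 1-3 scale, not values from the paper.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty item set.")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random with their own label rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented ratings for six triplets.
ratings_a = [3, 3, 2, 1, 3, 2]
ratings_b = [3, 2, 2, 1, 3, 3]
kappa = cohen_kappa(ratings_a, ratings_b)
```

In this toy case the raters agree on 4 of 6 items, but kappa is noticeably lower than 0.67 because some of that agreement would be expected by chance.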

Datasets

  • LectureBankCD
  • ACL proceedings (2017-2023 abstracts)
  • TutorialBank
  • NLP-Papers
  • TutorQA (this work)

Benchmarks

  • TutorQA
  • LectureBankCD (link prediction)