Teach an LLM to read graph structure with two-stage instruction tuning and a tiny alignment projector

October 19, 2023 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

15

Authors

Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, Chao Huang

Links

Abstract / PDF

Why It Matters For Business

GraphGPT enables LLMs to use graph structure with low-cost tuning, improving cross-dataset predictions and saving compute by using compact graph tokens instead of long text prompts.

Summary TLDR

GraphGPT injects graph structure into an LLM by projecting precomputed graph embeddings into special "graph tokens" and instruction-tuning the LLM in two stages: (1) self-supervised graph matching to align graph tokens with text, and (2) task-specific instruction tuning. Freezing the LLM and graph encoder and tuning only a lightweight projector keeps costs low. GraphGPT improves supervised and zero-shot node classification and link prediction on OGB-arxiv, PubMed, and Cora versus standard GNNs and base LLMs, and it uses Chain-of-Thought (CoT) distillation to boost performance on hard tasks.

Problem Statement

GNNs need labeled data to generalize well, while pure-text prompts for LLMs either lose graph structure or become too long. The question is how to make LLMs understand graph structure so that they generalize across graph tasks and transfer zero-shot without large labeled datasets.

Main Contribution

A text-graph grounding scheme that encodes graph structure as compact graph tokens aligned with text embeddings.

A dual-stage graph instruction tuning scheme: (1) self-supervised graph matching to align structure and language; (2) task-specific instruction tuning for node classification and link prediction.

A lightweight graph-text alignment projector that allows tuning while the LLM and graph encoder stay frozen, plus CoT distillation from GPT-3.5 to improve step-by-step reasoning.
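As a rough illustration of the alignment projector, the sketch below maps precomputed node embeddings into an LLM-sized embedding space with a single linear layer and splices the resulting graph tokens into a sequence of text embeddings. The dimensions, the function names (`make_projector`, `project`, `splice_graph_tokens`), and the splice position are assumptions made for this example; the summary does not spell out the projector's exact form.

```python
import random


def make_projector(d_graph, d_llm, seed=0):
    """Create a linear projector (weight and bias) with small random weights."""
    rng = random.Random(seed)
    scale = 1.0 / d_graph ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(d_graph)]
              for _ in range(d_llm)]
    bias = [0.0] * d_llm
    return weight, bias


def project(node_embeddings, weight, bias):
    """Map each graph-encoder embedding to one 'graph token' of size d_llm."""
    return [
        [sum(w * x for w, x in zip(row, emb)) + b for row, b in zip(weight, bias)]
        for emb in node_embeddings
    ]


def splice_graph_tokens(text_embeddings, graph_tokens, position):
    """Insert the graph tokens into the text embedding sequence at `position`."""
    return text_embeddings[:position] + graph_tokens + text_embeddings[position:]


# Toy sizes (assumed): 4-dim graph embeddings, 8-dim LLM embeddings.
d_graph, d_llm = 4, 8
weight, bias = make_projector(d_graph, d_llm)

# Stand-in for the frozen graph encoder's output on a 3-node subgraph.
node_embeddings = [[0.1 * (i + j) for j in range(d_graph)] for i in range(3)]
graph_tokens = project(node_embeddings, weight, bias)

# Stand-in for the frozen LLM's embeddings of a 5-token text prompt.
text_embeddings = [[0.0] * d_llm for _ in range(5)]
sequence = splice_graph_tokens(text_embeddings, graph_tokens, position=2)

# Only the projector would be trained in this setup.
trainable = d_llm * d_graph + d_llm
print(len(sequence), "sequence positions;", trainable, "trainable parameters")
```

In this toy setup each node costs exactly one sequence position regardless of how long its text description would be, which is the intuition behind the compact-token claim.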

Key Findings

GraphGPT improves zero-shot transfer accuracy compared to base LLMs and GNNs on evaluated benchmarks.

Numbers: Arxiv-PubMed zero-shot: GraphGPT-7B-v1.5-std Acc=0.7011 vs vicuna-7B-v1.5 Acc=0.6351 (Δ=+0.066)

Self-supervised graph matching stage materially improves supervised accuracy and zero-shot stability.

Numbers: Arxiv-Arxiv Acc drops from 0.6258 (ours) to 0.4962 (w/o GS) (Δ=-0.1296)
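One way a self-supervised matching sample could be assembled from unlabeled node texts is sketched below. It assumes the task is posed as re-ordering a shuffled list of node texts so that it follows the order of the graph tokens. The `<graph>` placeholder string, the prompt wording, the helper name `build_matching_sample`, and the example titles are all hypothetical.

```python
import random

GRAPH_TOKEN = "<graph>"  # hypothetical placeholder; one per node in the subgraph


def build_matching_sample(node_texts, seed=0):
    """Build one matching example: no labels are needed, only node texts."""
    rng = random.Random(seed)
    order = list(range(len(node_texts)))
    rng.shuffle(order)  # order[pos] = index of the node shown at position pos
    shuffled = [node_texts[i] for i in order]

    # answer[k] = position in the shuffled list that belongs to graph token k
    position_of = {node: pos for pos, node in enumerate(order)}
    answer = [position_of[k] for k in range(len(node_texts))]

    listing = "\n".join(f"{pos + 1}. {text}" for pos, text in enumerate(shuffled))
    prompt = (
        "Graph tokens: " + " ".join([GRAPH_TOKEN] * len(node_texts)) + "\n"
        "Shuffled node texts:\n" + listing + "\n"
        "Reorder the texts so that they follow the order of the graph tokens."
    )
    target = "\n".join(f"{k + 1}. {node_texts[k]}" for k in range(len(node_texts)))
    return {"prompt": prompt, "target": target,
            "shuffled": shuffled, "answer": answer}


# Made-up titles standing in for node text attributes.
titles = ["Example paper A", "Example paper B", "Example paper C"]
sample = build_matching_sample(titles, seed=1)
print(sample["prompt"])
print(sample["target"])
```

Because the supervision signal is just the original ordering, samples like this can be generated from any text-attributed graph without task labels.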

Freezing LLM and graph encoder and tuning only the projector cuts tuned parameters by >50× and avoids OOM.

Numbers: Tuned params: 131,612,672 (freeze) vs 6,607,884,288 (tune) (≈50.2× reduction)
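The reduction factor can be checked directly from the two parameter counts reported above; the variable names are chosen for this example only.

```python
tuned_frozen = 131_612_672   # tuned parameters with LLM and graph encoder frozen
tuned_full = 6_607_884_288   # tuned parameters when the full LLM is tuned

reduction = tuned_full / tuned_frozen
fraction = tuned_frozen / tuned_full

print(f"reduction factor: {reduction:.1f}x")      # 50.2x
print(f"share of parameters tuned: {fraction:.2%}")
```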

Chain-of-Thought distillation helps on complex, high-class-count datasets (Cora).

Numbers: Arxiv-Cora Acc: GraphGPT-7B-v1.5-cot=0.1813 vs std=0.1256 (Δ=+0.0557)

Graph tokens drastically reduce token usage versus text-based structure prompts.

Numbers: 103-node subgraph: graph tokens use 750 tokens vs 4,649 tokens for a text-based prompt
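A quick calculation puts the two token counts in perspective. The 4,096-token context window used below is an assumption for illustration; check the actual limit of the model you deploy.

```python
graph_token_prompt = 750   # tokens for the 103-node subgraph with graph tokens
text_prompt = 4_649        # tokens for the same subgraph described in text
context_window = 4_096     # assumed context limit, not taken from the source

saved = text_prompt - graph_token_prompt
print(f"tokens saved: {saved}")                              # 3899
print(f"reduction: {saved / text_prompt:.1%}")                # 83.9%
print(f"ratio: {text_prompt / graph_token_prompt:.1f}x")      # 6.2x

fits_graph = graph_token_prompt <= context_window
fits_text = text_prompt <= context_window
print(f"fits in context: graph tokens={fits_graph}, text prompt={fits_text}")
```

Under that assumed limit, the text-based prompt would not fit at all, while the graph-token prompt leaves most of the window free.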

Results

  • Accuracy: 0.7511 (baseline: vicuna-7B-v1.5)
  • Accuracy: 0.7011 (baseline: vicuna-7B-v1.5)
  • Accuracy: 0.1813 (baseline: GraphGPT-7B-v1.5-std)
  • Link prediction AUC: 0.8246 (baseline: Node2Vec, 0.6535)

Who Should Care

What To Try In 7 Days

Run a small proof of concept: freeze your LLM and graph encoder, then train a linear projector on unlabeled subgraphs sampled from your graph for a target node classification task.

Compare tokenized subgraph inputs vs text-based graph prompts to measure token and latency savings.

If classes are many or reasoning is needed, add CoT-style distilled instructions from a stronger LLM to your instruction mix.
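The first step above, training only a projector while everything else stays frozen, can be prototyped without any deep learning framework. The toy below is a stand-in rather than the paper's training objective: it fits a bias-free linear projector by plain gradient descent on a mean squared error between projected graph embeddings and fixed text-side targets. All sizes, the synthetic data, and the learning rate are assumptions.

```python
import random

rng = random.Random(0)
D_GRAPH, D_TEXT, N = 4, 3, 32

# Frozen pieces (synthetic stand-ins): graph-encoder outputs and text targets.
graph_emb = [[rng.uniform(-1, 1) for _ in range(D_GRAPH)] for _ in range(N)]
true_map = [[rng.uniform(-1, 1) for _ in range(D_GRAPH)] for _ in range(D_TEXT)]
text_emb = [[sum(w * x for w, x in zip(row, g)) for row in true_map]
            for g in graph_emb]

# The only trainable parameters: the projector weights.
W = [[0.0] * D_GRAPH for _ in range(D_TEXT)]


def project(W, g):
    return [sum(w * x for w, x in zip(row, g)) for row in W]


def mse(W):
    """Squared error summed over output dimensions, averaged over samples."""
    total = 0.0
    for g, t in zip(graph_emb, text_emb):
        total += sum((p - y) ** 2 for p, y in zip(project(W, g), t))
    return total / N


def gradient_step(W, lr=0.1):
    """One full-batch gradient descent update on the projector weights."""
    grad = [[0.0] * D_GRAPH for _ in range(D_TEXT)]
    for g, t in zip(graph_emb, text_emb):
        err = [p - y for p, y in zip(project(W, g), t)]
        for i in range(D_TEXT):
            for j in range(D_GRAPH):
                grad[i][j] += 2.0 * err[i] * g[j] / N
    return [[W[i][j] - lr * grad[i][j] for j in range(D_GRAPH)]
            for i in range(D_TEXT)]


initial_loss = mse(W)
for _ in range(200):
    W = gradient_step(W)
final_loss = mse(W)
print(f"loss {initial_loss:.4f} -> {final_loss:.6f}")
```

In a real experiment the targets would come from the frozen LLM's side of the alignment and the loss from the instruction-tuning objective; the point of the toy is only that the trainable state is a single small matrix.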

Agent Features

Frameworks

  • Dual-stage instruction tuning

Architectures

  • LLM + pre-trained GNN encoder

Optimization Features

Token Efficiency

  • Graph tokens: 750 vs text prompts: 4,649 tokens for a 103-node subgraph

Infra Optimization

  • Low batch-size training feasible due to small tuned parameter set

Model Optimization

  • Freeze large model weights; only tune projector

System Optimization

  • Runs on a single 40 GB A100 when the LLM is frozen; tuning the full LLM runs out of memory (OOM)

Training Optimization

  • Self-supervised graph matching uses unlabeled graphs as instructions
  • Two-stage tuning reduces overfitting and supports multitask mixing

Inference Optimization

  • Compact graph tokens reduce input length and inference latency

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations focus on citation-like graphs (OGB-arxiv, PubMed, Cora); other graph types not tested.
  • Method depends on a pre-trained graph encoder; end-to-end learning is not evaluated.
  • Base LLM choice affects results; improvements reported for vicuna/baichuan variants only.
  • CoT distillation requires access to a stronger closed-source LLM for best effect.

When Not To Use

  • If your graph domain is very different from citation/text-attributed graphs and no suitable graph encoder exists.
  • If your use case requires fine-tuning the LLM weights end-to-end on very large data; this method freezes the LLM, so it does not reduce the memory cost of full tuning.
  • When legal/operational rules forbid using closed-source models for distillation.

Failure Modes

  • Overfitting when skipping self-supervised graph matching (worse zero-shot transfer).
  • Poor performance on problems with a very large number of classes unless CoT or richer instruction data is added.
  • Misalignment if the projector cannot map graph embeddings into the LLM token space for novel graph structures.

Core Entities

Models

  • GraphGPT-7B-v1.5
  • GraphGPT-7B-v1.1
  • vicuna-7B-v1.5
  • vicuna-7B-v1.1
  • baichuan-7B
  • GPT-3.5 (used for CoT distillation)

Metrics

  • Accuracy
  • Macro-F1
  • AUC
  • AP
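For readers comparing the two classification metrics listed above, here is a small reference sketch in plain Python. It uses the standard definitions (accuracy as the share of correct predictions, Macro-F1 as the unweighted mean of per-class F1); the labels in the usage lines are made up.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for a 3-class toy problem.
y_true = [0, 0, 1, 2]
y_pred = [0, 1, 1, 2]
print(f"accuracy={accuracy(y_true, y_pred):.4f}")
print(f"macro_f1={macro_f1(y_true, y_pred):.4f}")
```

On datasets with many classes, such as the 70-class Cora variant, Macro-F1 can diverge noticeably from accuracy because every class contributes equally to the average.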

Datasets

  • OGB-arxiv
  • PubMed
  • Cora (expanded, 70 classes)

Benchmarks

  • Supervised node classification
  • Zero-shot node classification
  • Link prediction

Context Entities

Models

  • GraphSAGE
  • GCN
  • GAT
  • RevGNN
  • DGI
  • GKD
  • GLNN
  • NodeFormer
  • DIFFormer
  • Node2Vec
  • MLP