Inject low-rank, input-dependent prompts into aggregated features to recover accuracy of low-bit quantized GNNs

January 21, 2026 · 6 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Chenyu Liu, Haige Li, Luca Rossi

Links

Abstract / PDF

Why It Matters For Business

LoRAP lets teams deploy low-bit GNNs that retain much of the original accuracy while keeping the memory and speed gains of quantization. It is a small, trainable add-on compatible with existing quantization-aware training (QAT) pipelines.

Summary TLDR

Quantizing Graph Neural Networks (GNNs) to low bit-widths hurts accuracy because feature quantization errors accumulate during neighbor aggregation. LoRAP inserts small, input-dependent low-rank prompts after aggregation to correct those errors directly. Across 4 quantization-aware training (QAT) frameworks and many datasets, LoRAP combined with node prompts (GPF-LoRAP) consistently recovers accuracy for INT4 and other low-bit GNNs, sometimes exceeding FP32, and adds little compute and memory when a fused GPU kernel is used.

Problem Statement

Low-bit quantization of GNNs saves memory and improves speed but causes large accuracy drops, because quantized node features produce biased aggregated messages. Pre-aggregation node prompts cannot reliably fix these topology-amplified errors. What is needed is a lightweight, input-aware way to correct aggregation-level quantization error during training.
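The effect described above can be illustrated with a toy example. The sketch below is not taken from the paper: the 4-bit uniform quantizer, the feature values, and the star-shaped graph are all assumptions chosen so that every feature rounds in the same direction. It shows how a per-node rounding bias grows with node degree under sum aggregation.

```python
def quantize(x, bits=4, lo=-1.0, hi=1.0):
    """Uniform quantizer (illustrative): snap x to one of 2**bits levels in [lo, hi]."""
    step = (hi - lo) / (2 ** bits - 1)
    x = min(max(x, lo), hi)
    return lo + round((x - lo) / step) * step

def sum_aggregate(features, neighbors):
    """Sum the scalar features of each node's neighbors."""
    return [sum(features[j] for j in nbrs) for nbrs in neighbors]

# Hypothetical star graph: node 0 is a hub with six neighbors,
# nodes 1..6 are leaves that only see the hub.
neighbors = [[1, 2, 3, 4, 5, 6], [0], [0], [0], [0], [0], [0]]
features = [0.05, 0.02, 0.03, 0.04, 0.05, 0.06, 0.01]

exact = sum_aggregate(features, neighbors)
quantized = sum_aggregate([quantize(x) for x in features], neighbors)
errors = [abs(q - e) for q, e in zip(quantized, exact)]

print(f"hub error:  {errors[0]:.4f}")
print(f"leaf error: {errors[1]:.4f}")
```

In this toy setting each feature is rounded up to the same level, so the hub's aggregated error is the sum of six same-sign errors while a leaf carries only one. A correction applied after aggregation can target that accumulated bias per node.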

Main Contribution

Introduce LoRAP: post-aggregation, input-dependent prompts built from a small set of low-rank basis vectors.

Show theoretically that post-aggregation prompts decouple the correction from the graph operator and allow node-specific bias correction.

Integrate LoRAP into existing QAT pipelines and provide a fused Triton kernel to cut prompt-injection latency.

Empirically validate across 4 QAT methods, 3 GNN architectures, and large/small datasets that LoRAP consistently recovers low-bit performance.
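A minimal sketch of what a post-aggregation, input-dependent low-rank prompt could look like is given below. The exact parameterization used by LoRAP is not stated in this summary, so the form here is an assumption: attention weights over k bases are computed from the aggregated feature, and the k×d basis matrix is factored as U (k×r) times V (r×d). The names `A`, `U`, `V`, and `post_aggregation_prompt` are hypothetical.

```python
import math
import random

def matvec(M, v):
    """Matrix (rows x n) times vector (n) -> vector (rows)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vecmat(v, M):
    """Vector (n) times matrix (n x m) -> vector (m)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def post_aggregation_prompt(h, A, U, V):
    """Add an input-dependent prompt to one node's aggregated feature h.

    Assumed form: alpha = softmax(A h), prompt = alpha U V, output = h + prompt.
    The prompt is computed in full precision and added after aggregation.
    """
    alpha = softmax(matvec(A, h))  # k weights that depend on the input h
    coeff = vecmat(alpha, U)       # r coefficients in the low-rank space
    prompt = vecmat(coeff, V)      # d-dimensional correction
    return [x + p for x, p in zip(h, prompt)]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, k, r = 8, 40, 2  # k and r follow the k≈40, r≈2 suggestion; d is arbitrary
A = rand_matrix(k, d, rng)
U = rand_matrix(k, r, rng)
V = rand_matrix(r, d, rng)

h_a = [0.1 * i for i in range(d)]
h_b = [-0.1 * i for i in range(d)]
out_a = post_aggregation_prompt(h_a, A, U, V)
out_b = post_aggregation_prompt(h_b, A, U, V)

# Trainable parameters added per layer under this assumed parameterization.
extra_params = k * d + k * r + r * d
print(extra_params)
```

Because the weights `alpha` depend on `h`, two nodes with different aggregated features receive different corrections, which is the node-specific behavior the contributions describe. In training, `A`, `U`, and `V` would be optimized jointly with the quantized weights.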

Key Findings

GPF-LoRAP can recover severe INT4 accuracy losses on small benchmarks.

Numbers: REDDIT-BINARY, QAT-W4A4: +17.2% accuracy

LoRAP sometimes surpasses full-precision accuracy on evaluated tasks.

Numbers: Cora, GIN A2Q + GPF-LoRAP: 78.5% vs FP32 77.6%

The fused LoRAP kernel roughly halves prompt-injection latency versus a naive implementation.

Numbers: prompt kernel 93.5 µs → 44.5 µs (2.1×)

The training overhead of LoRAP is small in practice.

Numbers: A2Q training 1040.29 s → 1048.08 s with GPF-LoRAP (+7.79 s)

Results

Accuracy

Value: 78.5%

Baseline: FP32 77.6%

Accuracy

Value: 69.6% (GPF-LoRAP)

Baseline: None 52.4%

Accuracy

Value: 73.6% (+LoRAP)

Baseline: FP32 71.7%

Prompt injection kernel latency

Value: 44.5 µs (fused)

Baseline: 93.5 µs (naive)

LoRA

Value: 0.37 ms

Baseline: FP32 0.69 ms

Training time (A2Q)

Value: 1048.08 s (A2Q + GPF-LoRAP)

Baseline: 1040.29 s (A2Q)
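The derived figures quoted in this summary (the 2.1× kernel speedup, the +7.79 s training overhead, and the accuracy gains) can be checked from the raw values above. The snippet is a plain arithmetic check and introduces no new measurements.

```python
# Values copied from the Results entries above.
naive_us, fused_us = 93.5, 44.5
base_s, prompted_s = 1040.29, 1048.08

speedup = round(naive_us / fused_us, 1)
extra_s = round(prompted_s - base_s, 2)
overhead_pct = round(100 * (prompted_s - base_s) / base_s, 2)

# (value, baseline) pairs for the three accuracy entries, in order.
acc_pairs = [(78.5, 77.6), (69.6, 52.4), (73.6, 71.7)]
gains = [round(value - baseline, 1) for value, baseline in acc_pairs]

print(speedup, extra_s, overhead_pct, gains)
```

The training overhead works out to well under 1% of the A2Q baseline time, and the second accuracy gain equals the +17.2 figure reported for REDDIT-BINARY under QAT-W4A4.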

Who Should Care

What To Try In 7 Days

Run a baseline INT4 QAT pipeline on one GNN task and record accuracy/latency.

Add GPF-plus (node prompts) and LoRAP (aggregation prompts) with k≈40, r≈2 and retrain.

Measure accuracy recovery and per-layer latency; enable the fused Triton kernel, if available, for production speedups.
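The measurement steps above can be supported by two small helpers, sketched here under stated assumptions. `median_latency_us` and `recovery` are hypothetical names, the timing is CPU wall-clock only (GPU code would also need a device synchronization before each clock read), and the accuracy numbers in the usage lines are placeholders rather than results from the paper.

```python
import statistics
import time

def median_latency_us(fn, warmup=5, repeats=50):
    """Median wall-clock latency of fn() in microseconds, after a few warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)

def recovery(acc_fp32, acc_quant, acc_prompted):
    """Fraction of the FP32-to-quantized accuracy gap closed by prompting."""
    gap = acc_fp32 - acc_quant
    return (acc_prompted - acc_quant) / gap if gap else float("nan")

# Placeholder workload standing in for one GNN layer's forward pass.
latency = median_latency_us(lambda: sum(i * i for i in range(1000)))

# Placeholder accuracies: FP32, INT4 baseline, INT4 with prompts.
closed = recovery(80.0, 60.0, 75.0)
print(f"{latency:.1f} us, gap closed: {closed:.2f}")
```

A recovery value above 1.0 would correspond to the prompted low-bit model exceeding the FP32 baseline.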

Optimization Features

Infra Optimization

  • works with standard GPUs and Triton

Model Optimization

  • post-aggregation correction
  • low-rank prompt bases

System Optimization

  • kernel fusion to reduce DRAM accesses
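Why fusion reduces DRAM accesses can be seen with a back-of-envelope count. The model below is an assumption, not the paper's analysis: it supposes a naive implementation runs three separate kernels (scores, prompt, add) and writes each intermediate to DRAM, while a fused kernel reads the aggregated features once and writes the output once. The small prompt parameters are ignored, and `n`, `d`, `k` are hypothetical sizes.

```python
def naive_elements(n, d, k):
    """Elements moved through DRAM when intermediates are materialized."""
    scores = n * d + n * k   # read features, write k scores per node
    prompt = n * k + n * d   # read scores, write d-dim prompt per node
    add = 2 * n * d + n * d  # read features and prompt, write output
    return scores + prompt + add

def fused_elements(n, d):
    """Elements moved when scores and prompts stay in on-chip memory."""
    return 2 * n * d         # read features once, write output once

n, d, k = 1000, 64, 40       # illustrative sizes only
ratio = naive_elements(n, d, k) / fused_elements(n, d)
print(ratio)
```

This ratio describes memory traffic under the assumed kernel split, not latency, so it should not be expected to match the measured 2.1× speedup.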

Training Optimization

  • jointly optimize prompts and quantized weights
  • small extra training cost

Inference Optimization

  • fused Triton kernel to reduce memory traffic
  • keep activations/weights low-bit

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LoRAP uses high-precision prompt generation, which adds a small amount of FP32 work and storage.
  • Approximation quality is limited by the prompt rank k; a k that is too small may underfit the quantization error.
  • Node prompting (pre-aggregation) can be unstable for some aggregations; LoRAP requires implementation changes.
  • The best latency requires the fused Triton kernel; without it, the overhead is larger.

When Not To Use

  • When full-precision FP32 models run fine and memory and speed are not constrained.
  • When you cannot retrain or lack a QAT pipeline in which to jointly optimize the prompts.
  • When the target hardware cannot run fused GPU kernels or does not support mixed precision.

Failure Modes

  • Poor k/r choices lead to under- or over-correction and an accuracy drop.
  • EdgePrompt+ style additions can hurt performance if combined incorrectly.
  • If prompt injection is not fused, the extra memory traffic can erase the speed gains.

Core Entities

Models

  • GIN
  • GCN
  • GAT
  • LoRA
  • GPF-plus

Metrics

  • Accuracy
  • Mean Absolute Error (MAE)
  • Latency (ms / µs)
  • Training time (s)

Datasets

  • Cora
  • CiteSeer
  • PubMed
  • ogbn-arxiv
  • ogbn-products
  • ogbn-mag
  • MNIST (superpixel)
  • CIFAR-10 (superpixel)
  • REDDIT-BINARY
  • ZINC

Benchmarks

  • OGB (ogbn-arxiv, ogbn-products, ogbn-mag)
  • REDDIT-BINARY