Inject low-rank, input-dependent prompts into aggregated features to recover accuracy of low-bit quantized GNNs

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

Authors

Chenyu Liu, Haige Li, Luca Rossi

Links

Abstract / PDF

Why It Matters For Business

LoRAP lets teams deploy low-bit GNNs with much of the original accuracy retained while keeping memory and speed gains; it is a small, trainable add-on compatible with existing QAT pipelines.

Summary TLDR

Quantizing Graph Neural Networks (GNNs) to low bit-widths hurts accuracy because feature quantization errors accumulate during neighbor aggregation. LoRAP inserts small, input-dependent, low-rank prompts after aggregation to correct those errors directly. Across 4 quantization-aware training (QAT) frameworks and many datasets, combining LoRAP with node prompts (GPF-LoRAP) consistently recovers accuracy for INT4 and other low-bit GNNs, sometimes exceeding FP32, while adding little compute and memory when a fused GPU kernel is used.

Problem Statement

Low-bit quantization of GNNs saves memory and improves speed but causes large accuracy drops because quantized node features create biased aggregated messages. Pre-aggregation node prompts cannot reliably fix topology-amplified errors. What is needed is a lightweight, input-aware way to correct aggregation-level quantization error during training.
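To make the accumulation effect concrete, here is a minimal toy sketch in plain Python. It is not taken from the paper: the uniform 4-bit quantizer, the feature range [-1, 1], the sum aggregation, and the helper names `fake_quantize` and `aggregation_error` are all assumptions chosen for illustration. It only shows that the error of a sum over quantized neighbor features grows with node degree; it does not model the bias effect described above.

```python
import random

def fake_quantize(x, bits=4, lo=-1.0, hi=1.0):
    """Uniform fake quantization: snap x to one of 2**bits levels in [lo, hi]."""
    step = (hi - lo) / (2 ** bits - 1)
    clipped = min(max(x, lo), hi)
    return lo + round((clipped - lo) / step) * step

def aggregation_error(degree, bits=4, trials=2000, seed=0):
    """Mean absolute error of a sum aggregation over `degree` quantized neighbors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        feats = [rng.uniform(-1.0, 1.0) for _ in range(degree)]
        exact = sum(feats)
        approx = sum(fake_quantize(f, bits) for f in feats)
        total += abs(exact - approx)
    return total / trials

# Error of the aggregated message for nodes of increasing degree.
for degree in (1, 4, 16, 64):
    print(degree, round(aggregation_error(degree), 4))
```

In this toy setting a single feature is off by at most half a quantization step, yet the aggregated message of a high-degree node drifts much further, which is why a correction applied after aggregation can target the error where it is largest.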

Main Contribution

Introduce LoRAP: post-aggregation, input-dependent prompts built from a small set of low-rank basis vectors.

Show theoretically that post-aggregation prompts decouple the correction from the graph operator and allow node-specific bias correction.

Integrate LoRAP into existing QAT pipelines and provide a fused Triton kernel to cut prompt-injection latency.

Empirically validate across 4 QAT methods, 3 GNN architectures, and large/small datasets that LoRAP consistently recovers low-bit performance.
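The idea of a post-aggregation, input-dependent, low-rank prompt can be sketched as follows. This is a hedged illustration, not the authors' implementation: the class name `LowRankPostAggPrompt`, the softmax scoring over k bases, the factorization of the bases as a k×r matrix times an r×d matrix, and the reading of k as the number of bases and r as the rank are all assumptions made for this example.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def matvec(mat, vec):
    """(n x m) matrix times length-m vector -> length-n vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]

def vecmat(vec, mat):
    """Length-n vector times (n x m) matrix -> length-m vector."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]

class LowRankPostAggPrompt:
    """Hypothetical sketch: adds an input-dependent prompt to an aggregated feature."""

    def __init__(self, dim, k=40, r=2, seed=0):
        rng = random.Random(seed)
        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
        self.score = init(k, dim)  # assumed scoring projection: feature -> k scores
        self.U = init(k, r)        # low-rank factor: k bases x rank r
        self.V = init(r, dim)      # low-rank factor: rank r x feature dim

    def prompt(self, h):
        alpha = softmax(matvec(self.score, h))  # input-dependent mixing weights
        coeff = vecmat(alpha, self.U)           # length r
        return vecmat(coeff, self.V)            # length dim

    def __call__(self, h):
        # Applied AFTER neighbor aggregation: h is the aggregated feature.
        return [hi + pi for hi, pi in zip(h, self.prompt(h))]

layer_prompt = LowRankPostAggPrompt(dim=8, k=4, r=2)
corrected = layer_prompt([0.5] * 8)
print(len(corrected))
```

Because the mixing weights depend on the aggregated feature itself, two nodes with different aggregated messages receive different corrections, which is the node-specific behavior the contributions describe. In a real pipeline the three matrices would be trained jointly with the quantized weights.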

Key Findings

GPF-LoRAP can recover severe INT4 accuracy losses on small benchmarks.

Numbers: REDDIT-BINARY, QAT-W4A4: +17.2% accuracy

LoRAP sometimes surpasses full-precision accuracy on evaluated tasks.

Numbers: Cora, GIN A2Q + GPF-LoRAP: 78.5% vs FP32 77.6%

Fused LoRAP kernel halves prompt injection latency versus naive implementation.

Numbers: prompt kernel 93.5 µs → 44.5 µs (2.1×)

Training overhead of LoRAP is small in practice.

Numbers: A2Q training 1040.29 s → 1048.08 s with GPF-LoRAP (+7.79 s)

Results

Accuracy
Value: 78.5%
Baseline: FP32 77.6%

Accuracy
Value: 69.6% (GPF-LoRAP)
Baseline: None 52.4%

Accuracy
Value: 73.6% (+LoRAP)
Baseline: FP32 71.7%

Prompt injection kernel latency
Value: 44.5 µs (fused)
Baseline: 93.5 µs (naive)

LoRA
Value: 0.37 ms
Baseline: FP32 0.69 ms

Training time (A2Q)
Value: 1048.08 s (A2Q + GPF-LoRAP)
Baseline: 1040.29 s (A2Q)
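The relative figures quoted alongside these results can be rechecked with a few lines of arithmetic. The snippet below uses only numbers listed in this section; the variable names are illustrative.

```python
# Prompt injection kernel latency (microseconds).
naive_us, fused_us = 93.5, 44.5
speedup = naive_us / fused_us

# A2Q training time (seconds), without and with GPF-LoRAP.
base_s, lorap_s = 1040.29, 1048.08
overhead_s = lorap_s - base_s
overhead_pct = 100.0 * overhead_s / base_s

# Accuracy of GPF-LoRAP versus the unprompted baseline (percent).
gain_points = 69.6 - 52.4

print(f"kernel speedup: {speedup:.1f}x")
print(f"training overhead: {overhead_s:.2f} s ({overhead_pct:.2f}%)")
print(f"accuracy gain: {gain_points:.1f} points")
```

The training overhead works out to well under 1% of the baseline run, and the 17.2-point accuracy gap matches the +17.2 figure quoted under Key Findings.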

Who Should Care

CTO
ML Engineer
Engineering Lead
Product Manager
Data Scientist

What To Try In 7 Days

Run a baseline INT4 QAT pipeline on one GNN task and record accuracy/latency.

Add GPF-plus (node prompts) and LoRAP (aggregation prompts) with k≈40, r≈2 and retrain.

Measure accuracy recovery and per-layer latency; enable fused Triton kernel if available for production speedups.
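For the latency measurements in the steps above, a small timing helper along the following lines may be enough to start with. It is a generic sketch, not part of the LoRAP code: `measure_latency_us` and `dummy_layer` are hypothetical names, and the stand-in layer is plain Python rather than a real GNN layer.

```python
import statistics
import time

def measure_latency_us(fn, *args, warmup=5, repeats=50):
    """Median wall-clock latency of fn(*args) in microseconds."""
    for _ in range(warmup):
        fn(*args)  # warm caches before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)

def dummy_layer(xs):
    # Hypothetical stand-in for one GNN layer's forward pass.
    return [2.0 * x + 1.0 for x in xs]

latency = measure_latency_us(dummy_layer, list(range(1000)))
print(f"median latency: {latency:.1f} us")
```

Running the same helper on the baseline layer and on the prompted layer gives a like-for-like comparison. On a GPU, kernels launch asynchronously, so the device would need to be synchronized before each timer read (for example with `torch.cuda.synchronize()` in PyTorch) for the numbers to be meaningful.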

Optimization Features

Infra Optimization

works with standard GPUs and Triton

Model Optimization

post-aggregation correction
low-rank prompt bases

System Optimization

kernel fusion to reduce DRAM accesses

Training Optimization

jointly optimize prompts and quantized weights
small extra training cost

Inference Optimization

fused Triton kernel to reduce memory traffic
keep activations/weights low-bit

Reproducibility

Code Urls

https://anonymous.4open.science/r/LoRAP-16F3/

Code Available

Data Available

Open Source Status

partial

Risks & Boundaries

Limitations

LoRAP uses high-precision prompt generation, which adds a small amount of FP32 work and storage.
Approximation quality is limited by prompt rank k; too small a k may underfit the quantization error.
Node prompting (pre-aggregation) can be unstable for some aggregations; LoRAP requires implementation changes.
Best latency requires fused Triton kernel; without it, overhead is larger.

When Not To Use

When full-precision FP32 models run fine and memory/speed are not constrained.
When you lack access to retraining or QAT pipeline to jointly optimize prompts.
When target hardware cannot run fused GPU kernels or support mixed precision.

Failure Modes

Poor k/r choices lead to under- or over-correction and accuracy drop.
EdgePrompt+ style additions can hurt performance if combined incorrectly.
If prompts are not fused, memory traffic can erase speed gains.

Core Entities

Models

GIN
GCN
GAT
LoRA
GPF-plus

Metrics

Accuracy
Mean Absolute Error (MAE)
Latency (ms / µs)
Training time (s)

Datasets

Cora
CiteSeer
PubMed
ogbn-arxiv
ogbn-products
ogbn-mag
MNIST (superpixel)
CIFAR-10 (superpixel)
REDDIT-BINARY
ZINC

Benchmarks

OGB (ogbn-arxiv, ogbn-products, ogbn-mag)
REDDIT-BINARY
