Forecast future facts on temporal knowledge graphs using LLM in‑context learning with no fine‑tuning.

May 17, 2023 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

8

Authors

Dong-Ho Lee, Kian Ahrabian, Woojeong Jin, Fred Morstatter, Jay Pujara

Links

Abstract / PDF

Why It Matters For Business

You can forecast structured future events from past facts using off‑the‑shelf LLMs without costly retraining, which speeds deployment and reduces model maintenance.

Summary TLDR

The authors recast temporal knowledge graph (TKG) forecasting as an in‑context learning (ICL) problem for large language models (LLMs). They serialize historical graph facts into structured prompts and use the model's token probabilities to rank candidate future facts. Across WIKI, YAGO, ICEWS14/18 and an ACLED slice, open LLMs (e.g., GPT‑NeoX) come close to supervised SOTA, with a Hits@1 gap of -3.6% to +1.5% relative to the median supervised model, and beat simple frequency/recency baselines by +10% to +28% Hits@1. Replacing entity/relation names with numeric IDs changes Hits@1 by only about ±0.4%, which suggests LLMs mainly exploit symbolic patterns in the prompt rather than prior semantics.

Problem Statement

Temporal knowledge graph forecasting asks: given past time‑stamped facts, predict missing future facts. Current methods need supervised training and custom architectures. The paper asks whether pre‑trained LLMs, given only in‑context examples built from that history, can forecast future links without any fine‑tuning.

Main Contribution

A simple three‑stage ICL pipeline that (1) retrieves relevant past facts, (2) serializes them into structured prompts (index or lexical), and (3) decodes LLM token probabilities to score candidate entities.

A large experimental comparison showing that pre‑trained LLMs (GPT‑2, GPT‑J, GPT‑NeoX and gpt‑3.5‑turbo) match or nearly match supervised TKG models on common benchmarks without training.

A targeted analysis showing LLMs still perform when entity/relation names are replaced with numeric indices, suggesting pattern learning from symbolic sequences rather than semantic priors.
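The first two stages of the pipeline (retrieve, then serialize) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Fact` tuple, the `retrieve_history` and `build_prompt` helpers, the exact line layout, and the sample facts are all assumptions made for the example. The idea it shows is that each candidate object receives a short numeric label, so the model only needs to emit one label token at the open query line.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str
    time: int


def retrieve_history(facts, subject, relation, query_time, max_len=5):
    """Keep the most recent past facts that share the query's subject and relation."""
    past = [
        f for f in facts
        if f.subject == subject and f.relation == relation and f.time < query_time
    ]
    past.sort(key=lambda f: f.time)
    return past[-max_len:]


def build_prompt(history, subject, relation, query_time):
    """Serialize history one fact per line and end with an open query line.

    Returns the prompt text and a mapping from label string to object name.
    The layout is hypothetical and only illustrates the labelling idea.
    """
    labels = {}
    lines = []
    for f in history:
        if f.obj not in labels:
            labels[f.obj] = len(labels)  # first-seen order gives labels 0, 1, 2, ...
        lines.append(f"{f.time}: [{f.subject}, {f.relation}, {labels[f.obj]}. {f.obj}]")
    lines.append(f"{query_time}: [{subject}, {relation},")
    label_to_obj = {str(i): o for o, i in labels.items()}
    return "\n".join(lines), label_to_obj


# Hypothetical toy graph, oldest to newest.
facts = [
    Fact("TeamLeague", "champion", "Alpha", 2019),
    Fact("TeamLeague", "champion", "Beta", 2020),
    Fact("TeamLeague", "champion", "Alpha", 2021),
    Fact("OtherLeague", "champion", "Gamma", 2021),
]
history = retrieve_history(facts, "TeamLeague", "champion", 2022)
prompt, label_to_obj = build_prompt(history, "TeamLeague", "champion", 2022)
print(prompt)
```

The resulting prompt would be sent to the LLM as-is; the third stage then reads the probabilities of the label tokens that follow the open query line.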

Key Findings

Pretrained LLMs (ICL) reach near‑SOTA forecasting performance without fine‑tuning.

Numbers: LLM Hits@1 gap vs median supervised: -3.6% to +1.5%

LLMs outperform simple heuristics based on frequency or recency by a meaningful margin.

Numbers: ICL > heuristics by +10% to +28% Hits@1

Semantic entity names are not required for good ICL performance.

Numbers: Performance change ≈ ±0.4% Hits@1 when using numeric indices
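For reference, the frequency and recency heuristics that the LLMs are compared against can be approximated as below. This is a sketch under assumptions: the function names and the tie-breaking rule (more recent wins) are illustrative choices, and the paper's exact heuristic definitions may differ.

```python
from collections import Counter


def frequency_baseline(history_objects):
    """Rank candidates by how often they appeared; ties go to the more recent one."""
    counts = Counter(history_objects)
    last_seen = {obj: i for i, obj in enumerate(history_objects)}
    return sorted(counts, key=lambda o: (-counts[o], -last_seen[o]))


def recency_baseline(history_objects):
    """Rank candidates by their most recent appearance."""
    last_seen = {obj: i for i, obj in enumerate(history_objects)}
    return sorted(last_seen, key=lambda o: -last_seen[o])


# Hypothetical object history for one (subject, relation) pair, oldest to newest.
history_objects = ["Alpha", "Beta", "Alpha", "Gamma"]
freq_ranking = frequency_baseline(history_objects)
rec_ranking = recency_baseline(history_objects)
print(freq_ranking)  # ['Alpha', 'Gamma', 'Beta']
print(rec_ranking)   # ['Gamma', 'Alpha', 'Beta']
```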

Results

Hits@1 (single-step)

Value: GPT-NeoX (Entity) 0.784 on YAGO

Baseline: TimeTraveler 0.845 (best supervised on YAGO)

Hits@1 vs heuristics

Value: GPT-NeoX (Entity) 0.324 vs frequency 0.243 on ICEWS14

Baseline: frequency heuristic 0.243 (single-step)

Robustness to anonymization

Value: Index vs Lexical change ≈ ±0.4% Hits@1

Baseline: lexical prompts
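The Hits@k numbers above are simple to compute once each query has a rank for its gold answer. The sketch below assumes 1-based ranks and uses a simplified reading of time-aware filtering, in which other entities that are also correct at the query timestamp are removed before ranking; the helper names and the toy scores are hypothetical.

```python
def filtered_rank(scores, gold, other_valid_at_t):
    """1-based rank of the gold entity, ignoring other answers valid at the same timestamp."""
    gold_score = scores[gold]
    better = [
        e for e, s in scores.items()
        if s > gold_score and e not in other_valid_at_t
    ]
    return len(better) + 1


def hits_at_k(ranks, k):
    """Fraction of queries whose gold answer is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(r <= k for r in ranks) / len(ranks)


# Hypothetical scores for one query: "A" outranks the gold "B" but is itself a
# correct answer at this timestamp, so the filter does not count it against "B".
scores = {"A": 0.5, "B": 0.3, "C": 0.2}
raw = filtered_rank(scores, "B", other_valid_at_t=set())
filtered = filtered_rank(scores, "B", other_valid_at_t={"A"})

ranks = [1, 3, 1, 12, 2]  # hypothetical gold ranks over five queries
h1, h3, h10 = hits_at_k(ranks, 1), hits_at_k(ranks, 3), hits_at_k(ranks, 10)
print(raw, filtered, h1, h3, h10)
```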

Who Should Care

What To Try In 7 Days

Serialize a small historical slice of your domain graph into the paper's 'index' prompt format and call a large pre‑trained LLM to rank candidate next facts.

Compare ICL predictions to simple heuristics (most recent/most frequent) and your existing supervised model on Hits@1 to gauge parity.

If data privacy is a concern, test anonymized numeric IDs in prompts; performance often stays similar.
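For the anonymization test, one possible approach is to map every entity and relation name to an integer before building prompts, and keep the reverse map locally to decode predictions. The helper names and the sample triples below are hypothetical; this only illustrates the mapping step, not the paper's preprocessing.

```python
def build_id_maps(quads):
    """Assign integer IDs to entities and relations in first-seen order."""
    ent2id, rel2id = {}, {}
    for s, r, o, _t in quads:
        for e in (s, o):
            ent2id.setdefault(e, len(ent2id))
        rel2id.setdefault(r, len(rel2id))
    return ent2id, rel2id


def anonymize(quads, ent2id, rel2id):
    """Replace names with IDs; timestamps are kept unchanged."""
    return [(ent2id[s], rel2id[r], ent2id[o], t) for s, r, o, t in quads]


# Hypothetical domain facts: (subject, relation, object, time).
quads = [
    ("Acme", "acquires", "Bolt", 2021),
    ("Bolt", "partners_with", "Core", 2022),
]
ent2id, rel2id = build_id_maps(quads)
anonymized = anonymize(quads, ent2id, rel2id)
id2ent = {i: e for e, i in ent2id.items()}  # stays on your side to decode answers
print(anonymized)
```

Only the integer tuples need to leave your environment; a predicted ID is translated back to a name with `id2ent`.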

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments limited to small/medium open models due to compute; results may change with larger or different models.
  • Method assumes candidate answers appear in observed histories (transductive setting) and does not handle unseen entities (inductive setting).
  • Approach can struggle with tokenizers that lack multi‑digit numeric tokens (noted for some model families).

When Not To Use

  • When answers can be entities never observed in history (the inductive setting).
  • When you require calibrated probability estimates for downstream decision making.
  • If your deployment cannot afford repeated LLM API calls or large model inference cost.

Failure Modes

  • Top‑token decoding may omit numeric labels; paper sets rank=100 for missing tokens, producing false negatives.
  • Accumulation of errors in multi‑step mode when the model's predictions are re‑fed as history.
  • Performance depends on careful history selection; including unrelated bidirectional facts can drop accuracy.
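The first failure mode is easier to see with a small decoding sketch. It assumes the LLM returns a dictionary of top-token log-probabilities for the first generated token (many APIs and libraries expose something similar, but the input shape here is an assumption), and it applies the rank=100 fallback described above when the gold label is absent. The function names and sample values are hypothetical.

```python
MISSING_RANK = 100  # fallback rank for gold labels absent from the top tokens


def rank_candidates(top_logprobs, label_to_obj):
    """Turn top-token log-probabilities into a ranked list of candidate objects.

    Tokens that are not candidate labels are ignored. If several tokens strip
    to the same label (e.g. " 0" and "0"), the highest log-probability is kept.
    """
    best = {}
    for token, logprob in top_logprobs.items():
        label = token.strip()
        if label in label_to_obj:
            obj = label_to_obj[label]
            best[obj] = max(best.get(obj, float("-inf")), logprob)
    return sorted(best, key=lambda o: -best[o])


def gold_rank(ranking, gold):
    """1-based rank of the gold object, or MISSING_RANK if it was never scored."""
    return ranking.index(gold) + 1 if gold in ranking else MISSING_RANK


# Hypothetical model output: the label "2" did not make it into the top tokens.
top_logprobs = {" 0": -0.4, " 1": -1.6, " the": -2.3, "0": -3.0}
label_to_obj = {"0": "Alpha", "1": "Beta", "2": "Gamma"}

ranking = rank_candidates(top_logprobs, label_to_obj)
print(ranking)
print(gold_rank(ranking, "Beta"), gold_rank(ranking, "Gamma"))
```

In this toy case "Gamma" might have been the model's third choice, yet it is scored as rank 100 because its label token was cut off, which is how truncated top-token lists turn into false negatives.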

Core Entities

Models

  • GPT2
  • gpt-j-6b
  • gpt-neox-20b
  • gpt-3.5-turbo
  • GPT-NeoX
  • GPT-J

Metrics

  • Hits@1
  • Hits@3
  • Hits@10
  • Time-aware filter

Datasets

  • WIKI
  • YAGO
  • ICEWS14
  • ICEWS18
  • ACLED-CD22

Benchmarks

  • Temporal Knowledge Graph (TKG) forecasting