Forecast future facts on temporal knowledge graphs using LLM in‑context learning with no fine‑tuning.

May 17, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper reports consistent experimental evidence across several public TKG datasets that ICL can match supervised baselines in many cases, but it uses limited model sizes and is evaluated in a transductive setting only (answers must appear in the observed history).

Citations: 8

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Dong-Ho Lee, Kian Ahrabian, Woojeong Jin, Fred Morstatter, Jay Pujara

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can forecast structured future events from past facts using off‑the‑shelf LLMs without costly retraining, which speeds deployment and reduces model maintenance.

Summary TLDR

The authors convert temporal knowledge graph (TKG) forecasting into an in-context learning (ICL) problem for large language models (LLMs). They turn historical graph facts into structured prompts and decode model token probabilities to rank candidate future facts. Across WIKI, YAGO, ICEWS14/18 and an ACLED slice, open LLMs (e.g., GPT-NeoX) come close to supervised SOTA (a Hits@1 gap of -3.6% to +1.5% relative to the median supervised model) and beat simple frequency/recency baselines by large margins. Replacing entity/relation names with numeric IDs barely changes results, implying that LLMs mainly exploit symbolic patterns in the prompt rather than prior semantic knowledge.
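The ranking step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the model returns a list of top tokens with log-probabilities, that each candidate entity was given a numeric label in the prompt, and it borrows the rank-100 fallback for missing answers that this summary lists under Failure Modes. The names `rank_of_gold`, `top_tokens`, and `label_to_entity` are hypothetical.

```python
MISSING_RANK = 100  # fallback rank when the gold answer is not among the decoded tokens

def rank_of_gold(top_tokens, label_to_entity, gold_entity):
    """Return the 1-based rank of gold_entity among decoded candidate labels.

    top_tokens: list of (token, logprob) pairs, sorted by descending probability.
    label_to_entity: maps a numeric label string (e.g. "0") to an entity name.
    """
    rank = 0
    seen = set()
    for token, _logprob in top_tokens:
        entity = label_to_entity.get(token.strip())
        if entity is None or entity in seen:
            continue  # skip tokens that are not candidate labels, and duplicates
        seen.add(entity)
        rank += 1
        if entity == gold_entity:
            return rank
    return MISSING_RANK


top = [(" 1", -0.2), ("\n", -1.0), (" 0", -2.0)]
labels = {"0": "St Louis", "1": "Baltimore"}
print(rank_of_gold(top, labels, "St Louis"))
print(rank_of_gold(top, labels, "Denver"))
```

Tokens that do not map to a candidate label are ignored, so formatting tokens such as newlines do not consume a rank position.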

Problem Statement

Temporal knowledge graph forecasting asks: given past time-stamped facts, predict missing future facts. Current methods need supervised training and custom architectures. The paper asks whether pre-trained LLMs, using only in-context examples built from that history, can forecast future links without any fine-tuning.

Main Contribution

A simple three‑stage ICL pipeline that (1) retrieves relevant past facts, (2) serializes them into structured prompts (index or lexical), and (3) decodes LLM token probabilities to score candidate entities.

Large experimental comparison showing pre‑trained LLMs (GPT2/J/NeoX and gpt‑3.5‑turbo) match or nearly match supervised TKG models on common benchmarks without training.
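A rough sketch of the first two pipeline stages (retrieval and serialization) is shown here. The line template `time: [subject, relation, label. object]` and the helper names are assumptions made for illustration; the paper's exact prompt format and retrieval rules may differ. Setting `use_index=True` swaps names for numeric IDs, which corresponds to the 'index' prompt style mentioned in this summary.

```python
from typing import NamedTuple

class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str
    time: int

def retrieve_history(facts, subject, relation, query_time, max_len=5):
    """Stage 1: keep the latest past facts sharing the query's subject and relation."""
    matches = [f for f in facts
               if f.subject == subject and f.relation == relation
               and f.time < query_time]
    matches.sort(key=lambda f: f.time)
    return matches[-max_len:]

def build_prompt(history, subject, relation, query_time, use_index=False):
    """Stage 2: one line per fact, ending with an open line for the query."""
    ent_ids, rel_ids = {}, {}  # name -> numeric ID (index mode only)
    labels = {}                # candidate object -> numeric label to decode

    def ent(x):
        return str(ent_ids.setdefault(x, len(ent_ids))) if use_index else x

    def rel(x):
        return str(rel_ids.setdefault(x, len(rel_ids))) if use_index else x

    lines = []
    for f in history:
        label = labels.setdefault(f.obj, len(labels))
        lines.append(f"{f.time}: [{ent(f.subject)}, {rel(f.relation)}, {label}. {ent(f.obj)}]")
    lines.append(f"{query_time}: [{ent(subject)}, {rel(relation)},")
    return "\n".join(lines), labels


facts = [
    Fact("Superbowl", "Champion", "St Louis", 2000),
    Fact("Superbowl", "Champion", "Baltimore", 2001),
]
history = retrieve_history(facts, "Superbowl", "Champion", 2002)
prompt, labels = build_prompt(history, "Superbowl", "Champion", 2002)
print(prompt)
```

The returned `labels` mapping is what a decoding step would use to translate a generated numeric label back into a candidate entity.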

Key Findings

Pretrained LLMs (ICL) reach near‑SOTA forecasting performance without fine‑tuning.

Numbers: LLM Hits@1 gap vs median supervised is -3.6% to +1.5%

Practical Use: You can skip expensive task-specific training for some TKG tasks by using ICL with a large model and structured history prompts.

Evidence Ref: Abstract; Section 5; Table 5

LLMs outperform simple heuristics based on frequency or recency by a meaningful margin.

Numbers: ICL > heuristics by +10% to +28% Hits@1

Practical Use: In practice, LLM prompts capture patterns beyond just repeating the most frequent or most recent past answer; use ICL instead of naive baselines.

Evidence Ref: Abstract; Section 5.1; Table 5/Table 10
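For reference, the two naive baselines can be approximated in a few lines. This is a simplified sketch under the assumption that the history for a query is just the list of past answers ordered oldest to newest; the function names are hypothetical and the paper's exact baseline definitions may differ.

```python
from collections import Counter

def frequency_baseline(history_objects):
    """Rank candidates by how often each appeared as the answer in history."""
    counts = Counter(history_objects)
    return [obj for obj, _count in counts.most_common()]

def recency_baseline(history_objects):
    """Rank candidates by last appearance, most recent first."""
    ranked = []
    for obj in reversed(history_objects):
        if obj not in ranked:
            ranked.append(obj)
    return ranked


past_answers = ["A", "B", "A", "C"]  # oldest to newest
print(frequency_baseline(past_answers))
print(recency_baseline(past_answers))
```

Comparing an LLM's ranking against both lists on the same queries shows whether the model does more than echo the most frequent or most recent answer.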

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Hits@1 (single-step) | GPT-NeoX (Entity) 0.784 on YAGO | Timetraveler 0.845 (best supervised on YAGO) | -6.1 ppt (≈ -7.2% rel.) vs best supervised; within -3.6% to +1.5% vs median | YAGO (single-step) | Table 5; single-step block | Table 5
Hits@1 vs heuristics | GPT-NeoX (Entity) 0.324 vs frequency 0.243 on ICEWS14 | frequency heuristic 0.243 (single-step) | +8.1 ppt (≈ +33% rel.) | ICEWS14 (single-step) | Table 5 and Table 10 | Table 10
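The deltas in this table follow directly from the reported Hits@1 values. The short calculation below separates the absolute difference in percentage points from the relative difference in percent; the helper name `deltas` is just for illustration.

```python
def deltas(value, baseline):
    """Return (absolute delta in percentage points, relative delta in percent)."""
    abs_ppt = (value - baseline) * 100
    rel_pct = (value - baseline) / baseline * 100
    return round(abs_ppt, 1), round(rel_pct, 1)


# YAGO: GPT-NeoX (Entity) vs Timetraveler
print(deltas(0.784, 0.845))  # (-6.1, -7.2)
# ICEWS14: GPT-NeoX (Entity) vs frequency heuristic
print(deltas(0.324, 0.243))  # (8.1, 33.3)
```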

What To Try In 7 Days

Serialize a small historical slice of your domain graph into the paper's 'index' prompt format and call a large pre‑trained LLM to rank candidate next facts.

Compare ICL predictions to simple heuristics (most recent/most frequent) and your existing supervised model on Hits@1 to gauge parity.

If data privacy is a concern, test anonymized numeric IDs in prompts; performance often stays similar.
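A small Hits@k helper is enough to run the comparison suggested above. It assumes each system produces a ranked candidate list per query and that there is one gold answer per query; the function name is hypothetical.

```python
def hits_at_k(ranked_predictions, gold_answers, k=1):
    """Fraction of queries whose gold answer appears in the top-k predictions."""
    if not gold_answers:
        raise ValueError("gold_answers must not be empty")
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_answers)
        if gold in ranked[:k]
    )
    return hits / len(gold_answers)


predictions = [["A", "B"], ["C", "A"], ["B"]]
gold = ["A", "A", "C"]
print(hits_at_k(predictions, gold, k=1))
print(hits_at_k(predictions, gold, k=2))
```

Running the same helper over the ICL rankings, the heuristic rankings, and an existing supervised model's rankings gives directly comparable numbers.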

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

WIKI (Leblay and Chekol 2018)
YAGO (Mahdisoltani et al. 2014)
ICEWS14/ICEWS18 (García-Durán et al. 2018)
ACLED: https://data.humdata.org/organization/acled

Risks & Boundaries

Limitations

Experiments limited to small/medium open models due to compute; results may change with larger or different models.

Method assumes candidate answers appear in observed histories (transductive setting) and does not handle unseen entities (the inductive setting).

When Not To Use

When answers can be entities never observed in history (the inductive setting).

When you require calibrated probability estimates for downstream decision making.

Failure Modes

Top-token decoding may omit the numeric label of the correct answer; the paper sets rank=100 for missing tokens, which produces false negatives.

Accumulation of errors in multi‑step mode when the model's predictions are re‑fed as history.
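The multi-step failure mode is easiest to see in a rollout loop. The sketch below assumes a hypothetical `predict(history, t)` callable standing in for the full prompt-and-decode step; it is not the paper's implementation.

```python
def multi_step_forecast(history, query_times, predict):
    """Roll forward over query_times, feeding each prediction back as history.

    history: list of (time, answer) pairs, oldest first.
    predict: hypothetical callable (history, t) -> predicted answer.
    """
    rolling = list(history)
    predictions = []
    for t in query_times:
        guess = predict(rolling, t)
        predictions.append((t, guess))
        # The prediction, not the ground truth, becomes part of the history,
        # so one wrong guess can influence every later step.
        rolling.append((t, guess))
    return predictions


def repeat_last(history, t):
    """Toy predictor: repeat the most recent answer."""
    return history[-1][1]

print(multi_step_forecast([(2000, "A"), (2001, "B")], [2002, 2003], repeat_last))
```

In single-step mode the loop would append the observed fact for each timestamp instead of the guess, which is why the two modes can diverge over long horizons.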

Core Entities

Models

GPT2, gpt-j-6b, gpt-neox-20b, gpt-3.5-turbo, GPT-NeoX, GPT-J

Metrics

Hits@1, Hits@3, Hits@10, Time-aware filter

Datasets

WIKI, YAGO, ICEWS14, ICEWS18, ACLED-CD22

Benchmarks

Temporal Knowledge Graph (TKG) forecasting