Forecast future facts on temporal knowledge graphs using LLM in‑context learning with no fine‑tuning.

Overview

Decision Snapshot: Ready For Pilot

The paper shows robust experimental evidence across several public TKG datasets that ICL can match supervised baselines in many cases, but it uses limited model sizes and is evaluated only in a transductive setting, where candidate answers already appear in the observed history.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Dong-Ho Lee, Kian Ahrabian, Woojeong Jin, Fred Morstatter, Jay Pujara

Why It Matters For Business

You can forecast structured future events from past facts using off‑the‑shelf LLMs without costly retraining, which speeds deployment and reduces model maintenance.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The authors convert temporal knowledge graph (TKG) forecasting into an in‑context learning (ICL) problem for large language models (LLMs). They serialize historical graph facts into structured prompts and use the model's token probabilities to rank candidate future facts. Across WIKI, YAGO, ICEWS14/18 and an ACLED slice, open LLMs (e.g., GPT‑NeoX) come close to supervised SOTA (a Hits@1 gap of -3.6% to +1.5% against the median supervised model) and beat simple frequency/recency baselines by large margins. Replacing entity/relation names with numeric IDs barely changes results, implying LLMs mainly exploit symbolic patterns in the prompt rather than prior semantic knowledge.

Problem Statement

Temporal knowledge graph forecasting asks: given past time‑stamped facts, predict missing future facts. Current methods need supervised training and custom architectures. The paper asks whether pre‑trained LLMs, given only in‑context examples built from the history, can forecast future links without any fine‑tuning.

Main Contribution

A simple three‑stage ICL pipeline that (1) retrieves relevant past facts, (2) serializes them into structured prompts (index or lexical), and (3) decodes LLM token probabilities to score candidate entities.

Large experimental comparison showing pre‑trained LLMs (GPT2/J/NeoX and gpt‑3.5‑turbo) match or nearly match supervised TKG models on common benchmarks without training.
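The first two stages of the pipeline can be sketched as below. This is a minimal illustration, not the authors' code: the `Fact` type, the function names, the retrieval rule (same subject and relation, most recent first) and the exact prompt template are assumptions made for this example. The third stage, reading the model's token probabilities for the numeric labels, depends on the LLM API in use and is left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One time-stamped fact (subject, relation, object, time). Hypothetical type."""
    subject: str
    relation: str
    obj: str
    time: int


def retrieve_history(facts, subject, relation, query_time, max_len=20):
    """Stage 1 (assumed rule): keep the latest facts sharing the query's subject and relation."""
    matching = [
        f for f in facts
        if f.subject == subject and f.relation == relation and f.time < query_time
    ]
    matching.sort(key=lambda f: f.time)
    return matching[-max_len:]


def build_index_prompt(history, subject, relation, query_time):
    """Stage 2: serialize history so every distinct object gets a numeric label.

    The template below is an assumption loosely modeled on an 'index' style prompt.
    Returns the prompt text and the object -> label mapping.
    """
    labels = {}
    lines = []
    for f in history:
        # Reuse the label if the object was seen before, otherwise assign the next integer.
        label = labels.setdefault(f.obj, len(labels))
        lines.append(f"{f.time}: [{f.subject}, {f.relation}, {label}. {f.obj}]")
    # The final line is left open so the model completes it with a numeric label.
    lines.append(f"{query_time}: [{subject}, {relation},")
    return "\n".join(lines), labels
```

The returned `labels` mapping is what lets a predicted numeric token be translated back into a candidate entity.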

Key Findings

Pretrained LLMs (ICL) reach near‑SOTA forecasting performance without fine‑tuning.

Numbers: LLM Hits@1 gap vs median supervised: -3.6% to +1.5%

Practical Use: You can skip expensive task‑specific training for some TKG tasks by using ICL with a large model and structured history prompts.

Evidence Ref: Abstract; Section 5; Table 5

LLMs outperform simple heuristics based on frequency or recency by a meaningful margin.

Numbers: ICL > heuristics by +10% to +28% Hits@1

Practical Use: In practice, LLM prompts capture patterns beyond just repeating the most frequent or most recent past answer; use ICL instead of naive baselines.

Evidence Ref: Abstract; Section 5.1; Table 5/Table 10
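For reference, the two heuristics named above can be approximated in a few lines. This is a sketch under assumptions: the function names are hypothetical, the input is the list of past answers in chronological order, and the tie-breaking rule (more recent wins) is a choice made here that may differ from the paper's exact baseline definition.

```python
from collections import Counter


def frequency_baseline(history_objects):
    """Rank candidates by how often they appear in the history, most frequent first.

    Ties are broken in favor of the more recently seen object (an assumption).
    """
    counts = Counter(history_objects)
    last_seen = {obj: i for i, obj in enumerate(history_objects)}
    return sorted(counts, key=lambda o: (-counts[o], -last_seen[o]))


def recency_baseline(history_objects):
    """Rank candidates by their latest occurrence, most recent first."""
    last_seen = {obj: i for i, obj in enumerate(history_objects)}
    return sorted(last_seen, key=lambda o: -last_seen[o])


past_answers = ["X", "Y", "X", "Z"]  # chronological order, oldest first
print(frequency_baseline(past_answers))  # ['X', 'Z', 'Y']
print(recency_baseline(past_answers))    # ['Z', 'X', 'Y']
```

Both functions return a full ranking, so they can be scored with the same Hits@k metric as the LLM predictions.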

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Hits@1 (single-step)	GPT-NeoX (Entity) 0.784 on YAGO	Timetraveler 0.845 (best supervised on YAGO)	-6.1% vs best supervised; within -3.6% to +1.5% vs median	YAGO (single-step)	Table 5; single-step block	Table 5
Hits@1 vs heuristics	GPT-NeoX (Entity) 0.324 vs frequency 0.243 on ICEWS14	frequency heuristic 0.243 (single-step)	+8.1 ppt (≈ +33% rel.)	ICEWS14 (single-step)	Table 5 and Table 10	Table 10
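The Delta column mixes absolute gaps (percentage points) and relative gaps, so it helps to recompute both from the Value and Baseline columns. The helper below is a hypothetical convenience function, not something from the paper; it only redoes the arithmetic on the two rows above.

```python
def deltas(value, baseline):
    """Return (absolute gap in percentage points, relative gap in percent)."""
    abs_ppt = (value - baseline) * 100
    rel_pct = (value - baseline) / baseline * 100
    return round(abs_ppt, 1), round(rel_pct, 1)


# ICEWS14: GPT-NeoX 0.324 vs frequency heuristic 0.243
print(deltas(0.324, 0.243))  # (8.1, 33.3)

# YAGO: GPT-NeoX 0.784 vs best supervised 0.845
print(deltas(0.784, 0.845))  # (-6.1, -7.2)
```

On these numbers, the "-6.1%" reported for YAGO is an absolute gap in Hits@1 points, which corresponds to roughly -7.2% in relative terms.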

What To Try In 7 Days

Serialize a small historical slice of your domain graph into the paper's 'index' prompt format and call a large pre‑trained LLM to rank candidate next facts.

Compare ICL predictions to simple heuristics (most recent/most frequent) and your existing supervised model on Hits@1 to gauge parity.

If data privacy is a concern, test anonymized numeric IDs in prompts; performance often stays similar.
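The anonymization step in the last suggestion can be tried with a small mapping routine. This is a minimal sketch assuming facts are stored as `(subject, relation, object, time)` tuples; the function name and the ID assignment order (first appearance) are assumptions for illustration.

```python
def anonymize(facts):
    """Replace entity and relation names with numeric IDs.

    facts: iterable of (subject, relation, object, time) tuples (assumed layout).
    Returns the anonymized facts plus the two name -> ID mappings.
    """
    entity_ids, relation_ids = {}, {}
    anonymized = []
    for subject, relation, obj, time in facts:
        # IDs are assigned in order of first appearance.
        s_id = entity_ids.setdefault(subject, len(entity_ids))
        r_id = relation_ids.setdefault(relation, len(relation_ids))
        o_id = entity_ids.setdefault(obj, len(entity_ids))
        anonymized.append((s_id, r_id, o_id, time))
    return anonymized, entity_ids, relation_ids


facts = [
    ("France", "host", "Olympics", 2024),
    ("Olympics", "held_in", "France", 2024),
]
anon, entity_ids, relation_ids = anonymize(facts)
print(anon)  # [(0, 0, 1, 2024), (1, 1, 0, 2024)]
```

Keeping `entity_ids` and `relation_ids` on your side lets you translate predicted IDs back into names without ever sending the names to the model.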

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/usc-isi-i2/isi-tkg-icl

Data URLs

WIKI (Leblay and Chekol 2018)

YAGO (Mahdisoltani et al. 2014)

ICEWS14/ICEWS18 (García‑Durán et al. 2018)

ACLED: https://data.humdata.org/organization/acled

Risks & Boundaries

Limitations

Experiments limited to small/medium open models due to compute; results may change with larger or different models.

Method assumes candidate answers appear in the observed history (transductive setting) and does not handle unseen entities (the inductive setting).

When Not To Use

When answers can be entities never observed in history (inductive setting with unseen future entities).

When you require calibrated probability estimates for downstream decision making.

Failure Modes

Top‑token decoding may omit the numeric label of the correct answer; the paper sets rank=100 for such missing tokens, producing false negatives.

Accumulation of errors in multi‑step mode when the model's predictions are re‑fed as history.
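The first failure mode is easier to see in code. The sketch below assumes the LLM API returns a small dictionary of top tokens and their probabilities for the next position; that input shape, the function names and the whitespace handling are assumptions for illustration. Only the fallback rank of 100 comes from the summary above.

```python
MISSING_RANK = 100  # fallback rank used when the gold label is not among the decoded tokens


def rank_of_gold(top_token_probs, gold_label):
    """Rank the gold numeric label among the numeric tokens the model returned.

    top_token_probs: dict mapping token string -> probability (assumed API output).
    Non-numeric tokens are ignored; a missing gold label gets MISSING_RANK.
    """
    best = {}
    for token, prob in top_token_probs.items():
        label = token.strip()
        if label.isdigit():
            # Tokens such as "1" and " 1" collapse to one label; keep the higher probability.
            best[label] = max(prob, best.get(label, 0.0))
    ranked = sorted(best, key=best.get, reverse=True)
    gold = str(gold_label)
    return ranked.index(gold) + 1 if gold in ranked else MISSING_RANK


def hits_at_k(ranks, k):
    """Fraction of queries whose gold answer is ranked within the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


top_tokens = {"0": 0.5, " 1": 0.3, "the": 0.2}
ranks = [
    rank_of_gold(top_tokens, 0),    # 1
    rank_of_gold(top_tokens, 1),    # 2
    rank_of_gold({"0": 0.9}, 3),    # 100: label 3 was never decoded
]
print(hits_at_k(ranks, 1), hits_at_k(ranks, 3))
```

A query whose gold label falls outside the returned tokens counts as a miss at every k below 100, even if the model assigned it a non-trivial probability further down the distribution.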

Core Entities

Models

GPT2, gpt-j-6b, gpt-neox-20b, gpt-3.5-turbo, GPT-NeoX, GPT-J

Metrics

Hits@1, Hits@3, Hits@10, Time-aware filter

Datasets

WIKI, YAGO, ICEWS14, ICEWS18, ACLED-CD22

Benchmarks

Temporal Knowledge Graph (TKG) forecasting
