Overview
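The paper presents experimental evidence across several public temporal knowledge graph (TKG) datasets that in-context learning (ICL) can match supervised baselines in many cases. It is limited to small and medium model sizes and is evaluated only in a transductive setting, where every candidate answer already appears in the observed history.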
Citations: 8
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
You can forecast structured future events from past facts using off‑the‑shelf LLMs without costly retraining, which speeds deployment and reduces model maintenance.
Summary TLDR
The authors recast temporal knowledge graph (TKG) forecasting as an in-context learning (ICL) problem for large language models (LLMs). They serialize historical graph facts into structured prompts and read off the model's token probabilities to rank candidate future facts. Across WIKI, YAGO, ICEWS14/18 and an ACLED slice, open LLMs (e.g., GPT-NeoX) perform close to supervised state-of-the-art models (a Hits@1 gap of -3.6% to +1.5% relative to the median supervised model) and beat simple frequency and recency baselines by large margins. Replacing entity and relation names with numeric IDs barely changes results, which suggests that LLMs mainly exploit symbolic patterns in the prompt rather than prior semantic knowledge.
Problem Statement
Temporal knowledge graph forecasting asks: given past time-stamped facts, predict missing future facts. Current methods need supervised training and custom architectures. The paper asks whether pre-trained LLMs, given only in-context examples built from the history, can forecast future links without any fine-tuning.
Main Contribution
A simple three‑stage ICL pipeline that (1) retrieves relevant past facts, (2) serializes them into structured prompts (index or lexical), and (3) decodes LLM token probabilities to score candidate entities.
A large experimental comparison showing that pre-trained LLMs (GPT-2, GPT-J, GPT-NeoX and gpt-3.5-turbo) match or nearly match supervised TKG models on common benchmarks without training.
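The serialize-and-score stages of the pipeline can be illustrated with a minimal sketch. The bracketed line layout, the function names `build_index_prompt` and `rank_candidates`, and the toy data are assumptions made for illustration, not the paper's exact prompt or code. No LLM is called here: `label_probs` stands in for the probabilities a model would assign to each numeric label token.

```python
from typing import Dict, List, Tuple

Quad = Tuple[int, str, str, str]  # (timestamp, subject, relation, object)


def build_index_prompt(
    history: List[Quad], query: Tuple[int, str, str]
) -> Tuple[str, Dict[str, str]]:
    """Serialize history into one line per fact, giving each distinct
    object a numeric label, and end with an open line for the query."""
    label_of: Dict[str, str] = {}
    lines: List[str] = []
    for t, s, r, o in sorted(history, key=lambda q: q[0]):
        if o not in label_of:
            label_of[o] = str(len(label_of))
        lines.append(f"{t}: [{s}, {r}, {label_of[o]}. {o}]")
    tq, sq, rq = query
    lines.append(f"{tq}: [{sq}, {rq},")  # the model continues with a label
    entity_of = {label: ent for ent, label in label_of.items()}
    return "\n".join(lines), entity_of


def rank_candidates(
    label_probs: Dict[str, float], entity_of: Dict[str, str]
) -> List[str]:
    """Map label probabilities back to entities, highest probability first.
    Labels that do not correspond to a known candidate are ignored."""
    scored = [(p, entity_of[l]) for l, p in label_probs.items() if l in entity_of]
    return [ent for _, ent in sorted(scored, key=lambda x: -x[0])]


# Toy history with hypothetical entity names.
history = [
    (2000, "League", "champion", "TeamA"),
    (2001, "League", "champion", "TeamB"),
    (2002, "League", "champion", "TeamA"),
]
prompt, entity_of = build_index_prompt(history, (2003, "League", "champion"))
# Hypothetical label probabilities, as if read from an LLM's next-token output.
ranking = rank_candidates({"0": 0.3, "1": 0.6, "7": 0.1}, entity_of)
```

In a real setup the retrieval stage would decide which facts enter `history`, and the label probabilities would come from the model's next-token distribution after the prompt.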
Key Findings
Pretrained LLMs (ICL) reach near‑SOTA forecasting performance without fine‑tuning.
LLMs outperform simple heuristics based on frequency or recency by a clear margin (e.g., +8.1 ppt Hits@1 over the frequency heuristic on ICEWS14).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Hits@1 (single-step) | GPT-NeoX (Entity) 0.784 on YAGO | Timetraveler 0.845 (best supervised on YAGO) | -6.1 ppt (≈ -7.2% rel.) vs best supervised; within -3.6% to +1.5% vs median | YAGO (single-step) | Table 5; single-step block | Table 5 |
| Hits@1 vs heuristics | GPT-NeoX (Entity) 0.324 vs frequency 0.243 on ICEWS14 | frequency heuristic 0.243 (single-step) | +8.1 ppt (≈ +33% rel.) | ICEWS14 (single-step) | Table 5 and Table 10 | Table 10 |
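
The Delta column can be read either as an absolute gap in percentage points or as a change relative to the baseline. The short sketch below recomputes both from the Value and Baseline figures in the table; the helper name `deltas` is hypothetical.

```python
from typing import Tuple


def deltas(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gap in percentage points, relative change in percent)."""
    abs_ppt = (value - baseline) * 100
    rel_pct = (value - baseline) / baseline * 100
    return round(abs_ppt, 1), round(rel_pct, 1)


yago = deltas(0.784, 0.845)   # GPT-NeoX vs best supervised on YAGO
icews = deltas(0.324, 0.243)  # GPT-NeoX vs frequency heuristic on ICEWS14
print(yago, icews)
```

For the YAGO row this gives a gap of 6.1 percentage points, which is about 7.2% of the supervised score; for the ICEWS14 row it gives 8.1 points, or about 33% over the heuristic.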
What To Try In 7 Days
Serialize a small historical slice of your domain graph into the paper's 'index' prompt format and call a large pre‑trained LLM to rank candidate next facts.
Compare ICL predictions to simple heuristics (most recent/most frequent) and your existing supervised model on Hits@1 to gauge parity.
If data privacy is a concern, test anonymized numeric IDs in prompts; performance often stays similar.
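The comparison steps above can be prototyped in a few lines. This is a minimal sketch under stated assumptions: the function names, the quadruple layout `(timestamp, subject, relation, object)`, and the toy data are all hypothetical, and the heuristics are plain "most frequent" and "most recent" rankings rather than the paper's exact baseline implementations.

```python
from collections import Counter
from typing import Dict, List, Tuple

Quad = Tuple[int, str, str, str]  # (timestamp, subject, relation, object)


def frequency_baseline(history: List[Quad], subject: str, relation: str) -> List[str]:
    """Rank objects by how often they followed (subject, relation)."""
    counts = Counter(o for _, s, r, o in history if s == subject and r == relation)
    return [o for o, _ in counts.most_common()]


def recency_baseline(history: List[Quad], subject: str, relation: str) -> List[str]:
    """Rank objects by how recently they followed (subject, relation)."""
    ranked: List[str] = []
    for _, s, r, o in sorted(history, key=lambda q: q[0], reverse=True):
        if s == subject and r == relation and o not in ranked:
            ranked.append(o)
    return ranked


def hits_at_k(rankings: List[List[str]], gold: List[str], k: int = 1) -> float:
    """Fraction of queries whose gold answer is in the top k of its ranking."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


def anonymize(history: List[Quad]) -> List[Quad]:
    """Replace entity and relation names with numeric string IDs."""
    ent_ids: Dict[str, str] = {}
    rel_ids: Dict[str, str] = {}
    out: List[Quad] = []
    for t, s, r, o in history:
        s_id = ent_ids.setdefault(s, str(len(ent_ids)))
        o_id = ent_ids.setdefault(o, str(len(ent_ids)))
        r_id = rel_ids.setdefault(r, str(len(rel_ids)))
        out.append((t, s_id, r_id, o_id))
    return out


history = [(1, "a", "r", "x"), (2, "a", "r", "x"), (3, "a", "r", "y")]
freq = frequency_baseline(history, "a", "r")
rec = recency_baseline(history, "a", "r")
score = hits_at_k([freq, rec], ["x", "x"], k=1)
anon = anonymize(history)
```

Rankings produced by an LLM can be passed to the same `hits_at_k` function, which keeps the comparison with the heuristics and any supervised model on equal footing.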
Risks & Boundaries
Limitations
Experiments limited to small/medium open models due to compute; results may change with larger or different models.
The method assumes candidate answers appear in the observed history (a transductive setting); it does not handle unseen entities (the inductive setting).
When Not To Use
When answers can be entities never observed in the history (the inductive case).
When you require calibrated probability estimates for downstream decision making.
Failure Modes
Top-token decoding may omit the numeric label of the correct answer; the paper assigns rank 100 to such missing tokens, which produces false negatives.
Accumulation of errors in multi‑step mode when the model's predictions are re‑fed as history.
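The first failure mode can be made concrete with a small sketch of the rank fallback. Only the fallback value of 100 comes from the summary above; the function name `gold_rank`, the list-of-labels input, and the example labels are assumptions for illustration.

```python
from typing import List


def gold_rank(decoded_labels: List[str], gold_label: str, missing_rank: int = 100) -> int:
    """Return the 1-based rank of the gold label among the decoded labels,
    or `missing_rank` when the decoder never produced that label."""
    try:
        return decoded_labels.index(gold_label) + 1
    except ValueError:
        return missing_rank


found = gold_rank(["2", "0", "1"], "0")  # gold label decoded in second place
missing = gold_rank(["2", "0"], "5")     # gold label absent from decoded tokens
reciprocal_ranks = [1 / found, 1 / missing]
```

A missing label contributes a reciprocal rank of 0.01 and never counts as a hit, so a correct entity whose label falls outside the decoded tokens is scored the same as a wrong prediction.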

