Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.3
Citation Count
6
Why It Matters For Business
If your product relies on event reasoning (timelines, forecasting, causal diagnosis), off-the-shelf LLMs can judge whether a single event is plausible and can reason about causes and intent, but they are likely to fail on timeline accuracy, counterfactual edits, and script forecasting. Test with EVEVAL before deployment.
Summary TLDR
The authors introduce EVEVAL, an 8-dataset benchmark that tests LLMs on event understanding (intra/inter), reasoning (causal, temporal, counterfactual, intent), and prediction (script and story). Evaluations (ChatGPT, BLOOM, BLOOMZ, Flan-T5) show LLMs reliably judge single-event plausibility and handle causal/intent relations, but they struggle with semantic similarity between events, temporality, counterfactual rewriting, and script-based prediction. Chain-of-thought adds little or can hurt; JSON-like structural event representations work about as well as natural language. In-context demonstrations improve many scores. Use EVEVAL to measure event-centric gaps before deploying LLMs in event‑
Problem Statement
There is no single, comprehensive benchmark to measure how well large language models understand, reason about, and predict events. That gap makes it hard to know which event skills LLMs have and where they fail in real tasks such as timeline construction, question answering, and action prediction.
Main Contribution
A hierarchical framework for event semantic processing covering understanding, reasoning, and prediction.
EVEVAL: a new benchmark composed of 8 existing datasets that span intra/inter-event understanding, multiple reasoning types, and prediction tasks.
Systematic experiments (zero-shot, chain-of-thought, and in-context learning) that surface strengths and weaknesses of current LLMs on event tasks.
Key Findings
LLMs learn single-event plausibility well but struggle to judge semantic similarity between events.
LLMs show strong causal and intent reasoning but weaker temporal and counterfactual reasoning.
Longer, richer context improves future-event prediction: models score higher on story-based prediction than on script-based prediction.
Chain-of-thought prompting gives little net gain and can reduce accuracy on many event tasks.
Structured (JSON) event representations perform about as well as plain natural language prompts.
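To make the last finding concrete, the sketch below renders the same event once as a JSON-like structure and once as a plain sentence, then wraps each in an identical question. The field names (`subject`, `predicate`, `object`), the helper functions, and the prompt wording are assumptions for illustration; they are not the schema or prompts used by the authors.

```python
import json

def event_to_json(subject: str, predicate: str, obj: str) -> str:
    """Serialize a hypothetical event triple as a JSON string."""
    return json.dumps({"subject": subject, "predicate": predicate, "object": obj})

def event_to_text(subject: str, predicate: str, obj: str) -> str:
    """Render the same triple as a plain natural-language sentence."""
    return f"{subject} {predicate} {obj}."

# One event, two representations, same question (wording is illustrative).
event = ("The chef", "cooked", "the meal")
json_prompt = "Is this event plausible? " + event_to_json(*event)
text_prompt = "Is this event plausible? " + event_to_text(*event)
```

Sending both prompts to the same model over a labelled set is one way to check whether the structured form costs or gains accuracy in your own setting.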
Results
[Results table not recovered from the source. Surviving labels: metric "Accuracy"; task "Counterfactual (TIMETRAVEL)".]
Who Should Care
What To Try In 7 Days
Run EVEVAL on your chosen LLM to measure event gaps relevant to your use case.
Add 8–16 in‑context demonstrations tailored to your event type; re-run and compare gains.
Replace or augment plain-text prompts with simple JSON event structures for downstream pipelines, and check whether the outputs become easier to parse.
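The first two steps can be sketched as a small harness that builds a prompt from in-context demonstrations and scores exact-match accuracy. Everything here is an assumption for illustration: the `Q:`/`A:` prompt layout, the function names, and the `model` argument, which stands for any callable that wraps your own LLM client and returns a string. This is not the authors' evaluation code.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)

def build_prompt(demos: List[Example], question: str) -> str:
    """Prepend solved demonstrations to the question; no demos means zero-shot."""
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def accuracy(model: Callable[[str], str],
             demos: List[Example],
             test_set: List[Example]) -> float:
    """Exact-match accuracy of `model` (a hypothetical LLM wrapper) on test_set."""
    if not test_set:
        return 0.0
    correct = sum(
        model(build_prompt(demos, question)).strip() == gold
        for question, gold in test_set
    )
    return correct / len(test_set)
```

Calling `accuracy` once with `demos=[]` and once with 8 to 16 demonstrations on the same test set gives the zero-shot versus in-context comparison suggested above.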
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- EVEVAL assembles 8 public datasets but does not cover all event varieties (multimodal events, domain-specific scripts).
- Reported results focus on zero-shot settings and a small set of LLMs; fine-tuned or task-specific models perform better on several tasks.
- Chain-of-thought outputs were sometimes hard to parse and sometimes reduced accuracy.
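The parsing problem in the last limitation can be reduced with a tolerant answer extractor. The sketch below assumes multiple-choice tasks with options labelled A to D and assumes the model states its choice after the word "answer"; both conventions are hypothetical and would need adjusting to your prompts. Outputs that do not match return `None`, so they can be counted separately instead of being silently scored as wrong.

```python
import re
from typing import Optional

# Matches phrases such as "the answer is (B)" or "Answer: C".
# The option letter itself must be uppercase A-D (an assumed label set).
ANSWER_PATTERN = re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-D])\)?\b")

def extract_choice(output: str) -> Optional[str]:
    """Return the last option letter stated in a verbose output, or None."""
    matches = ANSWER_PATTERN.findall(output)
    return matches[-1] if matches else None
```

Taking the last match rather than the first is a design choice: chain-of-thought outputs often mention a tentative option before settling on a final one.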
When Not To Use
- When you need precise temporal ordering or counterfactual story editing out of the box.
- When your product requires robust script-based event forecasting without fine-tuning.
- For safety-critical timeline decisions without adding specialised temporal models or human oversight.
Failure Modes
- CoT prompting can lower accuracy for event tasks when it produces verbose but unhelpful chains.
- Models return confident but wrong semantic-similarity judgments between events.
- Script-based predictions show high variance and low accuracy in zero-shot settings.
Core Entities
Models
- ChatGPT
- BLOOM
- BLOOMZ
- Flan-T5
- BART
- RoBERTa
Metrics
- Accuracy
- BLEU4
- ROUGE-L
Datasets
- DTFit
- HardSim
- ECARE
- TRACIE
- TIMETRAVEL
- SocialIQA
- MCNC
- SCT
- EVEVAL
Benchmarks
- EVEVAL

