Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
1
Why It Matters For Business
Decomposed, retrieval-enhanced prompting gives more accurate structured events without fine-tuning, reducing manual labeling and improving downstream dashboards and knowledge graphs in days rather than months.
Summary TLDR
The paper proposes a two-step, prompt-based pipeline for event extraction with LLMs: (1) Event Detection (ED) to find triggers and types, then (2) Event Argument Extraction (EAE) for role filling. Prompts are enriched with a precise schema, extraction rules, an output format, and retrieval-augmented examples (RAE) retrieved with a FAISS index over embeddings. On ACE05-EN, WikiEvents, and a synthetic MaritimeEvent dataset (~10k samples), the approach improves F1 over plain few-shot and prior LLM prompting; for example, GPT-4 5-shot+RAE achieves Trig-C/Arg-C of 81.09/58.24 on ACE05-EN and 84.32/60.79 on MaritimeEvent. ADA-002 embeddings worked best for retrieval. The method reduces hallucination risk but needs prompt engineering and L
Problem Statement
LLMs can extract structured events from text but often hallucinate or miss details when prompts are long or generic. The challenge is to get accurate triggers, event types, and argument roles from documents without large supervised fine-tuning.
Main Contribution
A two-step prompt pipeline that decomposes event extraction into Event Detection and Event Argument Extraction.
Schema-aware, granular prompts that include extraction rules, output format, and dynamic retrieval-augmented examples.
A synthetic MaritimeEvent dataset (~10k examples) for a maritime domain evaluation.
Empirical evidence that retrieval-augmented examples and prompt decomposition improve F1 over simple few-shot prompting and prior LLM prompting baselines.
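The two-step decomposition described above can be sketched as follows. This is a minimal illustration, not the paper's prompts: the `SCHEMA` contents, the prompt wording, the JSON output format, and the `fake_llm` stub are all assumptions made so the example runs without an API key. In practice `llm` would wrap a call to GPT-4 or gpt-3.5-turbo.

```python
import json
from typing import Callable, Dict, List

# Hypothetical toy schema (event type -> argument roles); not the paper's schema.
SCHEMA: Dict[str, List[str]] = {
    "Attack": ["Attacker", "Target", "Place"],
    "Transport": ["Agent", "Artifact", "Destination"],
}


def build_ed_prompt(text: str, examples: List[str]) -> str:
    """Step 1 (ED): ask only for triggers and event types."""
    return "\n".join([
        "Task: event detection.",
        "Event types: " + ", ".join(SCHEMA),
        "Rules: a trigger is the word or phrase that most clearly expresses the event.",
        'Output format: JSON list of {"trigger": str, "type": str}.',
        "Examples:",
        *examples,
        "Text: " + text,
    ])


def build_eae_prompt(text: str, event: Dict[str, str], examples: List[str]) -> str:
    """Step 2 (EAE): ask for the roles of one detected event only."""
    roles = SCHEMA[event["type"]]
    return "\n".join([
        "Task: event argument extraction.",
        f"Event type: {event['type']}; trigger: {event['trigger']}",
        "Roles: " + ", ".join(roles),
        "Rules: copy argument spans verbatim from the text; use null for missing roles.",
        "Output format: JSON object mapping each role to a span or null.",
        "Examples:",
        *examples,
        "Text: " + text,
    ])


def extract_events(text: str, llm: Callable[[str], str], examples: List[str]) -> List[dict]:
    """Run ED, then one EAE call per detected event whose type is in the schema."""
    detected = json.loads(llm(build_ed_prompt(text, examples)))
    events = []
    for event in detected:
        if event.get("type") not in SCHEMA:
            continue  # skip types outside the schema instead of passing them to EAE
        arguments = json.loads(llm(build_eae_prompt(text, event, examples)))
        events.append({**event, "arguments": arguments})
    return events


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call; returns canned JSON."""
    if prompt.startswith("Task: event detection"):
        return '[{"trigger": "attacked", "type": "Attack"}]'
    return '{"Attacker": "pirates", "Target": "the tanker", "Place": null}'
```

Passing the model in as a callable keeps the pipeline testable offline and makes it easy to swap models when comparing F1.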
Key Findings
Retrieval-augmented examples (RAE) plus decomposition raise ACE05-EN F1 for GPT-4: with 5-shot RAE it reaches 81.09 Trig-C and 58.24 Arg-C.
Decomposing extraction into ED and EAE improves accuracy over single-step prompts.
Embedding choice matters: ADA-002 produced the best retrieval results.
Results
ACE05-EN Trig-C (GPT-4, 5-shot RAE): 81.09
ACE05-EN Arg-C (GPT-4, 5-shot RAE): 58.24
MaritimeEvent Trig-C (GPT-4, 5-shot RAE): 84.32
MaritimeEvent Arg-C (GPT-4, 5-shot RAE): 60.79
WikiEvents Trig-C (GPT-4, 5-shot RAE)
Text2Event (T5-large) Trig-C / Arg-C (ACE05-EN)
Who Should Care
What To Try In 7 Days
Implement a 2-step prompt: ED then EAE for your event schema and test on a small validation set.
Add FAISS-based retrieval using ADA-002 embeddings to feed 3–5 nearest examples into prompts.
Run 5-shot experiments with gpt-3.5-turbo or GPT-4 and compare F1 against your current extractor.
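The retrieval step in the list above can be prototyped before wiring in real embeddings. The sketch below is a stand-in, not the paper's setup: it swaps ADA-002 embeddings for a toy bag-of-words vector and swaps the FAISS index for a brute-force cosine search in pure Python, so it runs with no dependencies. The function names are hypothetical. Only the interface (query in, k nearest labeled examples out) is meant to carry over.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_examples(query: str, pool: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k pool examples most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(example)), example) for example in pool]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

Brute-force search is fine for a small validation pool; an approximate or indexed search only starts to matter once the example pool grows large.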
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Relies on LLMs accessed through an API (GPT-4, gpt-3.5-turbo), which incur cost and raise privacy concerns.
- Long document prompts remain costly and may require large-context models or chunking.
- MaritimeEvent is synthetic (ChatGPT-generated) and may not reflect real-world distribution.
- WikiEvents has limited training data, reducing retrieval benefits there.
When Not To Use
- When you cannot send text to external LLM APIs for privacy or compliance reasons.
- When compute or budget prevents frequent large-model API calls.
- For extremely long documents without access to large-context LLMs or careful chunking.
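Where chunking is the chosen workaround for long documents, a simple overlapping window is one place to start. The sketch below is an assumption, not something the paper specifies: it splits on whitespace-separated words, and the `max_words` and `overlap` defaults are arbitrary. Events whose trigger and arguments fall in different chunks can still be missed, and events inside the overlap may be extracted twice and need deduplication.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```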
Failure Modes
- Retrieved examples are irrelevant and cause hallucination or wrong labels.
- ED errors cascade: wrong event types lead to wrong argument extraction.
- Prompt truncation or token limits drop essential schema or examples.
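One way to guard against the truncation failure above is to assemble the prompt under an explicit budget, so that retrieved examples are dropped before the schema or the input text. The sketch below is illustrative only: `approx_tokens` counts whitespace-separated words, which is a rough proxy and not how a real tokenizer counts, and the function names are hypothetical.

```python
from typing import List


def approx_tokens(text: str) -> int:
    """Rough proxy for token count; real tokenizers will give different numbers."""
    return len(text.split())


def fit_prompt(header: str, examples: List[str], text: str, budget: int) -> str:
    """Always keep header (schema, rules, format) and input text; add examples while they fit.

    `examples` is assumed to be sorted most relevant first, so the least
    relevant ones are the first to be dropped.
    """
    used = approx_tokens(header) + approx_tokens(text)
    if used > budget:
        raise ValueError("schema, rules, and input alone exceed the budget")
    kept = []
    for example in examples:
        cost = approx_tokens(example)
        if used + cost > budget:
            break
        kept.append(example)
        used += cost
    return "\n".join([header, *kept, text])
```

Failing loudly when the fixed parts do not fit is deliberate: a silently truncated schema is harder to diagnose than an exception.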
Core Entities
Models
- GPT-4
- gpt-3.5-turbo
- Llama2-7B
- T5-large
- RoBERTa-base
Metrics
- Trig-C
- Arg-C
- F1
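For comparing extractors on these metrics, the sketch below computes micro F1 over sets of predicted and gold tuples. It follows the usual convention in event extraction work, where a trigger counts as correct for Trig-C when its span and event type match the gold annotation, and an argument counts for Arg-C when its span, event type, and role all match. The paper's exact matching criteria are not stated here, so the tuple layouts are assumptions.

```python
from typing import Set, Tuple


def f1(pred: Set[Tuple], gold: Set[Tuple]) -> float:
    """Micro F1 over exact-match tuples; 0.0 when nothing matches."""
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Assumed tuple layouts:
#   Trig-C: (doc_id, trigger_span, event_type)
#   Arg-C:  (doc_id, event_type, argument_span, role)
gold_triggers = {("d1", "attacked", "Attack"), ("d1", "sailed", "Transport")}
pred_triggers = {("d1", "attacked", "Attack"), ("d1", "sailed", "Attack")}
trig_c = f1(pred_triggers, gold_triggers)  # one of two triggers has the right type
```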
Datasets
- ACE05-EN
- WIKIEVENTS
- MaritimeEvent
Context Entities
Models
- ChatGPT

