Overview
The method is ready for prototyping (it works with API-based LLMs and FAISS) but relies on proprietary models and prompt engineering. Results are consistent across datasets, but large documents and privacy constraints add deployment costs.
Citations: 1
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 5/6
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Decomposed, retrieval-enhanced prompting gives more accurate structured events without fine-tuning, reducing manual labeling and improving downstream dashboards and knowledge graphs in days rather than months.
Summary TLDR
The paper proposes a two-step, prompt-based pipeline for event extraction with LLMs: (1) Event Detection (ED) to find triggers and types, then (2) Event Argument Extraction (EAE) for role filling. Prompts are enriched with precise schema, extraction rules, output format and retrieval-augmented examples (RAE) fetched via FAISS embeddings. On ACE05-EN, WikiEvents and a synthetic MaritimeEvent (~10k samples) the approach improves F1 over plain few-shot and prior LLM prompting, e.g., GPT-4 5-shot+RAE achieves Trig-C/Arg-C 81.09/58.24 on ACE05-EN and 84.32/60.79 on MaritimeEvent. ADA-002 embeddings worked best for retrieval. The method reduces hallucination risk but needs prompt engineering and L
Problem Statement
LLMs can extract structured events from text but often hallucinate or miss details when prompts are long or generic. The challenge is to get accurate triggers, event types, and argument roles from documents without large supervised fine-tuning.
Main Contribution
A two-step prompt pipeline that decomposes event extraction into Event Detection and Event Argument Extraction.
Schema-aware, granular prompts that include extraction rules, output format, and dynamic retrieval-augmented examples.
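The two-step decomposition can be sketched as follows. This is a minimal illustration, not the paper's actual prompts: the `SCHEMA` contents, the prompt wording, and the `llm` callable (any function that maps a prompt string to a response string) are all assumptions made for the example.

```python
import json
from typing import Callable, Sequence

# Hypothetical schema: event types mapped to their argument roles.
SCHEMA = {
    "Attack": ["Attacker", "Target", "Place"],
    "Transport": ["Agent", "Artifact", "Destination"],
}


def build_ed_prompt(text: str, examples: Sequence[dict]) -> str:
    """Step 1 prompt: Event Detection (triggers and types)."""
    shots = "\n".join(
        f"Text: {e['text']}\nOutput: {json.dumps(e['events'])}" for e in examples
    )
    return (
        "Task: detect event triggers and classify their types.\n"
        f"Allowed event types: {', '.join(SCHEMA)}\n"
        "Rules: each trigger must be copied verbatim from the text; "
        "return [] if there is no event.\n"
        'Output format: JSON list of {"trigger": str, "type": str}.\n'
        f"{shots}\nText: {text}\nOutput:"
    )


def build_eae_prompt(text: str, event: dict) -> str:
    """Step 2 prompt: Event Argument Extraction for one detected event."""
    roles = SCHEMA[event["type"]]
    return (
        "Task: extract the arguments of one event.\n"
        f"Event type: {event['type']}; trigger: {event['trigger']}\n"
        f"Roles to fill: {', '.join(roles)}\n"
        "Rules: use null for roles not mentioned in the text.\n"
        "Output format: JSON object mapping each role to a text span or null.\n"
        f"Text: {text}\nOutput:"
    )


def extract_events(
    text: str, llm: Callable[[str], str], examples: Sequence[dict] = ()
) -> list:
    """Run ED first, then one EAE call per detected event."""
    events = json.loads(llm(build_ed_prompt(text, examples)))
    results = []
    for event in events:
        if event.get("type") not in SCHEMA:
            continue  # skip types outside the schema instead of guessing roles
        arguments = json.loads(llm(build_eae_prompt(text, event)))
        results.append({**event, "arguments": arguments})
    return results
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; the `examples` argument is where retrieved few-shot examples would be supplied.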
Key Findings
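Retrieval-augmented examples (RAE) plus decomposition raise ACE05-EN F1 for GPT-4: Trig-C improves by 5.18 points (to 81.09) and Arg-C by 6.29 points (to 58.24) over GPT-4 5-shot prompting without RAE (Table II).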
Decomposed prompting meaningfully improves accuracy vs single-step prompts.
Results
| Metric | Value (F1) | Baseline | Delta (F1 points) | Split / Dataset | Evidence Ref |
|---|---|---|---|---|---|
| ACE05-EN Trig-C (GPT-4, 5-shot RAE) | 81.09 | GPT-4 5-shot (no RAE) | +5.18 | ACE05-EN test | Table II |
| ACE05-EN Arg-C (GPT-4, 5-shot RAE) | 58.24 | GPT-4 5-shot (no RAE) | +6.29 | ACE05-EN test | Table II |
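For readers who want to reproduce scores of this kind on their own data, the following is a minimal sketch of micro-averaged F1. It assumes the common convention that a Trig-C prediction counts as correct only when both the trigger span and the event type match the gold annotation; the tuple layout `(doc_id, trigger, type)` is a hypothetical choice for the example, and the paper's exact scoring script is not reproduced here.

```python
def micro_f1(predicted, gold) -> float:
    """Micro F1 over sets of hashable items, e.g. (doc_id, trigger, type)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: two of three predictions match the four gold triggers.
gold = [
    ("d1", "attacked", "Attack"),
    ("d1", "fled", "Transport"),
    ("d2", "shipped", "Transport"),
    ("d2", "bombed", "Attack"),
]
predicted = [
    ("d1", "attacked", "Attack"),
    ("d1", "fled", "Transport"),
    ("d2", "shipped", "Attack"),  # right span, wrong type: counted as an error
]
score = micro_f1(predicted, gold)  # precision 2/3, recall 2/4
```

The same function can score arguments by switching the items to tuples that also carry the role and argument span.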
What To Try In 7 Days
Implement a 2-step prompt: ED then EAE for your event schema and test on a small validation set.
Add FAISS-based retrieval using ADA-002 embeddings to feed 3–5 nearest examples into prompts.
Run 5-shot experiments with gpt-3.5-turbo or GPT-4 and compare F1 against your current extractor.
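The retrieval step in the list above can be prototyped before any vector database is set up. The sketch below is a deliberately simplified stand-in: it uses a toy bag-of-words vector in place of ADA-002 embeddings and a brute-force cosine search in place of a FAISS index, so it shows only the selection logic. The names `embed`, `cosine`, and `retrieve_examples` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(query: str, pool: list, k: int = 3) -> list:
    """Return the k annotated examples most similar to the query text."""
    query_vec = embed(query)
    ranked = sorted(
        pool, key=lambda ex: cosine(query_vec, embed(ex["text"])), reverse=True
    )
    return ranked[:k]


# Hypothetical annotated pool; "events" would hold the gold annotations.
pool = [
    {"text": "the vessel left the port at dawn", "events": []},
    {"text": "troops attacked the village", "events": []},
    {"text": "a ship entered the harbor", "events": []},
]
nearest = retrieve_examples("the ship left the port", pool, k=2)
```

In a real setup, the embeddings of the pool would be computed once and stored in an index, and only the query would be embedded at request time.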
Risks & Boundaries
Limitations
Relies on API-access LLMs (GPT-4/GPT-3.5) which incur cost and privacy concerns.
Long document prompts remain costly and may require large-context models or chunking.
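The source mentions chunking only as an option and does not describe a strategy. One simple possibility, sketched here under that caveat, is a fixed-size word window with overlap, so that an event mention near a boundary appears whole in at least one chunk. The function name and default sizes are assumptions.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list:
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


document = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(document, max_words=4, overlap=1)
```

Because overlapping windows can yield the same event twice, extracted events would need to be deduplicated after the per-chunk runs.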
When Not To Use
When you cannot send text to external LLM APIs for privacy or compliance reasons.
When compute or budget prevents frequent large-model API calls.
Failure Modes
Retrieved examples are irrelevant and cause hallucination or wrong labels.
ED errors cascade: wrong event types lead to wrong argument extraction.
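Both failure modes can be partly contained with cheap checks between pipeline stages. The sketch below is one possible guard, not something the source describes: it drops retrieved examples below a similarity threshold and discards detections whose type is outside the schema or whose trigger does not occur in the text, so they never reach argument extraction. The function names and the `0.5` threshold are assumptions.

```python
def keep_relevant(scored_examples, min_score: float = 0.5) -> list:
    """Keep examples whose retrieval similarity meets the threshold.

    scored_examples: iterable of (example, similarity) pairs.
    """
    return [example for example, score in scored_examples if score >= min_score]


def valid_detections(text: str, detections, allowed_types) -> list:
    """Keep detections with a known type and a trigger found verbatim in text."""
    kept = []
    for det in detections:
        trigger = det.get("trigger")
        if det.get("type") in allowed_types and trigger and trigger in text:
            kept.append(det)
    return kept


text = "Rebels attacked the convoy."
detections = [
    {"trigger": "attacked", "type": "Attack"},
    {"trigger": "bombed", "type": "Attack"},   # trigger not in the text
    {"trigger": "convoy", "type": "Parade"},   # type not in the schema
]
checked = valid_detections(text, detections, allowed_types={"Attack"})
relevant = keep_relevant([("example A", 0.9), ("example B", 0.2)])
```

These checks catch only malformed or ungrounded output; a detection with a plausible but wrong event type still passes and still propagates to argument extraction.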

