Break event extraction into detect+extract and add schema-aware retrieval to cut hallucination and raise F1

Overview

Decision Snapshot: Needs Validation

Method is practically ready for prototyping (works with API LLMs and FAISS) but relies on proprietary models and prompt engineering; results are consistent across datasets but large-document and privacy constraints add deployment costs.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Fatemeh Shiri, Van Nguyen, Farhad Moghimifar, John Yoo, Gholamreza Haffari, Yuan-Fang Li

Links

Abstract / PDF

Why It Matters For Business

Decomposed, retrieval-enhanced prompting gives more accurate structured events without fine-tuning, reducing manual labeling and improving downstream dashboards and knowledge graphs in days rather than months.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, Product Manager

Summary TLDR

The paper proposes a two-step, prompt-based pipeline for event extraction with LLMs: (1) Event Detection (ED) to find triggers and types, then (2) Event Argument Extraction (EAE) for role filling. Prompts are enriched with precise schema, extraction rules, output format and retrieval-augmented examples (RAE) fetched via FAISS embeddings. On ACE05-EN, WikiEvents and a synthetic MaritimeEvent (~10k samples) the approach improves F1 over plain few-shot and prior LLM prompting, e.g., GPT-4 5-shot+RAE achieves Trig-C/Arg-C 81.09/58.24 on ACE05-EN and 84.32/60.79 on MaritimeEvent. ADA-002 embeddings worked best for retrieval. The method reduces hallucination risk but needs prompt engineering and L

Problem Statement

LLMs can extract structured events from text but often hallucinate or miss details when prompts are long or generic. The challenge is to get accurate triggers, event types, and argument roles from documents without large supervised fine-tuning.

Main Contribution

A two-step prompt pipeline that decomposes event extraction into Event Detection and Event Argument Extraction.

Schema-aware, granular prompts that include extraction rules, output format, and dynamic retrieval-augmented examples.
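
The two-step decomposition can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `SCHEMA` contents, the prompt wording, and the `llm` callable (any function that maps a prompt string to a JSON string) are hypothetical names introduced here.

```python
import json
from typing import Callable, Dict, List, Sequence

# Hypothetical toy schema (event type -> argument roles); not taken from the paper.
SCHEMA: Dict[str, List[str]] = {
    "Attack": ["Attacker", "Target", "Place"],
    "Transport": ["Agent", "Artifact", "Destination"],
}


def build_ed_prompt(text: str, examples: Sequence[str] = ()) -> str:
    """Step 1 (ED): ask only for triggers and their event types."""
    return "\n".join([
        "Task: event detection.",
        "Allowed event types: " + ", ".join(sorted(SCHEMA)),
        "Rule: each trigger must be copied verbatim from the text.",
        'Output format: a JSON list of {"trigger": "...", "type": "..."}.',
        *examples,  # retrieved demonstrations, if any
        "Text: " + text,
    ])


def build_eae_prompt(text: str, trigger: str, event_type: str,
                     examples: Sequence[str] = ()) -> str:
    """Step 2 (EAE): ask for the roles of one already-detected event."""
    return "\n".join([
        "Task: event argument extraction.",
        f"Event type: {event_type}; trigger: {trigger}",
        "Allowed roles: " + ", ".join(SCHEMA[event_type]),
        "Output format: a JSON object mapping each role to a text span.",
        *examples,
        "Text: " + text,
    ])


def extract_events(text: str, llm: Callable[[str], str],
                   examples: Sequence[str] = ()) -> List[dict]:
    """Run ED first, then one EAE call per detected event."""
    detections = json.loads(llm(build_ed_prompt(text, examples)))
    events = []
    for det in detections:
        if det["type"] not in SCHEMA:
            continue  # no role list exists for an unknown type, so skip it
        prompt = build_eae_prompt(text, det["trigger"], det["type"], examples)
        arguments = json.loads(llm(prompt))
        events.append({"trigger": det["trigger"], "type": det["type"],
                       "arguments": arguments})
    return events
```

Because each EAE prompt lists only the roles of one event type, it stays shorter than a single prompt that would have to carry the full schema.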

Key Findings

Retrieval-augmented examples (RAE) plus decomposition raise ACE05-EN F1 for GPT-4.

Numbers: +5.18 Trig-C, +6.29 Arg-C (GPT-4, 5-shot → 5-shot+RAE on ACE05-EN)

Practical Use: If you can add instance-level retrieval to prompts, expect F1 gains of roughly 5 to 6 points on sentence-level EE with GPT-4, as reported on ACE05-EN.

Evidence Ref: Table II (ACE05-EN, GPT-4 5-shot vs 5-shot RAE)

Decomposed prompting meaningfully improves accuracy vs single-step prompts.

Numbers: +8.3 Trig-C, +4.64 Arg-C (ChatGPT, 5-shot RAE, ACE05-EN)

Practical Use: Split EE into ED then EAE. Doing so reduces prompt complexity and raised trigger and argument F1 by roughly 4.6 to 8.3 points for ChatGPT on ACE05-EN.

Evidence Ref: Sec VI-1; Table II (ChatGPT rows)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ACE05-EN Trig-C (GPT-4, 5-shot RAE)	81.09	GPT-4 5-shot (no RAE)	+5.18	ACE05-EN test	Table II	Table II
ACE05-EN Arg-C (GPT-4, 5-shot RAE)	58.24	GPT-4 5-shot (no RAE)	+6.29	ACE05-EN test	Table II	Table II
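
Trig-C and Arg-C in the table are F1 scores. The sketch below shows how such scores are commonly computed, assuming the usual convention (not confirmed on this page) that a trigger counts as correct when its span and event type both match gold, and an argument counts as correct when its span, role, and event type all match. The tuple layouts and the sample data are hypothetical.

```python
from typing import Dict, Set, Tuple


def prf1(gold: Set[Tuple], pred: Set[Tuple]) -> Dict[str, float]:
    """Micro precision, recall, and F1 over exact-match tuples."""
    tp = len(gold & pred)  # predictions that exactly match a gold tuple
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Trig-C style tuples: (sentence_id, trigger_span, event_type).
gold_triggers = {(0, "attacked", "Attack"), (1, "sailed", "Transport")}
pred_triggers = {
    (0, "attacked", "Attack"),  # correct span and type
    (1, "sailed", "Attack"),    # right span, wrong type: counts as an error
    (1, "port", "Transport"),   # spurious trigger
}

scores = prf1(gold_triggers, pred_triggers)
```

For Arg-C the same function applies to tuples such as (sentence_id, event_type, argument_span, role). Scoring your own extractor this way before and after adding decomposition or retrieval gives deltas comparable in form to those in the table.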

What To Try In 7 Days

Implement a 2-step prompt: ED then EAE for your event schema and test on a small validation set.

Add FAISS-based retrieval using ADA-002 embeddings to feed 3–5 nearest examples into prompts.

Run 5-shot experiments with gpt-3.5-turbo or GPT-4 and compare F1 against your current extractor.
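
The retrieval step above can be prototyped before wiring in real embeddings. The sketch below is a dependency-free stand-in: `embed` is a toy bag-of-words vector in place of ADA-002 embeddings, and the brute-force ranking in `retrieve` takes the place of a FAISS index. The example pool and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, pool: List[Tuple[str, str]], k: int = 3) -> List[Tuple[str, str]]:
    """Return the k annotated examples whose text is most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def format_examples(examples: List[Tuple[str, str]]) -> str:
    """Render retrieved (text, annotation) pairs as few-shot demonstrations."""
    return "\n".join(f"Text: {text}\nOutput: {output}" for text, output in examples)


# Hypothetical annotated pool of (text, gold annotation) pairs.
POOL = [
    ("the ship sailed to the port", '[{"trigger": "sailed", "type": "Transport"}]'),
    ("rebels attacked the town", '[{"trigger": "attacked", "type": "Attack"}]'),
    ("the company hired a new ceo", "[]"),
]

nearest = retrieve("the vessel sailed to the harbor", POOL, k=2)
demo_block = format_examples(nearest)
```

With a real embedding model, the pool vectors would be computed once and stored in the index, so only the query needs embedding at prompt time.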

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Relies on API-access LLMs (GPT-4/GPT-3.5) which incur cost and privacy concerns.

Long document prompts remain costly and may require large-context models or chunking.

When Not To Use

When you cannot send text to external LLM APIs for privacy or compliance reasons.

When compute or budget prevents frequent large-model API calls.

Failure Modes

Retrieved examples are irrelevant and cause hallucination or wrong labels.

ED errors cascade: wrong event types lead to wrong argument extraction.
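
Both failure modes can be partly mitigated with cheap checks between pipeline stages. The sketch below is one possible set of guards, not something described in the paper: the function names, the substring-based span check, and the `min_score` default of 0.3 are all illustrative assumptions.

```python
from typing import Dict, List, Tuple


def validate_detections(text: str, detections: List[dict],
                        schema: Dict[str, List[str]]) -> List[dict]:
    """Keep detections whose type is in the schema and whose trigger occurs in the text."""
    lowered = text.lower()
    kept = []
    for det in detections:
        trigger = det.get("trigger", "")
        if det.get("type") in schema and trigger and trigger.lower() in lowered:
            kept.append(det)
    return kept


def validate_arguments(text: str, event_type: str, arguments: Dict[str, str],
                       schema: Dict[str, List[str]]) -> Dict[str, str]:
    """Drop roles outside the schema and spans that do not appear in the text."""
    allowed = schema.get(event_type, [])
    lowered = text.lower()
    return {
        role: span
        for role, span in arguments.items()
        if role in allowed and isinstance(span, str) and span.lower() in lowered
    }


def keep_relevant(scored_examples: List[Tuple[str, float]],
                  min_score: float = 0.3) -> List[str]:
    """Discard retrieved examples whose similarity falls below a threshold."""
    return [example for example, score in scored_examples if score >= min_score]
```

These checks only catch outputs that are inconsistent with the schema or the input text. A detection with a valid but wrong event type still passes and still propagates into argument extraction, so measuring ED accuracy separately remains necessary.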

Core Entities

Models

GPT-4, gpt-3.5-turbo, Llama2-7B, T5-large, RoBERTa-base

Metrics

Trig-C, Arg-C, F1

Datasets

ACE05-EN, WikiEvents, MaritimeEvent

Context Entities

Models

ChatGPT
