Break event extraction into detect+extract and add schema-aware retrieval to cut hallucination and raise F1

June 3, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The method is ready for prototyping (it works with API LLMs and FAISS) but relies on proprietary models and prompt engineering. Results are consistent across datasets, but long-document and privacy constraints add deployment cost.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Fatemeh Shiri, Van Nguyen, Farhad Moghimifar, John Yoo, Gholamreza Haffari, Yuan-Fang Li


Why It Matters For Business

Decomposed, retrieval-enhanced prompting gives more accurate structured events without fine-tuning, reducing manual labeling and improving downstream dashboards and knowledge graphs in days rather than months.

Summary TLDR

The paper proposes a two-step, prompt-based pipeline for event extraction with LLMs: (1) Event Detection (ED) to find triggers and types, then (2) Event Argument Extraction (EAE) for role filling. Prompts are enriched with precise schema, extraction rules, output format and retrieval-augmented examples (RAE) fetched via FAISS embeddings. On ACE05-EN, WikiEvents and a synthetic MaritimeEvent (~10k samples) the approach improves F1 over plain few-shot and prior LLM prompting, e.g., GPT-4 5-shot+RAE achieves Trig-C/Arg-C 81.09/58.24 on ACE05-EN and 84.32/60.79 on MaritimeEvent. ADA-002 embeddings worked best for retrieval. The method reduces hallucination risk but needs prompt engineering and L

Problem Statement

LLMs can extract structured events from text but often hallucinate or miss details when prompts are long or generic. The challenge is to get accurate triggers, event types, and argument roles from documents without large supervised fine-tuning.

Main Contribution

A two-step prompt pipeline that decomposes event extraction into Event Detection and Event Argument Extraction.

Schema-aware, granular prompts that include extraction rules, output format, and dynamic retrieval-augmented examples.
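The two contributions above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the schema, the prompt wording, the JSON output format, and the `call_llm` callable are all introduced here for illustration, and `fake_llm` is a stub standing in for a real API client.

```python
import json
from typing import Callable

# Hypothetical schema: event type -> allowed argument roles (illustrative only).
SCHEMA = {
    "Attack": ["Attacker", "Target", "Place"],
    "Transport": ["Agent", "Artifact", "Destination"],
}

def build_ed_prompt(text: str, examples: list[str]) -> str:
    """Step 1 prompt (Event Detection): schema, rules, output format, examples."""
    return "\n".join([
        "Task: find event triggers and their types.",
        "Allowed event types: " + ", ".join(SCHEMA),
        "Rule: a trigger is the word that most clearly expresses the event.",
        'Output format: JSON list of {"trigger": str, "type": str}.',
        "Examples:", *examples,
        "Text: " + text,
    ])

def build_eae_prompt(text: str, event: dict, examples: list[str]) -> str:
    """Step 2 prompt (Event Argument Extraction) for one detected event."""
    roles = SCHEMA[event["type"]]
    return "\n".join([
        f"Task: fill argument roles for the {event['type']} event "
        f"triggered by '{event['trigger']}'.",
        "Allowed roles: " + ", ".join(roles),
        "Rule: use exact spans from the text; use null when a role is absent.",
        "Output format: JSON object mapping each role to a span or null.",
        "Examples:", *examples,
        "Text: " + text,
    ])

def extract_events(text: str, call_llm: Callable[[str], str],
                   examples: tuple = ()) -> list[dict]:
    """Run ED first, then one EAE call per detected event."""
    detected = json.loads(call_llm(build_ed_prompt(text, list(examples))))
    results = []
    for event in detected:
        if event["type"] not in SCHEMA:  # skip types outside the schema
            continue
        args = json.loads(call_llm(build_eae_prompt(text, event, list(examples))))
        results.append({**event, "arguments": args})
    return results

def fake_llm(prompt: str) -> str:
    """Stub returning canned JSON so the sketch runs without an API key."""
    if prompt.startswith("Task: find event triggers"):
        return '[{"trigger": "attacked", "type": "Attack"}]'
    return '{"Attacker": "Rebels", "Target": "convoy", "Place": null}'

events = extract_events("Rebels attacked the convoy.", fake_llm)
```

Because the second prompt lists only the roles for the detected type, each call stays short, which is the practical point of the decomposition. Replacing `fake_llm` with a real client requires only a function that takes a prompt string and returns the model's text.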

Key Findings

Retrieval-augmented examples (RAE) combined with decomposition raise ACE05-EN F1 for GPT-4.

Numbers: +5.18 Trig-C, +6.29 Arg-C (GPT-4, 5-shot → 5-shot+RAE on ACE05-EN)

Practical Use: If you can add instance-level retrieval to prompts, expect several-point F1 gains on sentence-level EE with GPT-4; implement RAE for higher accuracy.

Evidence Ref: Table II (ACE05-EN, GPT-4 5-shot vs 5-shot RAE)

Decomposed prompting meaningfully improves accuracy vs single-step prompts.

Numbers: +8.3 Trig-C, +4.64 Arg-C (ChatGPT, 5-shot RAE, ACE05-EN)

Practical Use: Split EE into ED then EAE. Doing so reduces prompt complexity and can raise trigger and argument F1 by ~4–8 points in practice.

Evidence Ref: Sec VI-1; Table II (ChatGPT rows)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ACE05-EN Trig-C (GPT-4, 5-shot RAE) | 81.09 | GPT-4 5-shot (no RAE) | +5.18 | ACE05-EN test | Table II | Table II
ACE05-EN Arg-C (GPT-4, 5-shot RAE) | 58.24 | GPT-4 5-shot (no RAE) | +6.29 | ACE05-EN test | Table II | Table II
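
To compare your own extractor against these numbers, you need a comparable score. The sketch below computes an F1 over exact-match tuples, assuming the common convention that a predicted trigger or argument counts as correct only if every field matches a gold item. The tuple layouts and the sample data are hypothetical, and the paper's exact matching criteria may differ.

```python
def f1_score(predicted: set, gold: set) -> float:
    """F1 over exact-match tuples; returns 0.0 when nothing matches."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical trigger tuples: (sentence_id, trigger_span, event_type)
gold_triggers = {(0, "attacked", "Attack"), (1, "shipped", "Transport")}
pred_triggers = {(0, "attacked", "Attack"), (1, "shipped", "Attack")}

# Hypothetical argument tuples: (sentence_id, event_type, role, argument_span)
gold_args = {(0, "Attack", "Target", "convoy"), (0, "Attack", "Attacker", "rebels")}
pred_args = {(0, "Attack", "Target", "convoy")}

trig_f1 = f1_score(pred_triggers, gold_triggers)  # one of two triggers typed correctly
arg_f1 = f1_score(pred_args, gold_args)           # one of two gold arguments recovered
```

Including the event type in the argument tuple means a mistyped event also costs its arguments, which is why argument scores sit below trigger scores in a two-step pipeline.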

What To Try In 7 Days

Implement a 2-step prompt: ED then EAE for your event schema and test on a small validation set.

Add FAISS-based retrieval using ADA-002 embeddings to feed 3–5 nearest examples into prompts.

Run 5-shot experiments with gpt-3.5-turbo or GPT-4 and compare F1 gains against current extractor.
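
The retrieval step can be prototyped before wiring up any external service. In the sketch below, a toy bag-of-words vector and a brute-force cosine scan stand in for ADA-002 embeddings and a FAISS index, so that it runs with the standard library alone. The example pool, the `embed` function, and the value of `k` are all assumptions for illustration; in a real setup you would swap `embed` for an embedding API call and the linear scan for a FAISS index.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_examples(query: str, pool: list[dict], k: int = 3) -> list[dict]:
    """Return the k annotated examples most similar to the query text."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["text"])), reverse=True)
    return ranked[:k]

# Hypothetical pool of annotated examples.
POOL = [
    {"text": "the vessel departed the port at dawn", "label": "Transport"},
    {"text": "rebels attacked the convoy near the border", "label": "Attack"},
    {"text": "the company hired a new chief executive", "label": "Start-Position"},
]

top = retrieve_examples("militants attacked the convoy", POOL, k=2)
```

The retrieved items would then be formatted as the few-shot examples in the ED and EAE prompts, so each input sees demonstrations close to its own content rather than a fixed set.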

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Relies on API-access LLMs (GPT-4/GPT-3.5) which incur cost and privacy concerns.

Prompts over long documents remain costly and may require large-context models or chunking.
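
One way to handle the long-document cost noted above is to split text into overlapping chunks before prompting. A minimal sketch, assuming word counts are an acceptable rough proxy for tokens; `max_words` and `overlap` are illustrative parameters, not values from the paper.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word windows that overlap by `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):  # last window reached the end
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
parts = chunk_words(sample, max_words=4, overlap=1)
```

The overlap reduces the chance of cutting an event in half, but it also means the same event can be extracted twice, so results from adjacent chunks would need de-duplication.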

When Not To Use

When you cannot send text to external LLM APIs for privacy or compliance reasons.

When compute or budget prevents frequent large-model API calls.

Failure Modes

Retrieved examples are irrelevant and cause hallucination or wrong labels.

ED errors cascade: wrong event types lead to wrong argument extraction.
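
Both failure modes can be partly mitigated with cheap checks between pipeline stages. The sketch below is an assumption about how one might do this, not something the paper describes: `min_score` is a hypothetical similarity cutoff, and `allowed_types` is a hypothetical set of event types from your schema.

```python
def filter_examples(scored_examples: list[tuple[dict, float]],
                    min_score: float = 0.3) -> list[dict]:
    """Drop retrieved examples whose similarity falls below the cutoff."""
    return [ex for ex, score in scored_examples if score >= min_score]

def validate_detections(detections: list[dict], text: str,
                        allowed_types: set[str]) -> list[dict]:
    """Keep ED outputs whose type is in the schema and whose trigger occurs in the text."""
    return [
        d for d in detections
        if d.get("type") in allowed_types
        and d.get("trigger")
        and d["trigger"] in text
    ]

scored = [({"text": "close match"}, 0.82), ({"text": "weak match"}, 0.11)]
kept_examples = filter_examples(scored)

detections = [
    {"trigger": "attacked", "type": "Attack"},
    {"trigger": "exploded", "type": "Attack"},   # trigger not in the text
    {"trigger": "attacked", "type": "Meeting"},  # type not in the schema
]
kept_events = validate_detections(
    detections, "Rebels attacked the convoy.", {"Attack", "Transport"}
)
```

These checks catch only out-of-schema types and triggers absent from the text. A detection that picks the wrong but valid event type will still pass and carry its error into argument extraction.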

Core Entities

Models

GPT-4, gpt-3.5-turbo, Llama2-7B, T5-large, RoBERTa-base

Metrics

Trig-C, Arg-C, F1

Datasets

ACE05-EN, WIKIEVENTS, MaritimeEvent

Context Entities

Models

ChatGPT