Break event extraction into detect+extract and add schema-aware retrieval to cut hallucination and raise F1

June 3, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

1

Authors

Fatemeh Shiri, Van Nguyen, Farhad Moghimifar, John Yoo, Gholamreza Haffari, Yuan-Fang Li

Links

Abstract / PDF

Why It Matters For Business

Decomposed, retrieval-enhanced prompting gives more accurate structured events without fine-tuning, reducing manual labeling and improving downstream dashboards and knowledge graphs in days rather than months.

Summary TLDR

The paper proposes a two-step, prompt-based pipeline for event extraction with LLMs: (1) Event Detection (ED) to find triggers and types, then (2) Event Argument Extraction (EAE) for role filling. Prompts are enriched with precise schema, extraction rules, output format and retrieval-augmented examples (RAE) fetched via FAISS embeddings. On ACE05-EN, WikiEvents and a synthetic MaritimeEvent (~10k samples) the approach improves F1 over plain few-shot and prior LLM prompting, e.g., GPT-4 5-shot+RAE achieves Trig-C/Arg-C 81.09/58.24 on ACE05-EN and 84.32/60.79 on MaritimeEvent. ADA-002 embeddings worked best for retrieval. The method reduces hallucination risk but needs prompt engineering and L

Problem Statement

LLMs can extract structured events from text but often hallucinate or miss details when prompts are long or generic. The challenge is to get accurate triggers, event types, and argument roles from documents without large supervised fine-tuning.

Main Contribution

A two-step prompt pipeline that decomposes event extraction into Event Detection and Event Argument Extraction.

Schema-aware, granular prompts that include extraction rules, output format, and dynamic retrieval-augmented examples.

A synthetic MaritimeEvent dataset (~10k examples) for a maritime domain evaluation.

Empirical evidence that retrieval-augmented examples and prompt decomposition improve F1 vs simple few-shot prompting and prior LLM prompting baselines.
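The two-step decomposition can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the schema contents, the prompt wording, the JSON output format, and the `call_llm` callable are all assumptions made for the example, and the same example list is reused for both steps for brevity.

```python
import json
from typing import Callable, Sequence

# Hypothetical schema for illustration: event type -> allowed argument roles.
SCHEMA = {
    "Transport": ["Agent", "Artifact", "Origin", "Destination"],
    "Attack": ["Attacker", "Target", "Place"],
}

def build_ed_prompt(text: str, examples: Sequence[str]) -> str:
    """Step 1 (ED): ask only for triggers and their event types."""
    return "\n".join([
        "Task: event detection.",
        "Allowed event types: " + ", ".join(SCHEMA),
        "Rules: use only allowed types; copy each trigger verbatim from the text.",
        'Output: a JSON list of {"trigger": "...", "type": "..."} objects.',
        "Examples:",
        *examples,
        "Text: " + text,
    ])

def build_eae_prompt(text: str, trigger: str, event_type: str,
                     examples: Sequence[str]) -> str:
    """Step 2 (EAE): ask for the roles of one already-detected event."""
    return "\n".join([
        "Task: event argument extraction.",
        f"Event type: {event_type}; trigger: {trigger}",
        "Allowed roles: " + ", ".join(SCHEMA[event_type]),
        "Rules: fill only roles stated in the text; omit the rest.",
        "Output: a JSON object mapping role to text span.",
        "Examples:",
        *examples,
        "Text: " + text,
    ])

def extract_events(text: str, call_llm: Callable[[str], str],
                   examples: Sequence[str] = ()) -> list[dict]:
    """Run ED, then one EAE call per detected event."""
    detected = json.loads(call_llm(build_ed_prompt(text, examples)))
    events = []
    for item in detected:
        event_type = item["type"]
        if event_type not in SCHEMA:  # drop types outside the schema
            continue
        raw_args = json.loads(
            call_llm(build_eae_prompt(text, item["trigger"], event_type, examples)))
        allowed = SCHEMA[event_type]
        events.append({
            "trigger": item["trigger"],
            "type": event_type,
            # keep only roles the schema allows for this event type
            "arguments": {r: v for r, v in raw_args.items() if r in allowed},
        })
    return events
```

Here `call_llm` is any function that takes a prompt string and returns the model's text reply, so the same sketch can wrap GPT-4, gpt-3.5-turbo, or a local model. Keeping each prompt limited to one sub-task is the point of the decomposition: the EAE prompt lists only the roles of the detected event type rather than the whole schema.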

Key Findings

Retrieval-augmented examples (RAE) plus decomposition raise ACE05-EN F1 for GPT-4.

Numbers: +5.18 Trig-C, +6.29 Arg-C (GPT-4, 5-shot → 5-shot+RAE on ACE05-EN)

Decomposed prompting meaningfully improves accuracy vs single-step prompts.

Numbers: +8.3 Trig-C, +4.64 Arg-C (ChatGPT, 5-shot RAE, ACE05-EN)

Embedding choice matters: ADA-002 produced the best retrieval results.

Numbers: ADA-002 vs USE: +3.57 Trig-C, +1.46 Arg-C (ACE05-EN, GPT-4)

Results

ACE05-EN Trig-C (GPT-4, 5-shot RAE)

Value: 81.09

Baseline: GPT-4 5-shot (no RAE)

ACE05-EN Arg-C (GPT-4, 5-shot RAE)

Value: 58.24

Baseline: GPT-4 5-shot (no RAE)

MaritimeEvent Trig-C (GPT-4, 5-shot RAE)

Value: 84.32

Baseline: GPT-4 5-shot (no RAE)

MaritimeEvent Arg-C (GPT-4, 5-shot RAE)

Value: 60.79

Baseline: GPT-4 5-shot (no RAE)

WikiEvents Trig-C (GPT-4, 5-shot RAE)

Value: 64.65

Baseline: GPT-4 5-shot (no RAE)

Text2Event (T5-large) Trig-C / Arg-C (ACE05-EN)

Value: 71.9 / 53.8

Who Should Care

What To Try In 7 Days

Implement a 2-step prompt: ED then EAE for your event schema and test on a small validation set.

Add FAISS-based retrieval using ADA-002 embeddings to feed 3–5 nearest examples into prompts.

Run 5-shot experiments with gpt-3.5-turbo or GPT-4 and compare F1 against your current extractor.
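The retrieval step above amounts to a nearest-neighbour lookup over embedded training examples. The sketch below shows only that logic, using the standard library so it runs anywhere: the toy bag-of-words `embed` function stands in for a real embedding model such as ADA-002, and the brute-force cosine search stands in for a FAISS index. The `ExampleStore` class and the example records are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> dict[str, float]:
    """Toy L2-normalised bag-of-words vector (a stand-in for a real embedding model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two normalised sparse vectors, i.e. cosine similarity."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

class ExampleStore:
    """Holds labelled examples and returns the k most similar to a query."""

    def __init__(self, examples: list[dict]):
        self.examples = examples
        self.vectors = [embed(e["text"]) for e in examples]

    def top_k(self, query: str, k: int = 5) -> list[dict]:
        q = embed(query)
        ranked = sorted(range(len(self.examples)),
                        key=lambda i: cosine(q, self.vectors[i]),
                        reverse=True)
        return [self.examples[i] for i in ranked[:k]]

store = ExampleStore([
    {"text": "the vessel docked at the port", "label": "Arrive"},
    {"text": "troops attacked the village", "label": "Attack"},
    {"text": "the ship left the harbor", "label": "Depart"},
])
nearest = store.top_k("a vessel docked at port yesterday", k=1)
```

In a real setup the vectors would come from the embedding API and be stored in a vector index, but the contract is the same: embed the input, fetch the 3 to 5 closest labelled examples, and paste them into the prompt as few-shot demonstrations.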

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on API-accessed LLMs (GPT-4/GPT-3.5), which incur cost and raise privacy concerns.
  • Long document prompts remain costly and may require large-context models or chunking.
  • MaritimeEvent is synthetic (ChatGPT-generated) and may not reflect real-world distribution.
  • WikiEvents has limited training data, reducing retrieval benefits there.

When Not To Use

  • When you cannot send text to external LLM APIs for privacy or compliance reasons.
  • When compute or budget prevents frequent large-model API calls.
  • For extremely long documents without access to large-context LLMs or careful chunking.
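Where chunking is the chosen workaround for long documents, one simple option is a sliding window with overlap, so that an event straddling a boundary still appears whole in at least one chunk. The sketch below is an assumption-laden illustration, not something the paper specifies: it counts whitespace-separated words as a rough proxy for tokens, and the function name and default sizes are made up.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):  # last window reached the end
            break
    return chunks
```

Each chunk would then go through extraction separately. Events found twice in the overlap region need de-duplicating afterwards, and any character offsets have to be mapped back to the original document.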

Failure Modes

  • Retrieved examples are irrelevant and cause hallucination or wrong labels.
  • ED errors cascade: wrong event types lead to wrong argument extraction.
  • Prompt truncation or token limits drop essential schema or examples.
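Two of these failure modes can be partly guarded against in code. The sketch below is one possible approach under stated assumptions, not the paper's method: `validate_detections` drops ED outputs whose type is outside the schema or whose trigger does not appear in the text, which limits what cascades into argument extraction, and `assemble_prompt` drops the lowest-ranked examples first so that the schema and input text are never the part that gets cut. Both function names are hypothetical, and word counts are used as a crude proxy for tokens.

```python
def validate_detections(text: str, detections: list[dict],
                        allowed_types: set[str]) -> list[dict]:
    """Keep detections with a known event type and a trigger found verbatim in text."""
    return [d for d in detections
            if d.get("type") in allowed_types and d.get("trigger", "") in text
            and d.get("trigger")]

def assemble_prompt(schema_block: str, examples: list[str], text: str,
                    max_words: int = 3000) -> tuple[str, int]:
    """Build a prompt within budget; returns (prompt, number of examples kept).

    Examples are assumed to be ordered most relevant first.
    """
    def n_words(s: str) -> int:
        return len(s.split())

    used = n_words(schema_block) + n_words(text)
    if used > max_words:
        raise ValueError("schema and text alone exceed the budget; chunk the text")
    kept = []
    for example in examples:
        cost = n_words(example)
        if used + cost > max_words:
            break  # stop here so the kept examples stay a top-ranked prefix
        kept.append(example)
        used += cost
    return "\n\n".join([schema_block, *kept, text]), len(kept)
```

Neither check prevents irrelevant retrieved examples from misleading the model; for that, a minimum similarity threshold on retrieval is a common complement.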

Core Entities

Models

  • GPT-4
  • gpt-3.5-turbo
  • Llama2-7B
  • T5-large
  • RoBERTa-base

Metrics

  • Trig-C
  • Arg-C
  • F1
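For reference, these metrics can be computed as set-overlap F1. The sketch below follows the usual event-extraction convention, which is assumed here rather than taken from the paper: a trigger counts as correct for Trig-C when its span and event type both match gold, and an argument counts as correct for Arg-C when its span, role, and event type all match. The tuple layouts are hypothetical.

```python
def f1_score(pred: set, gold: set) -> float:
    """Micro F1 between predicted and gold item sets (exact tuple match)."""
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

# Trig-C items: (doc_id, start, end, event_type)
gold_triggers = {("d1", 4, 10, "Transport"), ("d1", 20, 28, "Attack")}
pred_triggers = {("d1", 4, 10, "Transport"), ("d1", 20, 28, "Transport")}

# Arg-C items: (doc_id, event_type, start, end, role)
gold_args = {("d1", "Transport", 0, 3, "Artifact"), ("d1", "Transport", 14, 18, "Destination")}
pred_args = {("d1", "Transport", 0, 3, "Artifact")}

trig_c = f1_score(pred_triggers, gold_triggers)  # one of two triggers has the right type
arg_c = f1_score(pred_args, gold_args)           # precision 1.0, recall 0.5
```

Because an argument is only correct when its event type is correct, Arg-C is bounded by detection quality, which is why ED errors cascade into the argument scores.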

Datasets

  • ACE05-EN
  • WikiEvents
  • MaritimeEvent

Context Entities

Models

  • ChatGPT