Automate evidence-backed peacebuilding reports with a dynamic RAG pipeline

May 14, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

0

Authors

Poli A. Nemkova, Suleyman O. Polat, Rafid I. Jahan, Sagnik Ray Choudhury, Sun-joo Lee, Shouryadipta Sarkar, Mark V. Albert

Links

Abstract / PDF

Why It Matters For Business

Automating draft situation reports cuts analyst drafting time by roughly half (from up to 2 weeks to about 1 week, by the authors' estimate) and extends monitoring coverage using only open public data, which can lower operating costs for NGOs and public agencies.

Summary TLDR

The paper builds a dynamic Retrieval-Augmented Generation (RAG) pipeline that pulls public data (GDELT, ACLED, ReliefWeb, World Bank), embeds text with MiniLM, indexes it with FAISS, and prompts LLMs (GPT-4o, LLaMA 3) to produce situation awareness reports for peacebuilding. Evaluation uses three layers: automated metrics (VERISCORE, SummaC, bias detectors), human expert review (UNDP staff), and LLM-as-a-judge (GPT, LLaMA, Claude). Based on 60 generated reports, the authors estimate that the system cuts analyst drafting time roughly in half (approx. 2 weeks → 1 week), but the drafts still need human review because of redundancy, source bias, and regional coverage gaps. Code and an adapted VERISCORE tool (ragve) are shared on GitHub.

Problem Statement

Manual situation awareness reports take weeks and must combine diverse, real-time sources. Human work is slow and hard to scale. The authors aim to automate initial report drafts while keeping evidence grounding and human oversight.

Main Contribution

A dynamic RAG pipeline that builds query-specific knowledge bases from GDELT, ACLED, ReliefWeb, and World Bank data.

A three-level evaluation framework: automated NLP metrics, human expert review (UNDP), and LLM-as-a-judge comparisons.

A modified VERISCORE implementation (ragve) that checks whether outputs are grounded in the retrieved evidence, released on GitHub together with the project code.
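
The retrieval step behind a query-specific knowledge base can be sketched as follows. This is a minimal, standard-library-only illustration and not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for MiniLM embeddings, the brute-force `top_k` search stands in for a FAISS index, and the sample documents are invented.

```python
import hashlib
import math
import re

DIM = 256  # assumed toy embedding size, unrelated to MiniLM's dimensionality


def toy_embed(text: str) -> list[float]:
    """Hash each lowercase word into a fixed-size vector, then L2-normalize."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def top_k(query: str, docs: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k documents with the highest cosine similarity to the query."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, toy_embed(d))), d) for d in docs]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


# Invented snippets, one per kind of source named in the paper.
docs = [
    "ACLED: protest events reported in the capital during March",
    "World Bank: annual inflation indicator for the country",
    "ReliefWeb: humanitarian update on displacement after flooding",
]
hits = top_k("protest events in the capital", docs, k=1)
print(hits[0][1])
```

In a real pipeline the embedding function and the search would be swapped for a sentence-embedding model and a vector index; the retrieve-then-rank shape stays the same.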

Key Findings

The system produced 60 situation reports from 15 input sets (countries × time windows).

Numbers: 60 reports from 15 input sets

Automated factuality (VERISCORE) varied by model and prompt; for example, LLaMA with prompt 1 scored 0.91 while GPT-4o with prompt 1 scored 0.76.

Numbers: VERISCORE: LLaMA p1 0.91; GPT p1 0.76

Human experts scored GPT-generated reports at 62% on average across the binary criteria and preferred GPT outputs in 76% of pairwise comparisons.

Numbers: 62% avg binary; 76% preferred

Inter-annotator agreement between the two UNDP experts was moderate: Cohen's Kappa ≈ 0.54–0.57 for the top conditions.

Numbers: Cohen's Kappa ≈ 0.54–0.57

Estimated analyst time fell by about 50%, from up to 2 weeks to about 1 week, when generated drafts served as the starting point.

Numbers: Time: 2 weeks → 1 week (~50%)
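
The report count follows from the experimental grid given under Results (15 input sets × 2 models × 2 prompts). A small sketch that enumerates such a grid; the input-set labels and the prompt names are placeholders, not the paper's identifiers.

```python
from itertools import product

# Placeholder labels: the paper's actual country/time-window sets are not listed here.
input_sets = [f"input_set_{i:02d}" for i in range(1, 16)]
models = ["GPT-4o", "LLaMA 3"]
prompts = ["prompt1", "prompt2"]  # the second prompt's name is assumed

# One generation run per (input set, model, prompt) combination.
runs = list(product(input_sets, models, prompts))
print(len(runs))
```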

Results

Total reports generated

Value: 60 reports (15 input sets × 2 models × 2 prompts)

VERISCORE (average, GPT prompt1)

Value: 0.76

VERISCORE (average, LLaMA prompt1)

Value: 0.91

RAG-VERISCORE (example)

Value: 0.65 (GPT p1)

Human binary evaluation (avg. max score, GPT p1)

Value: 0.62

Human preference (pairwise) for GPT outputs

Value: 76% preferred GPT-generated reports

Inter-annotator agreement (Cohen's Kappa)

Value: ≈0.54–0.57 (moderate)
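
For reference, Cohen's Kappa compares the observed agreement between two raters (p_o) with the agreement expected by chance (p_e): kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two lists of labels. The ratings are invented for illustration and are not the UNDP experts' data.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items that received identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented binary ratings (1 = criterion met, 0 = not met).
a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
b = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
kappa = cohens_kappa(a, b)
print(round(kappa, 3))
```

Values between roughly 0.41 and 0.60 are conventionally read as moderate agreement, which is the band the reported figures fall into.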

Who Should Care

What To Try In 7 Days

Wire public APIs (GDELT, ACLED, ReliefWeb, World Bank) into a simple fetch+clean pipeline.

Encode text with MiniLM and index vectors with FAISS for fast retrieval.

Prototype a prompt that asks the LLM to cite sources and produce a short structured report section-by-section.
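
A possible starting point for the prompting step is a builder that numbers the retrieved passages and requests one section at a time with bracketed citations. The `Passage` fields, the section name, the word limit, and the instruction wording are illustrative assumptions, not the paper's prompts.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # e.g. "ACLED" or "ReliefWeb"
    text: str


def build_section_prompt(country: str, section: str, passages: list[Passage]) -> str:
    """Assemble a prompt that requests one report section with [n] citations."""
    evidence = "\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        f"You are drafting the '{section}' section of a situation report on {country}.\n"
        "Use only the numbered evidence below and cite it as [n] after each claim.\n"
        "If the evidence does not cover a point, say so instead of guessing.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Write the '{section}' section in at most 150 words."
    )


# Placeholder passages; real ones would come from the retrieval step.
passages = [
    Passage("ACLED", "Placeholder event summary."),
    Passage("World Bank", "Placeholder indicator summary."),
]
prompt = build_section_prompt("Country X", "Security overview", passages)
print(prompt)
```

Generating section by section keeps each prompt short and makes it easier to trace a claim back to a numbered passage during human review.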

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Source bias and uneven media coverage can skew reports.
  • LLMs still produce redundant or incomplete sections requiring human cleanup.
  • Human evaluation is subjective and showed only moderate inter-annotator agreement.
  • System lacks forecasting, richer visualizations, and advanced bias mitigation in this version.

When Not To Use

  • For fully automated high-stakes decisions without human review.
  • When input data is proprietary or not publicly accessible.
  • In regions with very sparse media coverage where retrieval will miss events.

Failure Modes

  • Hallucinated facts when evidence is weak or missing.
  • Redundant information repeated across sections (flagged by evaluators under Q4).
  • Bias inherited from source datasets and uneven coverage across regions.
  • LLM-as-judge overconfidence: models may rate their own outputs artificially high.
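
The redundancy failure mode lends itself to a cheap automated pre-check before human review. The sketch below flags pairs of sentences from different sections whose word-level Jaccard similarity reaches a threshold. The 0.6 threshold, the naive sentence splitter, and the sample report are assumptions for illustration; this check is not part of the paper's evaluation.

```python
import re
from itertools import combinations


def sentences(text: str) -> list[str]:
    """Naive splitter on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two sentences, in [0, 1]."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def redundant_pairs(report: dict[str, str], threshold: float = 0.6):
    """Return (section_a, section_b, sentence_a, sentence_b) for near-duplicates."""
    flagged = []
    for (sec_a, txt_a), (sec_b, txt_b) in combinations(report.items(), 2):
        for sa in sentences(txt_a):
            for sb in sentences(txt_b):
                if jaccard(sa, sb) >= threshold:
                    flagged.append((sec_a, sec_b, sa, sb))
    return flagged


# Invented two-section report with one repeated statement.
report = {
    "Security": "Protests increased in the capital in March. Patrols were expanded.",
    "Politics": "In March protests increased in the capital. Talks are scheduled for April.",
}
flags = redundant_pairs(report)
for sec_a, sec_b, sa, sb in flags:
    print(f"{sec_a} / {sec_b}: {sa!r} ~ {sb!r}")
```

Word overlap only catches near-verbatim repetition; paraphrased duplicates would need embedding-based similarity or a human reader.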

Core Entities

Models

  • GPT-4o
  • LLaMA 3
  • Claude 2
  • MiniLM

Metrics

  • VERISCORE
  • RAG-VERISCORE
  • SummaC
  • politicalBiasBERT
  • Coherence (BERT-based)
  • Cohen's Kappa

Datasets

  • GDELT
  • ACLED
  • ReliefWeb
  • World Bank (API)