Automate evidence-backed peacebuilding reports with a dynamic RAG pipeline

May 14, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The pipeline is practical and tested on real public sources, but the evaluation is limited to 60 reports and requires human-in-the-loop checks due to redundancy and regional data gaps.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/7

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Poli A. Nemkova, Suleyman O. Polat, Rafid I. Jahan, Sagnik Ray Choudhury, Sun-joo Lee, Shouryadipta Sarkar, Mark V. Albert

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automating draft situation reports roughly halves analyst drafting time (about 2 weeks down to 1 week in the paper's evaluation) and scales monitoring coverage using only open public data, which can lower operating costs for NGOs and public agencies.

Summary TLDR

The paper builds a dynamic Retrieval-Augmented Generation (RAG) pipeline that pulls public data (GDELT, ACLED, ReliefWeb, World Bank), embeds text with MiniLM, indexes with FAISS, and prompts LLMs (GPT-4o, LLaMA 3) to produce situation awareness reports for peacebuilding. Evaluation uses three layers: automated metrics (VERISCORE, SummaC, bias detectors), human expert review (UNDP staff), and LLM-as-a-judge (GPT, LLaMA, Claude). On 60 generated reports the system cuts analyst draft time roughly in half (approx. 2 weeks → 1 week) but still needs human review due to redundancy, source bias, and regional coverage gaps. Code and an adapted VERISCORE tool (ragve) are shared on GitHub.
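The retrieval step described above (embed passages, index them, fetch the nearest ones for a query) can be sketched in a few lines. This is a dependency-free illustration, not the authors' code: a hashed bag-of-words vector stands in for MiniLM embeddings, and a brute-force cosine scan stands in for a FAISS index. The function names, the vector size, and the sample passages are all hypothetical.

```python
import hashlib
import math
import re

DIM = 256  # assumed toy vector size; real MiniLM embeddings are learned, dense vectors


def embed(text):
    """Stand-in embedding: hashed bag-of-words, L2-normalised (NOT MiniLM)."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(passages):
    """Stand-in index: a plain list of (passage, vector) pairs (NOT FAISS)."""
    return [(p, embed(p)) for p in passages]


def search(index, query, k=2):
    """Return the k passages with the highest cosine similarity to the query."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), passage) for passage, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up passages in the style of the four public sources.
passages = [
    "ACLED recorded protest events in the capital during March.",
    "World Bank data shows inflation rising over the last quarter.",
    "ReliefWeb reports flooding displaced families in the northern region.",
]
index = build_index(passages)
top = search(index, "protest events in the capital", k=1)
print(top[0][1])
```

In a real build, `embed` would call a MiniLM sentence encoder and `build_index`/`search` would wrap a FAISS index; the control flow stays the same.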

Problem Statement

Manual situation awareness reports take weeks and must combine diverse, real-time sources. Human work is slow and hard to scale. The authors aim to automate initial report drafts while keeping evidence grounding and human oversight.

Main Contribution

A dynamic RAG pipeline that builds query-specific knowledge bases from GDELT, ACLED, ReliefWeb, and World Bank data.

A three-level evaluation framework: automated NLP metrics, human expert review (UNDP), and LLM-as-a-judge comparisons.
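The metrics listed for this paper include Cohen's Kappa, a chance-corrected agreement score between two raters. As an illustration only, assuming it is applied to paired categorical ratings (for example, a human reviewer and an LLM judge scoring the same reports), it can be computed as below. The rating labels and sample data are made up, and the paper may apply the metric differently.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of six reports by a human reviewer and an LLM judge.
human = ["good", "good", "poor", "good", "poor", "poor"]
judge = ["good", "poor", "poor", "good", "poor", "good"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))  # 0.333
```

Here the raters agree on 4 of 6 items (0.667) against a chance level of 0.5, giving a kappa of about 0.33.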

Key Findings

System produced 60 situation reports from 15 input sets (countries × time windows).

Numbers: 60 reports from 15 input sets

Practical Use: You can generate many draft reports quickly; expect to scale batch creation for multiple countries and periods.

Evidence Ref: Section 4; Evaluation
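The count follows from the evaluation grid given in the Results table: 15 input sets × 2 models × 2 prompts = 60. A small sketch of enumerating that grid as batch jobs is shown here; the input-set IDs and the name of the second prompt are placeholders, not values from the paper.

```python
from itertools import product

# Placeholder identifiers; the paper's input sets are country x time-window pairs.
input_sets = [f"input_set_{i:02d}" for i in range(1, 16)]
models = ["GPT-4o", "LLaMA 3"]
prompts = ["prompt1", "prompt2"]  # "prompt2" is an assumed name

# One generation job per (input set, model, prompt) combination.
jobs = [
    {"input_set": s, "model": m, "prompt": p}
    for s, m, p in product(input_sets, models, prompts)
]
print(len(jobs))  # 60
```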

Automated factuality (VERISCORE) varied by model and prompt; example: LLaMA prompt1 VERISCORE = 0.91 while GPT-4o prompt1 = 0.76.

Numbers: VERISCORE: LLaMA p1 0.91; GPT p1 0.76

Practical Use: Different LLMs and prompts affect measured factuality. Test model+prompt combos before deploying.

Evidence Ref: Table 1
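One simple way to act on this is to keep a score table per model+prompt combination and rank it before choosing a default. The sketch below uses only the two VERISCORE values quoted above; the dictionary layout and variable names are illustrative.

```python
# VERISCORE values quoted in this finding; other combinations are not listed here.
scores = {
    ("LLaMA 3", "prompt1"): 0.91,
    ("GPT-4o", "prompt1"): 0.76,
}

# Rank combinations from highest to lowest measured factuality.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_combo, best_score = ranked[0]
gap = round(best_score - ranked[-1][1], 2)

print(best_combo, best_score)  # ('LLaMA 3', 'prompt1') 0.91
print(gap)  # 0.15
```

A 0.15 gap on a single metric is a reason to test further, not a final ranking; the other evaluation layers may order the models differently.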

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Total reports generated | 60 reports (15 input sets × 2 models × 2 prompts) | - | - | - | Section 4 | Section 4
VERISCORE (average, GPT prompt1) | 0.76 | - | - | Automated Level 1 metrics | Table 1 VERISCORE | Table 1

What To Try In 7 Days

Wire public APIs (GDELT, ACLED, ReliefWeb, World Bank) into a simple fetch+clean pipeline.

Encode text with MiniLM and index vectors with FAISS for fast retrieval.

Prototype a prompt that asks the LLM to cite sources and produce a short structured report section-by-section.
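For the third step, a prompt builder might look like the sketch below. It numbers the retrieved passages, asks for a citation after every claim, and lists the report sections in order. The wording, section names, and record fields are assumptions for illustration and are not the prompts used in the paper.

```python
def build_prompt(country, period, passages, sections):
    """Assemble a citation-demanding prompt from retrieved passages."""
    # Number each passage so the model can cite it as [1], [2], ...
    evidence = "\n".join(
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(passages, start=1)
    )
    outline = "\n".join(f"- {name}" for name in sections)
    return (
        f"You are drafting a situation awareness report on {country} for {period}.\n"
        "Use only the numbered evidence below. After every claim, cite the "
        "evidence number in square brackets, e.g. [2]. If the evidence does not "
        "cover a section, write 'Insufficient evidence' instead of guessing.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Write one short paragraph for each section, in this order:\n{outline}\n"
    )


# Hypothetical retrieved passages and section outline.
passages = [
    {"source": "ACLED", "text": "Protest events were recorded in the capital."},
    {"source": "ReliefWeb", "text": "Flooding displaced families in the north."},
]
sections = ["Security incidents", "Humanitarian situation", "Economic indicators"]
prompt = build_prompt("Country X", "January to March 2025", passages, sections)
print(prompt)
```

Asking for "Insufficient evidence" gives reviewers an explicit marker to look for when the retrieved evidence is thin.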

Risks & Boundaries

Limitations

Source bias and uneven media coverage can skew reports.

LLMs still produce redundant or incomplete sections requiring human cleanup.

When Not To Use

For fully automated high-stakes decisions without human review.

When input data is proprietary or not publicly accessible.

Failure Modes

Hallucinated facts when evidence is weak or missing.

Redundant information repeated across sections (Q4 identified by evaluators).
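Cross-section redundancy is one failure mode that a lightweight pre-review check can flag for the human editor. The sketch below compares sentences across sections by token overlap (Jaccard similarity). It is a hypothetical helper, not part of the paper's pipeline; the 0.8 threshold and the sample report are assumptions.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def jaccard(a, b):
    """Token-set overlap between two sentences, from 0.0 to 1.0."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def find_redundant(report, threshold=0.8):
    """Return (section_a, section_b, sentence) for near-duplicate sentences."""
    flagged = []
    for (name_a, text_a), (name_b, text_b) in combinations(report.items(), 2):
        for sent_a in split_sentences(text_a):
            for sent_b in split_sentences(text_b):
                if jaccard(sent_a, sent_b) >= threshold:
                    flagged.append((name_a, name_b, sent_a))
    return flagged


# Made-up two-section report with one repeated sentence.
report = {
    "Security": "Protests were reported in the capital in March. Police presence increased.",
    "Humanitarian": "Flooding displaced families in the north. Protests were reported in the capital in March.",
}
flagged = find_redundant(report)
print(flagged)
```

Token overlap only catches near-verbatim repeats; paraphrased repetition would need an embedding-based comparison.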

Core Entities

Models

GPT-4o, LLaMA 3, Claude 2, MiniLM

Metrics

VERISCORE, RAG-VERISCORE, SummaC, politicalBiasBERT, Coherence (BERT-based), Cohen's Kappa

Datasets

GDELT, ACLED, ReliefWeb, World Bank (API)