Automate evidence-backed peacebuilding reports with a dynamic RAG pipeline

Overview

Decision Snapshot: Needs Validation

The pipeline is practical and tested on real public sources, but the evaluation is limited to 60 reports and requires human-in-the-loop checks due to redundancy and regional data gaps.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/7

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Poli A. Nemkova, Suleyman O. Polat, Rafid I. Jahan, Sagnik Ray Choudhury, Sun-joo Lee, Shouryadipta Sarkar, Mark V. Albert

Links

Abstract / PDF / Code / Data

Why It Matters For Business

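Automating draft situation reports cuts analyst drafting time roughly in half (about 2 weeks to 1 week in the paper's evaluation) and scales monitoring coverage while using only open public data, which lowers operating costs for NGOs and public agencies. Drafts still need human review before use.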

Who Should Care

Data Scientist, ML Engineer, Product Manager

Summary TLDR

The paper builds a dynamic Retrieval-Augmented Generation (RAG) pipeline that pulls public data (GDELT, ACLED, ReliefWeb, World Bank), embeds text with MiniLM, indexes with FAISS, and prompts LLMs (GPT-4o, LLaMA 3) to produce situation awareness reports for peacebuilding. Evaluation uses three layers: automated metrics (VERISCORE, SummaC, bias detectors), human expert review (UNDP staff), and LLM-as-a-judge (GPT, LLaMA, Claude). On 60 generated reports the system cuts analyst draft time roughly in half (approx. 2 weeks → 1 week) but still needs human review due to redundancy, source bias, and regional coverage gaps. Code and an adapted VERISCORE tool (ragve) are shared on GitHub.

Problem Statement

Manual situation awareness reports take weeks and must combine diverse, real-time sources. Human work is slow and hard to scale. The authors aim to automate initial report drafts while keeping evidence grounding and human oversight.

Main Contribution

A dynamic RAG pipeline that builds query-specific knowledge bases from GDELT, ACLED, ReliefWeb, and World Bank data.

A three-level evaluation framework: automated NLP metrics, human expert review (UNDP), and LLM-as-a-judge comparisons.
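Cohen's Kappa is listed among this paper's metrics, but this summary does not say which rater pairs it was computed for. As a minimal sketch of the statistic itself, assuming two raters assigning categorical labels to the same reports, the ratings below are invented for illustration and the `cohens_kappa` helper is hypothetical:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented ratings for illustration only; not data from the paper.
human = ["pass", "pass", "fail", "pass", "fail", "fail"]
llm_judge = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(human, llm_judge)
print(round(kappa, 3))
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one label dominates the ratings.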

Key Findings

System produced 60 situation reports from 15 input sets (countries × time windows).

Numbers: 60 reports from 15 input sets

Practical Use: You can generate many draft reports quickly; expect to scale batch creation for multiple countries and periods.

Evidence Ref: Section 4; Evaluation
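The count follows from the evaluation grid given in the Results table (15 input sets × 2 models × 2 prompts). A minimal sketch of enumerating that grid for batch generation; the input-set identifiers and the `prompt2` label are placeholders, since this summary does not list them:

```python
from itertools import product

# Placeholder identifiers: the actual country/time-window pairs are not listed in this summary.
input_sets = [f"input_set_{i:02d}" for i in range(1, 16)]  # 15 input sets
models = ["GPT-4o", "LLaMA 3"]                               # the two generator models
prompts = ["prompt1", "prompt2"]                             # "prompt2" is an assumed label

def build_jobs(input_sets, models, prompts):
    """Return one generation job per (input set, model, prompt) combination."""
    return [
        {"input_set": s, "model": m, "prompt": p}
        for s, m, p in product(input_sets, models, prompts)
    ]

jobs = build_jobs(input_sets, models, prompts)
print(len(jobs))  # 15 * 2 * 2 = 60
```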

Automated factuality (VERISCORE) varied by model and prompt; example: LLaMA prompt1 VERISCORE = 0.91 while GPT-4o prompt1 = 0.76.

Numbers: VERISCORE: LLaMA p1 0.91; GPT p1 0.76

Practical Use: Different LLMs and prompts affect measured factuality. Test model+prompt combos before deploying.

Evidence Ref: Table 1
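One way to act on this is to record a score per model and prompt pair and rank the pairs before deployment. The sketch below uses only the two VERISCORE values quoted above; the dictionary layout and the `rank_combos` helper are illustrative assumptions:

```python
# VERISCORE values quoted in this summary (Table 1); other combinations are not listed here.
veriscore = {
    ("LLaMA", "prompt1"): 0.91,
    ("GPT-4o", "prompt1"): 0.76,
}

def rank_combos(scores):
    """Sort (model, prompt) pairs from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = rank_combos(veriscore)
for (model, prompt), score in ranked:
    print(f"{model:8s} {prompt}: {score:.2f}")

best_combo, best_score = ranked[0]
```

Automated factuality is only one of the three evaluation layers, so a ranking like this is a starting point for human review rather than a deployment decision.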

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Total reports generated	60 reports (15 input sets × 2 models × 2 prompts)	—	—	—	Section 4	Section 4
VERISCORE (average, GPT prompt1)	0.76	—	—	Automated Level 1 metrics	Table 1 VERISCORE	Table 1

What To Try In 7 Days

Wire public APIs (GDELT, ACLED, ReliefWeb, World Bank) into a simple fetch+clean pipeline.

Encode text with MiniLM and index vectors with FAISS for fast retrieval.

Prototype a prompt that asks the LLM to cite sources and produce a short structured report section-by-section.
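The three steps above can be sketched end to end with the standard library only. This is a toy stand-in, not the paper's implementation: the hashed bag-of-words `embed` function stands in for MiniLM, the brute-force `top_k` loop stands in for a FAISS index, and the passages, section names, and prompt wording are invented for illustration.

```python
import math
import re
import zlib

DIM = 256  # toy embedding size; a MiniLM model would supply real sentence vectors

def embed(text):
    """Toy stand-in for MiniLM: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def top_k(query, docs, k=2):
    """Brute-force inner-product search; a FAISS index would replace this loop."""
    q = embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, embed(d["text"]))), d)
        for d in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

def build_prompt(query, retrieved):
    """Ask for a structured, section-by-section report with numbered citations."""
    sources = "\n".join(
        f"[{i}] ({d['source']}) {d['text']}" for i, d in enumerate(retrieved, 1)
    )
    return (
        "Use only the sources below. Cite each claim as [n].\n"
        f"Sources:\n{sources}\n\n"
        f"Task: write a short situation report on: {query}\n"
        "Sections: Overview, Recent Events, Humanitarian Impact."  # hypothetical sections
    )

# Invented example passages for illustration only; not real records from these sources.
docs = [
    {"source": "ACLED", "text": "Armed clashes reported near the northern border town."},
    {"source": "ReliefWeb", "text": "Flooding displaced families in the river delta."},
    {"source": "World Bank", "text": "Inflation rose while food imports declined."},
]

hits = top_k("armed clashes near the border", docs, k=2)
prompt = build_prompt("armed clashes near the border", hits)
print(prompt)
```

Numbering the retrieved passages in the prompt gives a reviewer a direct way to check each cited claim against its source.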

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/withheld-for-anonymity

Data URLs

https://www.gdeltproject.org/

https://acleddata.com/

https://reliefweb.int/

https://pypi.org/project/wbgapi/

Risks & Boundaries

Limitations

Source bias and uneven media coverage can skew reports.

LLMs still produce redundant or incomplete sections requiring human cleanup.

When Not To Use

For fully automated high-stakes decisions without human review.

When input data is proprietary or not publicly accessible.

Failure Modes

Hallucinated facts when evidence is weak or missing.

Redundant information repeated across sections (Q4 identified by evaluators).
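Redundancy across sections is the kind of failure that a cheap pre-check can surface before human review. A minimal sketch, assuming a report is a mapping from section name to text; the word-level Jaccard measure, the threshold, and the sample report are illustrative choices, not the evaluators' method:

```python
import re
from itertools import combinations

def sentences(text):
    """Split text into sentences on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def jaccard(a, b):
    """Word-level Jaccard similarity between two sentences."""
    wa = set(re.findall(r"[a-z0-9]+", a.lower()))
    wb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def find_redundant(report, threshold=0.8):
    """Return (section_a, section_b, sentence_a, sentence_b) for near-duplicates across sections."""
    flagged = []
    for (name_a, text_a), (name_b, text_b) in combinations(report.items(), 2):
        for sa in sentences(text_a):
            for sb in sentences(text_b):
                if jaccard(sa, sb) >= threshold:
                    flagged.append((name_a, name_b, sa, sb))
    return flagged

# Invented report text for illustration only.
report = {
    "Overview": "Clashes displaced thousands of residents. Aid access remains limited.",
    "Humanitarian Impact": "Clashes displaced thousands of residents this month. Food prices rose.",
}
flagged = find_redundant(report, threshold=0.6)
for name_a, name_b, sa, sb in flagged:
    print(f"{name_a} / {name_b}: {sa!r} ~ {sb!r}")
```

Flagged pairs are candidates for the human cleanup step; the threshold would need tuning on real reports.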

Core Entities

Models

GPT-4o, LLaMA 3, Claude 2, MiniLM

Metrics

VERISCORE, RAG-VERISCORE, SummaC, politicalBiasBERT, Coherence (BERT-based), Cohen's Kappa

Datasets

GDELT, ACLED, ReliefWeb, World Bank (API)

