Overview
The pipeline is practical and tested on real public sources, but the evaluation is limited to 60 reports and requires human-in-the-loop checks due to redundancy and regional data gaps.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.78
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/7
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Automating draft situation reports roughly halves analyst drafting time (approx. 2 weeks → 1 week) and scales monitoring coverage using only open public data, which lowers operating costs for NGOs and public agencies. Drafts still need human review before use.
Who Should Care
Summary TLDR
The paper builds a dynamic Retrieval-Augmented Generation (RAG) pipeline that pulls public data (GDELT, ACLED, ReliefWeb, World Bank), embeds text with MiniLM, indexes it with FAISS, and prompts LLMs (GPT-4o, LLaMA 3) to produce situation awareness reports for peacebuilding. Evaluation uses three layers: automated metrics (VERISCORE, SummaC, bias detectors), human expert review (UNDP staff), and LLM-as-a-judge (GPT, LLaMA, Claude). Across 60 generated reports, the system cuts analyst drafting time roughly in half (approx. 2 weeks → 1 week), but the output still needs human review because of redundancy, source bias, and regional coverage gaps. The code and an adapted VERISCORE tool (ragve) are shared on GitHub.
Problem Statement
Manual situation awareness reports take weeks and must combine diverse, real-time sources. Human work is slow and hard to scale. The authors aim to automate initial report drafts while keeping evidence grounding and human oversight.
Main Contribution
A dynamic RAG pipeline that builds query-specific knowledge bases from GDELT, ACLED, ReliefWeb, and World Bank data.
A three-level evaluation framework: automated NLP metrics, human expert review (UNDP), and LLM-as-a-judge comparisons.
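The retrieval step behind a query-specific knowledge base can be sketched as follows. This is a minimal illustration, not the authors' implementation: to stay dependency-free it swaps in a bag-of-words vector for the MiniLM embedding and a brute-force cosine scan for the FAISS index. The function names (`embed`, `build_knowledge_base`, `retrieve`) and the sample records are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: bag-of-words term counts (an assumption, NOT MiniLM)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_knowledge_base(documents):
    """Pair each fetched record with its vector (brute-force stand-in for a FAISS index)."""
    return [(doc, embed(doc["text"])) for doc in documents]


def retrieve(kb, query, k=2):
    """Return the k records most similar to the query."""
    q = embed(query)
    ranked = sorted(kb, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


# Hypothetical snippets standing in for records fetched from the public sources.
documents = [
    {"source": "ACLED", "text": "Armed clashes reported near the northern border region"},
    {"source": "ReliefWeb", "text": "Flooding displaced families and disrupted food aid delivery"},
    {"source": "World Bank", "text": "Inflation and food prices rose during the last quarter"},
]

kb = build_knowledge_base(documents)
top = retrieve(kb, "food aid disruption after flooding", k=1)
print(top[0]["source"])
```

In a fuller prototype, `embed` would be replaced by a sentence-embedding model and the sorted scan by a vector index. The knowledge base is rebuilt per query, which is what makes the pipeline "dynamic".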
Key Findings
System produced 60 situation reports from 15 input sets (countries × time windows).
Automated factuality (VERISCORE) varied by model and prompt. For example, LLaMA with prompt1 scored 0.91, while GPT-4o with prompt1 scored 0.76.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Total reports generated | 60 reports (15 input sets × 2 models × 2 prompts) | — | — | — | Section 4 | Section 4 |
| VERISCORE (average, GPT prompt1) | 0.76 | — | — | Automated Level 1 metrics | Table 1 VERISCORE | Table 1 |
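The report count in the first table row follows from the experimental grid, which the short sketch below enumerates. The input-set labels and the name `prompt2` are placeholders assumed for illustration. The source states only the counts, the two model names, and `prompt1`.

```python
from itertools import product

# Placeholder labels (assumptions): the source gives counts, not identifiers.
input_sets = [f"input_set_{i:02d}" for i in range(1, 16)]  # 15 country x time-window sets
models = ["GPT-4o", "LLaMA 3"]
prompts = ["prompt1", "prompt2"]

# One generated report per (input set, model, prompt) combination.
runs = list(product(input_sets, models, prompts))
print(len(runs))
```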
What To Try In 7 Days
Wire public APIs (GDELT, ACLED, ReliefWeb, World Bank) into a simple fetch+clean pipeline.
Encode text with MiniLM and index vectors with FAISS for fast retrieval.
Prototype a prompt that asks the LLM to cite sources and produce a short structured report section-by-section.
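The prompt-prototyping step above could start from something like the sketch below, which numbers the retrieved evidence and asks for inline citations and fixed sections. The wording, the function name `build_prompt`, the section names, and the sample evidence are all assumptions for illustration. The paper's actual prompts are not reproduced here.

```python
def build_prompt(country, period, sections, evidence):
    """Assemble a prompt requesting a sectioned report with [n]-style citations."""
    numbered = [
        f"[{i}] ({item['source']}) {item['text']}"
        for i, item in enumerate(evidence, start=1)
    ]
    lines = [
        f"Write a situation awareness report for {country} covering {period}.",
        "Use only the evidence listed below and cite it inline as [n].",
        "If the evidence does not support a claim, say so instead of guessing.",
        "",
        "Evidence:",
        *numbered,
        "",
        "Produce these sections in order, each under its own heading:",
        *[f"- {name}" for name in sections],
    ]
    return "\n".join(lines)


# Hypothetical inputs; real evidence would come from the retrieval step.
evidence = [
    {"source": "ACLED", "text": "Armed clashes reported near the northern border region"},
    {"source": "ReliefWeb", "text": "Flooding displaced families and disrupted food aid delivery"},
]
prompt = build_prompt(
    "Country X",
    "the last 30 days",
    ["Security", "Humanitarian situation", "Economy"],
    evidence,
)
print(prompt)
```

Numbering the evidence makes it easier for a reviewer to trace each cited claim back to its source record.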
Risks & Boundaries
Limitations
Source bias and uneven media coverage can skew reports.
LLMs still produce redundant or incomplete sections requiring human cleanup.
When Not To Use
For fully automated high-stakes decisions without human review.
When input data is proprietary or not publicly accessible.
Failure Modes
Hallucinated facts when evidence is weak or missing.
Redundant information repeated across sections (Q4 identified by evaluators).
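One simple way to surface the cross-section redundancy listed above before human review is to compare sentences from different sections by token overlap. The sketch below uses Jaccard similarity with an assumed threshold of 0.7. It is an illustrative check, not part of the paper's evaluation, and the sample report text is invented.

```python
import re
from itertools import combinations


def sentences(text):
    """Split text into sentences on terminal punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_redundancy(report, threshold=0.7):
    """Flag sentence pairs from different sections whose overlap meets the threshold."""
    flagged = []
    for (name_a, text_a), (name_b, text_b) in combinations(report.items(), 2):
        for sa in sentences(text_a):
            for sb in sentences(text_b):
                score = jaccard(tokens(sa), tokens(sb))
                if score >= threshold:
                    flagged.append((name_a, name_b, sa, round(score, 2)))
    return flagged


# Invented report text for illustration only.
report = {
    "Security": "Clashes displaced 2,000 people in the north. Patrols have increased.",
    "Humanitarian": "In the north, clashes displaced 2,000 people. Aid convoys were delayed.",
}
flags = find_redundancy(report)
for section_a, section_b, sentence, score in flags:
    print(section_a, section_b, score, sentence)
```

Token overlap only catches near-verbatim repetition. Paraphrased repetition would need an embedding-based comparison, and flagged pairs are best treated as hints for the reviewer.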

