A practical RAG-based pipeline to turn Kenya's primary-care guidelines into a living LLM benchmark and reasoning stress-tests

July 19, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The work provides a clear, practical pipeline (novel for Kenyan primary care). It is methodologically strong but remains a proof-of-concept: dataset and scripts are not yet fully open, and no full model benchmarking results are published here.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 70%

Authors

Fred Mutisya, Shikoh Gitau, Christine Syovata, Diana Oigara, Ibrahim Matende, Muna Aden, Munira Ali, Ryan Nyotu, Diana Marion, Job Nyangena, Nasubo Ongoma, Keith Mbae, Elizabeth Wamicha, Eric Mibuari, Jean Philbert Nsengemana, Talkmore Chidede

Why It Matters For Business

Regulators and health-tech vendors can use a guideline-anchored benchmark to audit whether an AI follows local standards, lowering safety risk and speeding approval. The pipeline also creates study material for training and a versioned asset that reduces regulatory friction.

Summary TLDR

The authors present a reproducible pipeline that converts Kenya's Level 2–3 clinical guidelines into a dynamic QA benchmark (English and Kiswahili) using retrieval-augmented generation (RAG). They pair automated question generation with blinded clinician validation and introduce five bespoke evaluation metrics that test reasoning, rare-clue detection, contextual adaptation, patient-persona fidelity, and cognitive-bias resilience. The dataset and metrics are designed to help regulators and implementers audit whether an LLM follows the local standard of care.

Problem Statement

Existing medical LLM benchmarks are often rooted in Western curricula and miss local protocols, resources, language, and epidemiology. That gap makes it hard to evaluate, and unsafe to deploy, models meant for African primary care. The paper addresses this by building guideline-anchored questions and new stress tests that probe local reasoning and safety.

Main Contribution

A reproducible pipeline to digitize, chunk, and index Kenya's national Level 2–3 clinical guidelines and tie each QA item to source lines.

A RAG-based question-generation workflow that produced a large English QA set, about 10% of which was translated into Kiswahili, followed by blinded clinician validation using a 5-point rubric.
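The chunk-and-trace idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Chunk` class, the `chunk_lines` helper, the word budget, and the placeholder text are all assumptions made for the example. It groups guideline lines into chunks and records the source line range so that a QA item can point back to it.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    version: str     # guideline edition tag (hypothetical field)


def chunk_lines(lines, max_words=120, version="2024"):
    """Group consecutive non-blank lines into chunks of at most max_words words.

    A single line longer than max_words becomes its own chunk.
    """
    chunks, buf, count, start, last = [], [], 0, None, None
    for lineno, line in enumerate(lines, start=1):
        n_words = len(line.split())
        if n_words == 0:
            continue  # skip blank lines but keep the original line numbering
        if buf and count + n_words > max_words:
            chunks.append(Chunk(f"c{len(chunks) + 1:04d}", " ".join(buf), start, last, version))
            buf, count = [], 0
        if not buf:
            start = lineno
        buf.append(line.strip())
        count += n_words
        last = lineno
    if buf:
        chunks.append(Chunk(f"c{len(chunks) + 1:04d}", " ".join(buf), start, last, version))
    return chunks


# Placeholder text only; real input would be the digitized guideline.
guideline_lines = ["alpha beta gamma", "", "delta epsilon", "zeta eta theta iota"]
chunks = chunk_lines(guideline_lines, max_words=5)

# A QA item carries a pointer to the chunk and line range it was generated from.
qa_item = {
    "question": "(generated question text)",
    "answer": "(generated answer text)",
    "source": {
        "chunk_id": chunks[0].chunk_id,
        "lines": (chunks[0].start_line, chunks[0].end_line),
    },
}
```

Keeping the line range on every item is what makes a later audit cheap: a reviewer can open the guideline at the cited lines instead of searching for the supporting passage.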

Key Findings

The Kenya guideline used spans 636 pages and ~416,000 words, yielding 1,115 semantically indexed chunks.

Numbers: 636 pages; ~416,000 words; 1,115 chunks

Practical Use: Indexing the full guideline enables traceable question–answer pairs and makes automated retrieval and targeted updates feasible.

Evidence Ref: Section 2.2 Knowledge Base Construction
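As a rough sanity check on these figures, the averages they imply can be computed directly. The calculation assumes non-overlapping chunks and treats the approximate word count as exact, so the results are indicative only; the variable names are chosen for this example.

```python
pages = 636
words = 416_000   # reported as approximate
chunks = 1_115

words_per_chunk = words / chunks   # average chunk length
words_per_page = words / pages     # average page density
chunks_per_page = chunks / pages   # average chunks per page

print(round(words_per_chunk), round(words_per_page), round(chunks_per_page, 2))
# prints: 373 654 1.75
```

An average of roughly 370 words per chunk means a prompt holding a handful of retrieved chunks stays small, which fits the stated aim of limiting prompt size.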

The pipeline produced a benchmark dataset in English with thousands of Q-A items and translated ~10% into Kiswahili.

Numbers: dataset size: 'thousands'; Kiswahili ≈ 10%

Practical Use: You can evaluate models on both English and local-language clinical prompts; translate a focused subset for patient-facing tasks to save effort.

Evidence Ref: Abstract; Section 2.1 Co-Creation; Dataset characteristics
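One way to pick a subset of about 10% for translation is a seeded, per-topic sample, sketched below. The summary does not say how the authors chose their Kiswahili subset, so the `topic` field, the `pick_translation_subset` helper, and the stratification itself are assumptions for illustration.

```python
import random
from collections import defaultdict


def pick_translation_subset(items, fraction=0.10, seed=0):
    """Sample roughly `fraction` of items from every topic, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the subset can be regenerated
    by_topic = defaultdict(list)
    for item in items:
        by_topic[item["topic"]].append(item)

    subset = []
    for topic in sorted(by_topic):  # sorted for a stable order across runs
        group = by_topic[topic]
        k = max(1, round(fraction * len(group)))  # at least one item per topic
        subset.extend(rng.sample(group, k))
    return subset


# Placeholder items; topic names are hypothetical.
items = (
    [{"qa_id": f"a{i}", "topic": "topic_a"} for i in range(30)]
    + [{"qa_id": f"b{i}", "topic": "topic_b"} for i in range(20)]
    + [{"qa_id": f"c{i}", "topic": "topic_c"} for i in range(50)]
)
subset = pick_translation_subset(items)
```

Sampling within each topic keeps small topics from disappearing from the translated set, which a plain random draw over the whole pool would not guarantee.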

What To Try In 7 Days

Digitize one local guideline section and chunk it into retrievable snippets.

Run a small RAG loop: feed a guideline chunk to a chosen LLM and generate 50 MCQs with citations, then have 2 clinicians blind-review them.

Run a single Decision-Points test on 10 vignettes to see whether a model asks guideline-critical questions.
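The small RAG loop and blind review in the steps above could be scaffolded as in the sketch below. It deliberately stops short of calling any LLM API. The prompt wording, the `build_prompt`, `blind_items`, and `mean_rating` helpers, and the reading of "blinded" as hiding the generating model from reviewers are all assumptions for this example, not details confirmed by the paper.

```python
import random

PROMPT_TEMPLATE = (
    "Using ONLY the guideline excerpt below, write {n} multiple-choice questions.\n"
    "For each, give four options, the correct answer, and cite the excerpt ID.\n\n"
    "Excerpt ID: {chunk_id}\n"
    "Excerpt:\n{text}\n"
)


def build_prompt(chunk_id, text, n=50):
    """Fill the generation prompt for one guideline chunk."""
    return PROMPT_TEMPLATE.format(n=n, chunk_id=chunk_id, text=text)


def blind_items(items, seed=0):
    """Shuffle items and strip provenance.

    Returns (packet for reviewers, private key mapping review IDs to provenance).
    """
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)
    packet, key = [], {}
    for position, idx in enumerate(order, start=1):
        review_id = f"R{position:03d}"
        item = items[idx]
        packet.append(
            {"review_id": review_id, "question": item["question"], "answer": item["answer"]}
        )
        key[review_id] = {"generator": item["generator"], "chunk_id": item["chunk_id"]}
    return packet, key


def mean_rating(ratings):
    """Average of reviewers' scores on the 5-point rubric."""
    return sum(ratings) / len(ratings)


# Placeholder items standing in for LLM output.
generated = [
    {"question": "Q1", "answer": "A1", "generator": "model_x", "chunk_id": "c0001"},
    {"question": "Q2", "answer": "A2", "generator": "model_y", "chunk_id": "c0001"},
    {"question": "Q3", "answer": "A3", "generator": "model_x", "chunk_id": "c0002"},
]
prompt = build_prompt("c0001", "(guideline excerpt text)")
packet, key = blind_items(generated, seed=1)
```

The packet is what goes to the two clinicians; the key stays with whoever runs the study, so ratings can be joined back to the model and chunk after review.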

Agent Features

Memory
Indexed guideline chunks with metadata and version tags (retrieval memory)
Planning
Pipeline orchestration for RAG generation and validation
Tool Use
Mistral-OCR for algorithm extraction; BM25 and vector search for retrieval; Kobo Collect for blinded clinician review; LLM APIs for generation (Gemini/GPT-4o)
Frameworks
Retrieval-augmented generation pipeline
Collaboration
Human-in-the-loop co-creation and blinded expert validation

Optimization Features

Token Efficiency
Chunking and metadata tags to limit prompt size
System Optimization
Hybrid retrieval to reduce irrelevant context in prompts
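Hybrid retrieval of the kind listed here can be sketched in pure Python. The summary names BM25 and vector search but not how the two are combined, so the reciprocal rank fusion step below is a substituted, commonly used technique, and the bag-of-words cosine is only a stand-in for a real embedding model. All function names and the sample texts are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query."""
    toks = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(t) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query, docs):
    """Bag-of-words cosine similarity (assumption: replaces dense embeddings)."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    scores = []
    for d in docs:
        c = Counter(tokenize(d))
        d_norm = math.sqrt(sum(v * v for v in c.values()))
        dot = sum(q[t] * c[t] for t in q)
        scores.append(dot / (q_norm * d_norm) if q_norm and d_norm else 0.0)
    return scores


def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over the input rankings."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [d for d, _ in sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))]


def hybrid_search(query, docs, top_k=2):
    def ranked(scores):
        return sorted(range(len(docs)), key=lambda i: (-scores[i], i))

    fused = rrf([ranked(bm25_scores(query, docs)), ranked(cosine_scores(query, docs))])
    return fused[:top_k]


# Placeholder chunk texts, not guideline content.
docs = [
    "malaria testing and treatment steps",
    "antenatal care visit schedule",
    "childhood immunisation schedule and catch-up",
]
top = hybrid_search("immunisation schedule", docs)
print(top)  # prints: [2, 1]
```

Only the `top_k` chunk indices are passed on to the prompt, which is how retrieval keeps irrelevant context out of the generation step.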

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmark quality depends on guideline accuracy; outdated or incomplete guidelines propagate errors.

Current work focuses on Kenya Level 2–3 primary care and is not suited to tertiary specialist practice.

When Not To Use

When certifying models for tertiary-care or specialist workflows, which require different guidelines.

When up-to-date local guideline documents are unavailable or untrustworthy.

Failure Modes

Propagating outdated or incorrect guideline recommendations into model outputs.

Hallucinations if retrieval fails or incorrect chunks are fed to the LLM.
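One possible guard against the first failure mode is to fingerprint each source chunk when a question is generated and re-check it whenever the guideline is re-indexed. The sketch below is an assumed design, not something described in the paper: the `chunk_hash` field, the `find_stale_items` helper, and the sample data are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Stable SHA-256 fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_items(qa_items, current_chunks):
    """Flag QA items whose source chunk was removed or edited.

    qa_items: dicts with 'qa_id', 'chunk_id', 'chunk_hash' (stored at generation time).
    current_chunks: {chunk_id: text} from the latest guideline index.
    """
    stale = []
    for item in qa_items:
        text = current_chunks.get(item["chunk_id"])
        if text is None:
            stale.append((item["qa_id"], "source chunk removed"))
        elif fingerprint(text) != item["chunk_hash"]:
            stale.append((item["qa_id"], "source chunk changed"))
    return stale


# Placeholder data: one unchanged chunk, one edited, one removed.
old_chunks = {"c0001": "text one", "c0002": "text two", "c0003": "text three"}
qa_items = [
    {"qa_id": f"q{i}", "chunk_id": cid, "chunk_hash": fingerprint(text)}
    for i, (cid, text) in enumerate(old_chunks.items(), start=1)
]
new_chunks = {"c0001": "text one", "c0002": "text two (revised)"}

stale = find_stale_items(qa_items, new_chunks)
print(stale)
# prints: [('q2', 'source chunk changed'), ('q3', 'source chunk removed')]
```

Flagged items can then be regenerated or sent back for clinician review, rather than rebuilding the whole benchmark after every guideline revision.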

Core Entities

Models

Gemini Flash 2.0 Lite, GPT-4o mini, LLaMA-3.1 (8B)

Metrics

Decision-Points Score, Needle-in-the-Haystack Score, Reverse-QA Persona Score, Context Adaptation Score (CAS), Cognitive-Bias Stress Test (CBST)

Datasets

Alama Health QA dataset (Kenya Level 2–3 primary care); Kenya MoH Clinical Guidelines (Level 2–3, 2024)

Benchmarks

AfriMed-QA, HealthBench, MedMCQA, MultiMedQA

Context Entities

Datasets

WHO guidelines / national EPI schedules (used for context checks); AfriMed-QA (comparison)