A practical RAG-based pipeline to turn Kenya's primary-care guidelines into a living LLM benchmark and reasoning stress-tests

July 19, 2025 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Fred Mutisya, Shikoh Gitau, Christine Syovata, Diana Oigara, Ibrahim Matende, Muna Aden, Munira Ali, Ryan Nyotu, Diana Marion, Job Nyangena, Nasubo Ongoma, Keith Mbae, Elizabeth Wamicha, Eric Mibuari, Jean Philbert Nsengemana, Talkmore Chidede

Why It Matters For Business

Regulators and health-tech vendors can use a guideline-anchored benchmark to audit whether an AI follows local standards, lowering safety risk and speeding approval. The pipeline also creates study material for training and a versioned asset that reduces regulatory friction.

Summary TLDR

The authors present a reproducible pipeline that converts Kenya's Level 2–3 clinical guidelines into a dynamic QA benchmark (English + Kiswahili) using retrieval-augmented generation (RAG). They pair automated question generation with blinded clinician validation and introduce five bespoke evaluation metrics that test reasoning, rare-clue detection, contextual adaptation, patient-persona fidelity, and cognitive-bias resilience. The dataset and metrics are designed to help regulators and implementers audit whether an LLM follows the local standard of care.

Problem Statement

Existing medical LLM benchmarks are often rooted in Western curricula and miss local protocols, resources, language, and epidemiology. That gap makes models meant for African primary care hard to evaluate and unsafe to deploy without further checks. The paper addresses this by building guideline-anchored questions and new stress tests that probe local reasoning and safety.

Main Contribution

A reproducible pipeline to digitize, chunk, and index Kenya's national Level 2–3 clinical guidelines and tie each QA item to source lines.

A RAG-based question-generation workflow that produced a large English QA set, with ~10% of items translated into Kiswahili, followed by blinded clinician validation using a 5-point rubric.

A suite of five novel evaluation metrics for LLMs: Decision-Points, Needle-in-the-Haystack, Reverse QA (simulated patient), Geographic-Contextual Adaptation, and Cognitive-Bias Stress Test.

Practical tooling choices and automation (Mistral-OCR, hybrid BM25 + vector index, Kobo platform) that reduce manual effort and enable versioning and updates.
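
The "chunk and tie each QA item to source lines" idea can be illustrated with a minimal sketch. It assumes the guideline has already been converted to plain-text lines (for example after OCR). The `Chunk` class, its field names, the `chunk_lines` function, and the greedy word-count limit are all hypothetical stand-ins chosen for illustration; the paper describes semantically indexed chunks, which this simple splitter does not reproduce.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str    # hypothetical ID scheme: "<version>-<running number>"
    text: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    version: str     # guideline version tag, so items can be re-checked after updates


def chunk_lines(lines, max_words=120, version="v1"):
    """Greedily group consecutive lines into chunks of at most max_words words.

    A single line longer than max_words becomes its own chunk.
    """
    chunks = []
    buf, start, words = [], 1, 0
    for lineno, line in enumerate(lines, start=1):
        n = len(line.split())
        # Flush the buffer before it would exceed the word limit.
        if buf and words + n > max_words:
            chunks.append(
                Chunk(f"{version}-{len(chunks):04d}", "\n".join(buf), start, lineno - 1, version)
            )
            buf, words = [], 0
        if not buf:
            start = lineno
        buf.append(line)
        words += n
    if buf:
        chunks.append(
            Chunk(f"{version}-{len(chunks):04d}", "\n".join(buf), start, len(lines), version)
        )
    return chunks


def make_qa_item(question, answer, chunk):
    """Attach provenance so a reviewer can trace the item back to the guideline."""
    return {
        "question": question,
        "answer": answer,
        "source": {
            "chunk_id": chunk.chunk_id,
            "lines": (chunk.start_line, chunk.end_line),
            "version": chunk.version,
        },
    }
```

Storing the line span and version with every item is what would let a benchmark be regenerated or re-validated when the guideline changes, since stale items can be found by version tag.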

Key Findings

The Kenya guideline used spans 636 pages and ~416,000 words, yielding 1,115 semantically indexed chunks.

Numbers: 636 pages; ~416,000 words; 1,115 chunks

The pipeline produced a benchmark dataset in English with thousands of Q-A items and translated ~10% into Kiswahili.

Numbers: dataset size: 'thousands'; Kiswahili ≈ 10%

Automated OCR and parsing (Mistral-OCR) cut manual entry time by over 80% during flow-chart digitization.

Numbers: >80% reduction in manual entry time

Generation was tested on GPT-4o mini, Gemini Flash 2.0 Lite, and LLaMA-3.1 (8B); Gemini Flash 2.0 Lite gave the best trade-off between guideline adherence and creativity.

Five new, concrete evaluation metrics were introduced to probe steps of clinical reasoning, rare-clue detection, persona fidelity, geography-aware advice, and cognitive-bias resilience.

Numbers: 5 named metrics

Who Should Care

What To Try In 7 Days

Digitize one local guideline section and chunk it into retrievable snippets.

Run a small RAG loop: feed a guideline chunk to a chosen LLM and generate 50 MCQs with citations, then have 2 clinicians blind-review them.

Run a single Decision-Points test on 10 vignettes to see whether a model asks guideline-critical questions.
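
For the blind-review step, one possible way to combine two clinicians' scores on a 5-point rubric is sketched below. The function name `triage_items`, the acceptance threshold, and the disagreement rule are assumptions made for illustration; the source does not state how reviewer scores were aggregated.

```python
from statistics import mean


def triage_items(ratings, accept_mean=4.0, max_gap=1):
    """Sort reviewed items into accept / revise / adjudicate.

    ratings: {item_id: [score, score, ...]} with each score on a 1..5 rubric.
    accept_mean and max_gap are hypothetical thresholds, not values from the paper.
    """
    decisions = {}
    for item_id, scores in ratings.items():
        if not scores or any(not 1 <= s <= 5 for s in scores):
            raise ValueError(f"invalid rubric scores for {item_id!r}: {scores}")
        gap = max(scores) - min(scores)
        if gap > max_gap:
            # Reviewers disagree strongly: send to a third reviewer.
            decisions[item_id] = "adjudicate"
        elif mean(scores) >= accept_mean:
            decisions[item_id] = "accept"
        else:
            decisions[item_id] = "revise"
    return decisions
```

Checking disagreement before the mean keeps a 5 and a 2 from averaging into a quiet "revise" when the real issue is that the two reviewers read the item differently.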

Agent Features

Memory

  • Indexed guideline chunks with metadata and version tags (retrieval memory)

Planning

  • Pipeline orchestration for RAG generation and validation

Tool Use

  • Mistral-OCR for algorithm extraction
  • BM25 and vector search for retrieval
  • Kobo Collect for blinded clinician review
  • LLM APIs for generation (Gemini/GPT-4o)
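
A hybrid of BM25 and vector search can be sketched in plain Python as follows. This is a toy under stated assumptions: the cosine over raw term counts is only a stand-in for a real embedding index, reciprocal rank fusion is one common way to merge two rankings and is not confirmed as the authors' method, and all function names and parameter defaults are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every doc against the query."""
    if not docs:
        return []
    toks = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = (sum(len(t) for t in toks) / n) or 1.0
    # Document frequency: in how many docs each term appears.
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores


def cosine_scores(query, docs):
    """Stand-in for a dense vector index: cosine similarity over term counts."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(x * x for x in q.values()))
    out = []
    for d in docs:
        v = Counter(tokenize(d))
        dot = sum(q[t] * v[t] for t in q)
        norm = q_norm * math.sqrt(sum(x * x for x in v.values()))
        out.append(dot / norm if norm else 0.0)
    return out


def rank(scores):
    """Doc indices sorted by descending score; ties keep original order."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def hybrid_search(query, docs, top_k=3, rrf_k=60):
    """Fuse the two rankings with reciprocal rank fusion and return top_k indices."""
    fused = [0.0] * len(docs)
    for scores in (bm25_scores(query, docs), cosine_scores(query, docs)):
        for position, idx in enumerate(rank(scores), start=1):
            fused[idx] += 1.0 / (rrf_k + position)
    return rank(fused)[:top_k]
```

Rank fusion avoids having to put BM25 scores and cosine similarities on a common scale, which is the main reason it is a convenient default for a first prototype.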

Frameworks

  • Retrieval-augmented generation pipeline

Collaboration

  • Human-in-the-loop co-creation and blinded expert validation

Optimization Features

Token Efficiency

  • Chunking and metadata tags to limit prompt size
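
As a rough illustration of how metadata tags and a size budget could limit prompt size, the sketch below filters ranked chunks by tag and packs them until a budget is reached. Word count is used as a crude proxy for tokens, and the `pack_context` name, the dict layout, and the tag values are hypothetical.

```python
def pack_context(chunks, budget_words, required_tags=frozenset()):
    """Select chunks, in the given rank order, that carry all required tags
    and fit within budget_words.

    chunks: list of dicts with "text" (str) and "tags" (iterable of str).
    Returns (selected_chunks, words_used).
    """
    selected, used = [], 0
    for chunk in chunks:
        if not set(required_tags) <= set(chunk["tags"]):
            continue  # wrong section or population: never reaches the prompt
        cost = len(chunk["text"].split())
        if used + cost > budget_words:
            continue  # too large for the remaining budget; a later, smaller chunk may fit
        selected.append(chunk)
        used += cost
    return selected, used
```

A real deployment would count tokens with the target model's tokenizer instead of splitting on whitespace.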

System Optimization

  • Hybrid retrieval to reduce irrelevant context in prompts

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmark quality depends on guideline accuracy; outdated or incomplete guidelines propagate errors.
  • Current work focuses on Kenya Level 2–3 primary care and is not suited to tertiary specialist practice.
  • Some automated steps (OCR, synthetic vignette generation) still need human review to catch edge-case errors.
  • Custom composite metrics and weighting choices add subjectivity and need community validation.
  • For this proof-of-concept, the explicit retrieval step was skipped when generating the initial QA set, possibly reducing retrieval realism.

When Not To Use

  • To certify models for tertiary-care or specialist workflows that require different guidelines.
  • When up-to-date local guideline documents are unavailable or untrustworthy.
  • As the sole safety check for deployed patient-facing systems without live clinician oversight.

Failure Modes

  • Propagating outdated or incorrect guideline recommendations into model outputs.
  • Hallucinations if retrieval fails or incorrect chunks are fed to the LLM.
  • Context-insensitive answers (generic plans that ignore local resources or policy).
  • Translation inaccuracies in Kiswahili versions that alter clinical meaning.
  • Reviewer bias or inconsistency despite blinded review.

Core Entities

Models

  • Gemini Flash 2.0 Lite
  • GPT-4o mini
  • LLaMA-3.1 (8B)

Metrics

  • Decision-Points Score
  • Needle-in-the-Haystack Score
  • Reverse-QA Persona Score
  • Context Adaptation Score (CAS)
  • Cognitive-Bias Stress Test (CBST)

Datasets

  • Alama Health QA dataset (Kenya Level 2–3 primary care)
  • Kenya MoH Clinical Guidelines (Level 2–3, 2024)

Benchmarks

  • AfriMed-QA
  • HealthBench
  • MedMCQA
  • MultiMedQA

Context Entities

Datasets

  • WHO guidelines / national EPI schedules (used for context checks)
  • AfriMed-QA (comparison)