A practical RAG-based pipeline to turn Kenya's primary-care guidelines into a living LLM benchmark and reasoning stress-tests

Overview

Decision Snapshot: Needs Validation

The work provides a clear, practical pipeline (novel for Kenyan primary care). It is methodologically strong but remains a proof-of-concept: dataset and scripts are not yet fully open, and no full model benchmarking results are published here.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 70%

Authors

Fred Mutisya, Shikoh Gitau, Christine Syovata, Diana Oigara, Ibrahim Matende, Muna Aden, Munira Ali, Ryan Nyotu, Diana Marion, Job Nyangena, Nasubo Ongoma, Keith Mbae, Elizabeth Wamicha, Eric Mibuari, Jean Philbert Nsengemana, Talkmore Chidede

Links

Abstract / PDF

Why It Matters For Business

Regulators and health-tech vendors can use a guideline-anchored benchmark to audit whether an AI follows local standards, lowering safety risk and speeding approval. The pipeline also creates study material for training and a versioned asset that reduces regulatory friction.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

The authors present a reproducible pipeline that converts Kenya's Level 2–3 clinical guidelines into a dynamic QA benchmark (English and Kiswahili) using retrieval-augmented generation (RAG). They pair automated question generation with blinded clinician validation and introduce five bespoke evaluation metrics that test reasoning, rare-clue detection, contextual adaptation, patient-persona fidelity, and cognitive-bias resilience. The dataset and metrics are designed to help regulators and implementers audit whether an LLM follows the local standard of care.

Problem Statement

Existing medical LLM benchmarks are often rooted in Western curricula and miss local protocols, resources, language, and epidemiology. That gap makes models meant for African primary care hard to evaluate and, as a result, unsafe to deploy. The paper addresses this by building guideline-anchored questions and new stress tests that probe local reasoning and safety.

Main Contribution

A reproducible pipeline to digitize, chunk, and index Kenya's national Level 2–3 clinical guidelines and tie each QA item to source lines.

A RAG-based question-generation workflow that produced a large English QA set and ~10% Kiswahili translations, followed by blinded clinician validation using a 5-point rubric.
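The idea of tying each QA item to source lines can be sketched as follows. This is a minimal illustration only: the `Chunk` and `QAItem` classes, the `chunk_by_lines` helper, and the fixed-size chunking rule are all assumptions made here for clarity. The summary describes the real chunks as semantically indexed and does not specify the chunking rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str


@dataclass(frozen=True)
class QAItem:
    question: str
    answer: str
    chunk_id: str
    source_lines: tuple  # (start_line, end_line) in the guideline text


def chunk_by_lines(lines, max_lines=4):
    """Group consecutive guideline lines into fixed-size chunks (hypothetical rule)."""
    chunks = []
    for start in range(0, len(lines), max_lines):
        block = lines[start:start + max_lines]
        chunks.append(Chunk(
            chunk_id=f"chunk-{len(chunks) + 1:04d}",
            start_line=start + 1,
            end_line=start + len(block),
            text="\n".join(block),
        ))
    return chunks


# Placeholder text standing in for digitized guideline lines.
guideline_lines = [f"Guideline line {i}" for i in range(1, 11)]
chunks = chunk_by_lines(guideline_lines, max_lines=4)

# A QA item records which chunk and which line span it was generated from.
item = QAItem(
    question="Placeholder question drawn from the second chunk",
    answer="Placeholder answer",
    chunk_id=chunks[1].chunk_id,
    source_lines=(chunks[1].start_line, chunks[1].end_line),
)
```

Keeping the line span on the item itself means a reviewer can jump from any question straight to the guideline passage that supports it.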

Key Findings

The Kenya guideline used spans 636 pages and ~416,000 words, yielding 1,115 semantically indexed chunks.

Numbers: 636 pages; ~416,000 words; 1,115 chunks

Practical Use: Indexing the full guideline enables traceable question–answer pairs and makes automated retrieval and targeted updates feasible.

Evidence Ref: Section 2.2 Knowledge Base Construction
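As a rough sanity check on these figures, the averages they imply can be computed directly. These are back-of-envelope means derived from the reported totals; the summary does not report the actual distribution of chunk sizes.

```python
pages = 636
words = 416_000  # approximate figure from the summary
chunks = 1_115

words_per_chunk = words / chunks   # average chunk length in words
chunks_per_page = chunks / pages   # average number of chunks per page
words_per_page = words / pages     # average page length in words

print(f"{words_per_chunk:.0f} words/chunk, "
      f"{chunks_per_page:.2f} chunks/page, "
      f"{words_per_page:.0f} words/page")
```

On average, a chunk is therefore a few hundred words long, and each page yields between one and two chunks.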

The pipeline produced a benchmark dataset in English with thousands of Q-A items and translated ~10% into Kiswahili.

Numbers: dataset size: 'thousands'; Kiswahili ≈ 10%

Practical Use: You can evaluate models on both English and local-language clinical prompts; translate a focused subset for patient-facing tasks to save effort.

Evidence Ref: Abstract; Section 2.1 Co-Creation; Dataset characteristics

What To Try In 7 Days

Digitize one local guideline section and chunk it into retrievable snippets.

Run a small RAG loop: feed a guideline chunk to a chosen LLM and generate 50 MCQs with citations, then have 2 clinicians blind-review them.

Run a single Decision-Points test on 10 vignettes to see whether a model asks guideline-critical questions.
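The blind-review step in the second item can be prototyped with a few lines of code. The sketch below is an assumption-laden illustration, not the authors' workflow: the `blind_items` and `summarize` helpers, the review ID format, and the acceptance thresholds are all hypothetical. Only the 1-to-5 rating scale and the two-reviewer setup come from the text.

```python
import random
from statistics import mean


def blind_items(items, seed=0):
    """Return (sheet, key): shuffled items stripped of provenance, plus the lookup table."""
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)
    sheet, key = [], {}
    for position, index in enumerate(order, start=1):
        review_id = f"R{position:03d}"
        # Reviewers see only an opaque ID and the question text.
        sheet.append({"review_id": review_id, "question": items[index]["question"]})
        # The coordinator keeps the full record, including the generating model.
        key[review_id] = items[index]
    return sheet, key


def summarize(scores, min_mean=4.0, max_gap=1):
    """Accept an item if its mean 1-5 rating is high and reviewers roughly agree.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    summary = {}
    for review_id, ratings in scores.items():
        avg = mean(ratings)
        gap = max(ratings) - min(ratings)
        summary[review_id] = {"mean": avg, "accept": avg >= min_mean and gap <= max_gap}
    return summary


# Placeholder items; "model" and "chunk_id" are provenance fields hidden from reviewers.
items = [
    {"question": f"Placeholder MCQ {i}", "model": "model-a", "chunk_id": f"chunk-{i:04d}"}
    for i in range(1, 6)
]
sheet, key = blind_items(items, seed=42)

# Made-up ratings from two reviewers per item.
scores = {"R001": [5, 4], "R002": [3, 5], "R003": [4, 4]}
report = summarize(scores)
```

Items that fail either check can be routed back for regeneration or a third review before they enter the benchmark.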

Agent Features

Memory

Indexed guideline chunks with metadata and version tags (retrieval memory)
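One way version tags could support a living benchmark is by detecting which chunks changed between guideline releases, so that only the affected QA items are regenerated. The sketch below assumes a simple mapping from chunk ID to text per version and uses a content hash for comparison. The `fingerprint` and `changed_chunks` helpers and the data layout are hypothetical and are not described in the source.

```python
import hashlib


def fingerprint(text):
    """Stable content hash of a chunk, ignoring case and extra whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_chunks(old_index, new_index):
    """Compare two {chunk_id: text} versions and report what needs attention."""
    added = sorted(set(new_index) - set(old_index))
    removed = sorted(set(old_index) - set(new_index))
    modified = sorted(
        cid for cid in set(old_index) & set(new_index)
        if fingerprint(old_index[cid]) != fingerprint(new_index[cid])
    )
    return {"added": added, "removed": removed, "modified": modified}


# Placeholder chunk texts for two hypothetical guideline versions.
v1 = {"c1": "Placeholder text A", "c2": "Placeholder text B", "c3": "Placeholder text C"}
v2 = {"c1": "placeholder  text A", "c2": "Placeholder text B, revised", "c4": "Placeholder text D"}
diff = changed_chunks(v1, v2)

# QA items linked to modified or removed chunks are candidates for regeneration.
qa_items = [
    {"id": "q1", "chunk_id": "c1"},
    {"id": "q2", "chunk_id": "c2"},
    {"id": "q3", "chunk_id": "c3"},
]
stale = [q["id"] for q in qa_items if q["chunk_id"] in diff["modified"] + diff["removed"]]
```

Normalizing case and whitespace before hashing avoids flagging chunks that changed only cosmetically, as with `c1` here.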

Planning

Pipeline orchestration for RAG generation and validation

Tool Use

Mistral-OCR for algorithm extraction

BM25 and vector search for retrieval

Kobo Collect for blinded clinician review

LLM APIs for generation (Gemini/GPT-4o)

Frameworks

Retrieval-augmented generation pipeline

Collaboration

Human-in-the-loop co-creation and blinded expert validation

Optimization Features

Token Efficiency

Chunking and metadata tags to limit prompt size
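A minimal sketch of how chunking plus metadata tags can cap prompt size: filter retrieved chunks by a tag, then add them in rank order until a token budget is reached. The `section` tag, the `pack_context` helper, and the word-count token estimate are assumptions for illustration. A real pipeline would use the target model's tokenizer.

```python
def estimate_tokens(text):
    """Crude proxy: one token per whitespace-separated word (not a real tokenizer)."""
    return len(text.split())


def pack_context(ranked_chunks, budget, section=None):
    """Add chunks in rank order, skipping any that would exceed the token budget.

    If `section` is given, only chunks carrying that metadata tag are considered.
    """
    selected, used = [], 0
    for chunk in ranked_chunks:
        if section is not None and chunk["section"] != section:
            continue
        cost = estimate_tokens(chunk["text"])
        if used + cost <= budget:
            selected.append(chunk["chunk_id"])
            used += cost
    return selected, used


# Placeholder retrieval results, already sorted from most to least relevant.
ranked = [
    {"chunk_id": "c7", "section": "placeholder-section", "text": "word " * 300},
    {"chunk_id": "c2", "section": "placeholder-section", "text": "word " * 450},
    {"chunk_id": "c9", "section": "other-section", "text": "word " * 200},
]
selected, used = pack_context(ranked, budget=600)
filtered, filtered_used = pack_context(ranked, budget=600, section="placeholder-section")
```

The greedy rule keeps the highest-ranked chunks and lets a smaller, lower-ranked chunk fill leftover space when a larger one does not fit.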

System Optimization

Hybrid retrieval to reduce irrelevant context in prompts
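The summary does not say how the keyword (BM25) and vector rankings are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumed stand-in rather than the authors' method. The chunk IDs and both input rankings are placeholders, and `k=60` is a conventional default, not a value from the source.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first.

    Each list contributes 1 / (k + rank) per chunk; chunks ranked highly by
    several retrievers accumulate the largest fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_ranking = ["c3", "c1", "c8"]    # placeholder output of a keyword search
vector_ranking = ["c1", "c5", "c3"]  # placeholder output of an embedding search
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Chunks that appear in both lists rise to the top, which is the mechanism by which hybrid retrieval can push out context that only one retriever considered relevant.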

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Benchmark quality depends on guideline accuracy; outdated or incomplete guidelines propagate errors.

Current work focuses on Kenya Level 2–3 primary care and is not suited to tertiary specialist practice.

When Not To Use

To certify models for tertiary-care or specialist workflows that require different guidelines.

When up-to-date local guideline documents are unavailable or untrustworthy.

Failure Modes

Propagating outdated or incorrect guideline recommendations into model outputs.

Hallucinations if retrieval fails or incorrect chunks are fed to the LLM.

Core Entities

Models

Gemini Flash 2.0 Lite

GPT-4o mini

LLaMA-3.1 (8B)

Metrics

Decision-Points Score

Needle-in-the-Haystack Score

Reverse-QA Persona Score

Context Adaptation Score (CAS)

Cognitive-Bias Stress Test (CBST)

Datasets

Alama Health QA dataset (Kenya Level 2–3 primary care)

Kenya MoH Clinical Guidelines (Level 2–3, 2024)

Benchmarks

AfriMed-QA

HealthBench

MedMCQA

MultiMedQA

Context Entities

Datasets

WHO guidelines / national EPI schedules (used for context checks)

AfriMed-QA (comparison)

