Overview
The work provides a clear, practical pipeline that is novel for Kenyan primary care. It is methodologically strong but remains a proof of concept: the dataset and scripts are not yet fully open, and no full model benchmarking results are reported here.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 13
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/0
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 70%
Why It Matters For Business
Regulators and health-tech vendors can use a guideline-anchored benchmark to audit whether an AI follows local standards, lowering safety risk and speeding approval. The pipeline also creates study material for training and a versioned asset that reduces regulatory friction.
Summary TLDR
The authors present a reproducible pipeline that converts Kenya's Level 2–3 clinical guidelines into a dynamic QA benchmark (English + Kiswahili) using retrieval-augmented generation (RAG). They pair automated question generation with blinded clinician validation and introduce five bespoke evaluation metrics that test reasoning, rare-clue detection, contextual adaptation, patient-persona fidelity, and cognitive-bias resilience. The dataset and metrics are designed to help regulators and implementers audit whether an LLM follows local standard-of-care.
Problem Statement
Existing medical LLM benchmarks are often rooted in Western curricula and miss local protocols, resources, language, and epidemiology. That gap makes models meant for African primary care hard to evaluate and unsafe to rely on. The paper addresses this by building guideline-anchored questions and new stress tests that probe local reasoning and safety.
Main Contribution
A reproducible pipeline to digitize, chunk, and index Kenya's national Level 2–3 clinical guidelines and tie each QA item to source lines.
A RAG-based question-generation workflow that produced a large English QA set and ~10% Kiswahili translations, followed by blinded clinician validation using a 5-point rubric.
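The idea of tying each QA item to source lines can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Chunk` class, the `chunk_with_line_spans` and `make_qa_item` helpers, and the word-budget chunking rule are all hypothetical, and the sample lines are placeholders rather than guideline text. The summary describes semantic indexing, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: int
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str


def chunk_with_line_spans(lines, max_words=400):
    """Group consecutive lines into chunks of roughly max_words words.

    Each chunk records the span of source lines it came from. A single
    line longer than max_words is kept whole as its own chunk.
    """
    chunks = []
    buf, start, count = [], 1, 0
    for i, line in enumerate(lines, start=1):
        n = len(line.split())
        # Close the current chunk before it would exceed the word budget.
        if buf and count + n > max_words:
            chunks.append(Chunk(len(chunks), start, i - 1, "\n".join(buf)))
            buf, start, count = [], i, 0
        buf.append(line)
        count += n
    if buf:
        end = start + len(buf) - 1
        chunks.append(Chunk(len(chunks), start, end, "\n".join(buf)))
    return chunks


def make_qa_item(question, answer, chunk):
    """Attach provenance so a reviewer can trace the item to the guideline."""
    return {
        "question": question,
        "answer": answer,
        "source": {
            "chunk_id": chunk.chunk_id,
            "lines": (chunk.start_line, chunk.end_line),
        },
    }


# Placeholder lines standing in for digitized guideline text.
sample_lines = ["a b c", "d e", "f g h i", "j"]
sample_chunks = chunk_with_line_spans(sample_lines, max_words=5)
sample_item = make_qa_item("placeholder question", "placeholder answer", sample_chunks[1])
```

Keeping the line span on every item means a clinician reviewing a disputed question can open the guideline at the cited lines instead of searching for the passage.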
Key Findings
The Kenya guideline used spans 636 pages and ~416,000 words, yielding 1,115 semantically indexed chunks.
The pipeline produced an English benchmark dataset with thousands of QA items, about 10% of which were translated into Kiswahili.
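As a rough sanity check on the reported guideline figures, the averages they imply can be computed directly. The short Python sketch below uses only the three numbers stated above; because the word count is approximate, the results are estimates, and the variable names are illustrative.

```python
# Figures reported above; the word count is approximate ("~416,000").
PAGES = 636
WORDS = 416_000
CHUNKS = 1_115

words_per_chunk = WORDS / CHUNKS    # average chunk size
words_per_page = WORDS / PAGES      # average page density
chunks_per_page = CHUNKS / PAGES    # average chunks cut from each page

print(f"words per chunk: {words_per_chunk:.0f}")
print(f"words per page:  {words_per_page:.0f}")
print(f"chunks per page: {chunks_per_page:.2f}")
```

This works out to roughly 373 words per chunk, 654 words per page, and 1.75 chunks per page, so a typical chunk covers a little over half a page of guideline text.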
What To Try In 7 Days
Digitize one local guideline section and chunk it into retrievable snippets.
Run a small RAG loop: feed a guideline chunk to a chosen LLM and generate 50 MCQs with citations, then have 2 clinicians blind-review them.
Run a single Decision-Points test on 10 vignettes to see whether a model asks guideline-critical questions.
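The clinician review in the second step could be tallied with something like the Python sketch below. It assumes two reviewers scoring each item on the 5-point rubric; the acceptance threshold (`accept_mean`), the disagreement limit (`max_gap`), the three decision labels, and both function names are hypothetical choices for illustration, not values taken from the paper.

```python
import random
from statistics import mean


def blind_order(item_ids, seed):
    """Return the item ids in a reviewer-specific shuffled order.

    Only ids are handed out, so reviewers see no generator metadata.
    """
    rng = random.Random(seed)
    order = list(item_ids)
    rng.shuffle(order)
    return order


def aggregate_reviews(ratings, accept_mean=4.0, max_gap=1):
    """Turn per-item rubric scores (1 to 5) into a decision per item.

    ratings maps item_id -> list of scores, one per reviewer.
    Items where reviewers differ by more than max_gap go to adjudication.
    """
    decisions = {}
    for item_id, scores in ratings.items():
        if any(s < 1 or s > 5 for s in scores):
            raise ValueError(f"score out of 1..5 range for {item_id}")
        if max(scores) - min(scores) > max_gap:
            decisions[item_id] = "adjudicate"
        elif mean(scores) >= accept_mean:
            decisions[item_id] = "accept"
        else:
            decisions[item_id] = "revise"
    return decisions


# Placeholder scores for three items from two reviewers.
example_ratings = {"q1": [5, 4], "q2": [2, 5], "q3": [3, 3]}
example_decisions = aggregate_reviews(example_ratings)
example_order = blind_order(example_ratings.keys(), seed=1)
```

Routing large disagreements to adjudication rather than averaging them keeps a split 2-versus-5 verdict from being recorded as a mediocre but acceptable item.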
Risks & Boundaries
Limitations
Benchmark quality depends on guideline accuracy; outdated or incomplete guidelines propagate errors.
Current work focuses on Kenya Level 2–3 primary care and is not suited to tertiary specialist practice.
When Not To Use
When certifying models for tertiary-care or specialist workflows, which require different guidelines.
When up-to-date local guideline documents are unavailable or untrustworthy.
Failure Modes
Propagating outdated or incorrect guideline recommendations into model outputs.
Hallucinations if retrieval fails or incorrect chunks are fed to the LLM.
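One common way to reduce the retrieval-failure risk is to abstain from generation when no chunk matches the query well enough. The Python sketch below illustrates that guard under loud assumptions: it scores matches by simple token overlap, whereas a real RAG system would more likely use embedding similarity, and the `min_score` threshold and both function names are hypothetical. The chunk texts are placeholders.

```python
def token_overlap(query, text):
    """Fraction of query tokens that also appear in the chunk text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    if not q:
        return 0.0
    return len(q & t) / len(q)


def retrieve_or_abstain(query, chunks, min_score=0.5):
    """Return (chunk_id, score) for the best chunk, or None to abstain.

    chunks maps chunk_id -> text. Returning None signals that the caller
    should skip generation instead of passing a weak match to the LLM.
    """
    best_id, best_score = None, 0.0
    for chunk_id, text in chunks.items():
        score = token_overlap(query, text)
        if score > best_score:
            best_id, best_score = chunk_id, score
    if best_id is None or best_score < min_score:
        return None
    return best_id, best_score


# Placeholder chunk texts standing in for indexed guideline snippets.
example_chunks = {"c1": "alpha beta gamma", "c2": "delta epsilon"}
strong_match = retrieve_or_abstain("alpha beta", example_chunks)
no_match = retrieve_or_abstain("zeta eta", example_chunks)
weak_match = retrieve_or_abstain("alpha zeta omega", example_chunks)
```

Here the first query clears the threshold and returns chunk `c1`, while the other two return `None`. The threshold would need tuning against clinician-reviewed examples before being trusted.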

