Fine-tuning LLaVA VLMs on 50k biomedical image-text pairs cuts hallucinations and improves VQA on LDRT literature

January 26, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical and reproducible using open building blocks, but evidence is limited to a single curated domain and evaluation relies on LLM judges and synthetic QA generation.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Robinson Umeike, Neil Getty, Fangfang Xia, Rick Stevens


Why It Matters For Business

Domain-adapted VLMs give clearer, more factual multimodal answers in specialized scientific domains, lowering risk from hallucinations and improving downstream tasks such as literature review and figure interpretation.

Who Should Care

Summary TLDR

The authors adapt LLaVA vision-language models to a focused biomedical domain: low-dose radiation therapy (LDRT). They collected 42,673 open-access articles, filtered the images and captions, and produced 50,882 image–text training pairs plus a 1,574-pair evaluation set. Trained with a two-stage pipeline (projector alignment, then instruction fine-tuning) using LoRA and memory optimizations, the fine-tuned models score about 1.8 to 2.2 points higher on VQA as rated by large LLM judges on a 0–10 scale, and show a large drop in hedging language (occurrences of "appears" fall from 1,451 to 49). The models are stronger at complex reasoning and hallucinate less on the curated LDRT evaluation set, but evaluation relies on LLM-as-judge and L

Problem Statement

General-purpose vision-language models hallucinate and miss domain details when reading scientific figures. The paper asks: can we adapt LLaVA-style VLMs to biomedical literature (LDRT) to improve factuality and multimodal question answering?

Main Contribution

Built an LDRT-focused corpus: 42,673 articles → filtered set of ~150k image-caption pairs → produced 50,882 training and 1,574 evaluation pairs.

Two-stage adaptation of LLaVA: projector alignment (training the MLP projector), then instruction fine-tuning of the LLM, with the vision encoder kept frozen throughout.
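The two-stage schedule can be written down as a small table of which components receive gradient updates in each stage. This is a minimal sketch, not the authors' training code: the component and stage names are hypothetical labels, and updating the projector again in stage 2 is an assumption borrowed from the standard LLaVA recipe rather than something this summary states.

```python
from dataclasses import dataclass

# Hypothetical component labels for a LLaVA-style model.
COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components that receive gradient updates


STAGES = (
    # Stage 1: only the MLP projector is trained.
    Stage("projector_alignment", frozenset({"projector"})),
    # Stage 2: the LLM is instruction fine-tuned.
    # Assumption: the projector stays trainable here as well.
    Stage("instruction_finetuning", frozenset({"projector", "llm"})),
)


def requires_grad(stage):
    """Map each component to True if it is updated in this stage."""
    return {name: name in stage.trainable for name in COMPONENTS}


for stage in STAGES:
    flags = requires_grad(stage)
    # The vision encoder is frozen in both stages.
    assert not flags["vision_encoder"]
    print(stage.name, flags)
```

In a real training script these flags would typically be applied by toggling `requires_grad` on each module's parameters before building the optimizer.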

Key Findings

Fine-tuned models score higher on judged VQA (0–10 scale).

Numbers: Qwen2 judge overall mean 3.46 → 5.26; Δ +1.80

Practical Use: Expect ~1–2 point accuracy/helpfulness gains on similar domain VQA when adapting LLaVA with domain pairs.

Evidence Ref: Table 1, Qwen2-72B-Instruct row

A larger improvement is seen under the Llama-3.1-70B-Instruct judge.

Numbers: Llama3.1 judge overall mean 4.48 → 6.69; Δ +2.21

Practical Use: Improvements are visible across two independent judges, so the gain from domain fine-tuning is not specific to one evaluation model.

Evidence Ref: Table 1, Llama3.1-70B-Instruct row

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
LLM-as-a-judge overall mean (Qwen2-72B) | ours 5.26 / baseline 3.46 | LLaVA v1.5 overall 3.46 ± 2.53 | +1.80 | LDRT evaluation set (1,574 pairs) | Table 1, Qwen2-72B-Instruct comparison | Table 1
LLM-as-a-judge overall mean (Llama-3.1-70B) | ours 6.69 / baseline 4.48 | LLaVA v1.5 overall 4.48 ± 2.07 | +2.21 | LDRT evaluation set (1,574 pairs) | Table 1, Llama3.1-70B-Instruct comparison | Table 1
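The deltas in the table follow directly from the reported means. The sketch below recomputes them and, as an illustrative extra that is not part of the paper, expresses each gain in units of the baseline spread. It assumes the "±" values are standard deviations of the baseline scores; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    judge: str
    baseline_mean: float
    baseline_std: float  # assumption: the "±" value is a standard deviation
    finetuned_mean: float

    @property
    def delta(self):
        """Fine-tuned mean minus baseline mean, on the 0-10 judge scale."""
        return round(self.finetuned_mean - self.baseline_mean, 2)

    @property
    def delta_in_baseline_std(self):
        """Gain divided by the baseline spread (illustrative normalization)."""
        return round(self.delta / self.baseline_std, 2)


# Values copied from the results table.
RESULTS = [
    JudgeResult("Qwen2-72B-Instruct", 3.46, 2.53, 5.26),
    JudgeResult("Llama-3.1-70B-Instruct", 4.48, 2.07, 6.69),
]

for r in RESULTS:
    print(f"{r.judge}: delta={r.delta:+.2f}, "
          f"in baseline std units={r.delta_in_baseline_std:.2f}")
```

Read this way, both gains are on the order of one baseline standard deviation or less, which is one reason the per-judge spread matters when interpreting the means.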

What To Try In 7 Days

Run PDF→Markdown extraction (Marker) and figure extraction (pdf2figures) on a small internal corpus.

Generate or curate ~5–10k image–caption pairs and make a 5% holdout eval set.

Fine-tune an open LLaVA checkpoint with LoRA on one workstation; test outputs using an LLM judge like Qwen2 or Llama-3.
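The 5% holdout mentioned in the steps above can be carved out with a seeded shuffle so the split is reproducible. This is a minimal sketch: the function name, the file-name pattern, and the placeholder captions are hypothetical, and it splits at the pair level rather than by source article.

```python
import random


def split_holdout(pairs, holdout_fraction=0.05, seed=0):
    """Return (train, held_out) after a deterministic shuffle."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # local RNG, no global state
    n_eval = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Placeholder image-caption pairs standing in for a curated internal corpus.
pairs = [(f"fig_{i:05d}.png", f"caption {i}") for i in range(5000)]
train, held_out = split_holdout(pairs)
print(len(train), len(held_out))  # 4750 250
```

If several figures come from the same article, grouping by article before splitting is likely the safer choice, since it keeps near-duplicate figures out of the evaluation set.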

Optimization Features

Infra Optimization: DeepSpeed ZeRO-3 for model/data parallelism

Model Optimization: LoRA

System Optimization: FlashAttention-2 for memory-efficient attention

Training Optimization: Projector alignment then instruction fine-tuning; gradient checkpointing to save activation memory

Inference Optimization: Low temperature (0.2) and max token limits for stable generation
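To see why a low sampling temperature such as 0.2 stabilizes generation, it helps to look at what temperature does to a next-token distribution. The sketch below applies temperature scaling to a softmax over made-up logits; the logit values are illustrative only and do not come from the paper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed stably."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]
p_default = softmax_with_temperature(logits, 1.0)
p_low = softmax_with_temperature(logits, 0.2)

print("T=1.0:", [round(p, 3) for p in p_default])
print("T=0.2:", [round(p, 3) for p in p_low])
```

At temperature 0.2 nearly all probability mass moves to the top-scoring token, so repeated runs produce close to the same answer, which makes judged comparisons less noisy.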

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation depends on LLM-as-a-judge; judge bias may inflate apparent gains.

Synthetic QA pairs were generated by a large LLM, which can inject artifacts and domain bias.

When Not To Use

Do not deploy for clinical diagnosis or treatment recommendations.

Avoid using as a single source of truth for regulatory or patient-facing outputs.

Failure Modes

Residual hallucination on out-of-domain or unfamiliar figures.

Verbosity or overconfident statements despite factual gaps.

Core Entities

Models

LLaVA v1.5-13B

LLaVA v1.6-vicuna-13B

CLIP ViT-L/14

Qwen2-72B-Instruct (judge)

Llama-3.1-70B-Instruct (judge)

Metrics

LLM-as-a-judge score (0-10)

ROUGE-1 / ROUGE-2

Length ratio to ground truth

Hedging token counts (e.g., 'appears')
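The three lexical metrics above are simple enough to approximate without extra dependencies. The following is a simplified sketch, not the paper's evaluation code: the tokenizer is a plain regex, ROUGE-1 is computed as unigram-overlap F1 without stemming, and every hedging term other than "appears" is a hypothetical addition.

```python
import re
from collections import Counter

# Only "appears" is named in the source; the other terms are hypothetical.
HEDGING_TERMS = ("appears", "seems", "possibly", "likely")


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hedging_counts(outputs, terms=HEDGING_TERMS):
    """Count occurrences of each hedging term across model outputs."""
    counts = Counter()
    for text in outputs:
        counts.update(tok for tok in tokenize(text) if tok in terms)
    return counts


def length_ratio(prediction, reference):
    """Token length of the prediction relative to the ground truth."""
    ref_len = len(tokenize(reference))
    if ref_len == 0:
        raise ValueError("reference is empty")
    return len(tokenize(prediction)) / ref_len


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 (simplified ROUGE-1, no stemming)."""
    pred, ref = Counter(tokenize(prediction)), Counter(tokenize(reference))
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


prediction = "The curve appears to show dose dependent survival"
reference = "The curve shows dose dependent survival"
print(hedging_counts([prediction]))
print(round(length_ratio(prediction, reference), 3))
print(round(rouge1_f1(prediction, reference), 3))
```

Scores from an established ROUGE package may differ slightly from this sketch because of stemming and tokenization choices.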

Datasets

LDRT corpus (42,673 articles scraped from Semantic Scholar)

50,882 image-text training pairs (derived)

1,574 image-text evaluation pairs (held-out)

Benchmarks

LDRT VQA evaluation set (this paper)