Fine-tuning LLaVA VLMs on 50k biomedical image-text pairs cuts hallucinations and improves VQA on LDRT literature

Overview

Decision Snapshot: Needs Validation

The method is practical and reproducible using open building blocks, but evidence is limited to a single curated domain and evaluation relies on LLM judges and synthetic QA generation.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Robinson Umeike, Neil Getty, Fangfang Xia, Rick Stevens

Links

Abstract / PDF

Why It Matters For Business

Domain-adapted VLMs give clearer, more factual multimodal answers in specialized scientific domains, lowering risk from hallucinations and improving downstream tasks such as literature review and figure interpretation.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

The authors adapt LLaVA vision-language models to a focused biomedical domain: low-dose radiation therapy (LDRT). They collected 42,673 open-access articles, filtered the images and captions, and produced 50,882 image–text training pairs plus a 1,574-pair evaluation set. Training runs in two stages (projector alignment, then instruction fine-tuning) with LoRA and memory optimizations. On VQA scored by LLM judges on a 0–10 scale, the fine-tuned models score roughly 2 points higher than the baseline (+1.80 and +2.21 under the two judges) and use far less hedging language (occurrences of "appears" fall from 1,451 to 49). The models are stronger at complex reasoning and hallucinate less on the curated LDRT evaluation set, but evaluation relies on LLM-as-judge and L

Problem Statement

General-purpose vision-language models hallucinate and miss domain details when reading scientific figures. The paper asks: can we adapt LLaVA-style VLMs to biomedical literature (LDRT) to improve factuality and multimodal question answering?

Main Contribution

Built an LDRT-focused corpus: 42,673 articles → filtered set of ~150k image-caption pairs → produced 50,882 training and 1,574 evaluation pairs.

Two-stage adaptation of LLaVA: projector alignment (training the MLP projector), then instruction fine-tuning of the LLM, with the vision encoder kept frozen.
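The two-stage schedule can be pictured as a choice of which parameter groups receive gradients in each stage. The sketch below is illustrative only: the parameter-name prefixes are hypothetical, and keeping the projector trainable in stage 2 is an assumption borrowed from the original LLaVA recipe, not something this summary confirms. With LoRA, the "llm." group would stand for adapter weights rather than the full LLM.

```python
# Hypothetical parameter-name prefixes; real LLaVA checkpoints may name modules differently.
STAGE_TRAINABLE = {
    "projector_alignment": ("projector.",),
    # Assumption: the projector keeps training in stage 2, as in the original LLaVA recipe.
    "instruction_finetuning": ("projector.", "llm."),
}


def trainable_params(param_names, stage):
    """Return the parameter names that would receive gradients in `stage`."""
    prefixes = STAGE_TRAINABLE[stage]
    return [name for name in param_names if name.startswith(prefixes)]


# Toy parameter list standing in for a real model's named parameters.
params = [
    "vision_encoder.layer0.weight",
    "projector.fc1.weight",
    "projector.fc2.weight",
    "llm.layer0.attn.q_proj.weight",
]
stage1 = trainable_params(params, "projector_alignment")
stage2 = trainable_params(params, "instruction_finetuning")
print("stage 1:", stage1)
print("stage 2:", stage2)
```

The vision encoder never appears in either list, which mirrors the frozen-encoder choice described above.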

Key Findings

Fine-tuned models score higher on judged VQA (0–10 scale).

Numbers: Qwen2 judge overall mean 3.46 → 5.26; Δ +1.80

Practical Use: Expect roughly 2-point gains in judged accuracy and helpfulness (0–10 scale) on similar domain VQA when adapting LLaVA with domain image–text pairs.

Evidence Ref: Table 1, Qwen2-72B-Instruct row

The improvement is larger under a second judge, Llama-3.1-70B-Instruct.

Numbers: Llama-3.1 judge overall mean 4.48 → 6.69; Δ +2.21

Practical Use: The improvement holds under both judge models, so the gain is less likely to be an artifact of a single evaluator.

Evidence Ref: Table 1, Llama3.1-70B-Instruct row

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LLM-as-a-judge overall mean (Qwen2-72B)	ours 5.26 / baseline 3.46	LLaVA v1.5 overall 3.46 ± 2.53	+1.80	LDRT evaluation set (1,574 pairs)	Table 1 Qwen2-72B-Instruct comparison	Table 1
LLM-as-a-judge overall mean (Llama-3.1-70B)	ours 6.69 / baseline 4.48	LLaVA v1.5 overall 4.48 ± 2.07	+2.21	LDRT evaluation set (1,574 pairs)	Table 1 Llama3.1-70B-Instruct comparison	Table 1
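To put the deltas in context, the short script below recomputes them from the table and also expresses each gain relative to the baseline mean and in units of the baseline standard deviation. This is a rough reading under stated assumptions: only the baseline spread appears in this summary, so the last figure is not a full effect size, and the dictionary keys are names chosen here for illustration.

```python
# Means and baseline standard deviations as reported in the Results table above.
RESULTS = {
    "Qwen2-72B-Instruct": {"baseline": 3.46, "baseline_sd": 2.53, "finetuned": 5.26},
    "Llama-3.1-70B-Instruct": {"baseline": 4.48, "baseline_sd": 2.07, "finetuned": 6.69},
}


def summarize(row):
    """Absolute gain, gain relative to the baseline mean, and gain in baseline SDs."""
    delta = row["finetuned"] - row["baseline"]
    return {
        "delta": round(delta, 2),
        "relative_gain": round(delta / row["baseline"], 2),
        "delta_in_baseline_sd": round(delta / row["baseline_sd"], 2),
    }


summary = {judge: summarize(row) for judge, row in RESULTS.items()}
for judge, stats in summary.items():
    print(judge, stats)
```

Both gains are large relative to the baseline mean, but the baseline scores are widely spread, so per-item variation remains substantial.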

What To Try In 7 Days

Run PDF→Markdown extraction (Marker) and figure extraction (pdffigures2) on a small internal corpus.

Generate or curate ~5–10k image–caption pairs and make a 5% holdout eval set.

Fine-tune an open LLaVA checkpoint with LoRA on one workstation; test outputs using an LLM judge like Qwen2 or Llama-3.
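The 5% holdout step above can be sketched as a seeded shuffle-and-split. The function name, the seed, and the record fields below are hypothetical; the only figure taken from the text is the 5% fraction.

```python
import random


def split_holdout(pairs, holdout_fraction=0.05, seed=0):
    """Shuffle a copy of `pairs` and split it into (train, holdout)."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical image-caption records standing in for a curated corpus.
pairs = [{"image": f"fig_{i:05d}.png", "caption": f"caption {i}"} for i in range(5000)]
train, holdout = split_holdout(pairs)
print(len(train), len(holdout))
```

One design choice worth considering: figures from the same article can be closely related, so splitting by article rather than by pair may give a stricter evaluation set.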

Optimization Features

Infra Optimization

DeepSpeed ZeRO-3 to shard parameters, gradients, and optimizer states across GPUs

Model Optimization

LoRA
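A quick calculation shows why LoRA reduces the number of trainable parameters. For one weight matrix of shape d_out × d_in, LoRA trains two low-rank factors with rank × (d_out + d_in) parameters in total. The numbers below are assumptions for illustration: 5120 is the hidden size of 13B LLaMA-family models, and rank 16 is a placeholder, since this summary does not report the rank used.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumptions: a 5120 x 5120 projection and rank 16 (the rank is not given in the summary).
d = 5120
rank = 16
full = d * d
lora = lora_trainable_params(d, d, rank)
fraction = lora / full
print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.4%}")
```

Under these assumptions the adapter holds well under 1% of the parameters of the matrix it modifies.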

System Optimization

FlashAttention-2 for memory-efficient attention

Training Optimization

Projector alignment then instruction fine-tuning

Gradient checkpointing to save activation memory

Inference Optimization

Low temperature (0.2) and max token limits for stable generation
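The effect of a low sampling temperature can be seen in a toy softmax. The logits below are hypothetical; the only value taken from the text is the temperature of 0.2. Dividing logits by a temperature below 1 widens the gaps between them, so sampling concentrates on the top token and outputs vary less between runs.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
p_default = softmax_with_temperature(logits, 1.0)
p_low = softmax_with_temperature(logits, 0.2)
print("T=1.0:", [round(p, 3) for p in p_default])
print("T=0.2:", [round(p, 3) for p in p_low])
```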

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Evaluation depends on LLM-as-a-judge; judge bias may inflate apparent gains.

Synthetic QA pairs were generated by a large LLM, which can inject artifacts and domain bias.

When Not To Use

Do not deploy for clinical diagnosis or treatment recommendations.

Avoid using as a single source of truth for regulatory or patient-facing outputs.

Failure Modes

Residual hallucination on out-of-domain or unfamiliar figures.

Verbosity or overconfident statements despite factual gaps.

Core Entities

Models

LLaVA v1.5-13B

LLaVA v1.6-vicuna-13B

CLIP ViT-L/14

Qwen2-72B-Instruct (judge)

Llama-3.1-70B-Instruct (judge)

Metrics

LLM-as-a-judge score (0–10)

ROUGE-1 / ROUGE-2

Length ratio to ground truth

Hedging token counts (e.g., 'appears')
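Two of these metrics, hedging token counts and length ratio to ground truth, are simple enough to sketch directly. The version below is a minimal illustration, assuming whole-word, case-insensitive matching and whitespace tokenization; the paper's exact hedge list and tokenization are not given in this summary, and the sample responses are made up.

```python
import re


def count_hedges(responses, hedge_words=("appears",)):
    """Count whole-word, case-insensitive occurrences of each hedge word."""
    counts = {}
    for word in hedge_words:
        pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
        counts[word] = sum(len(pattern.findall(text)) for text in responses)
    return counts


def length_ratio(response, reference):
    """Ratio of whitespace-token counts between a response and its ground truth."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one token")
    return len(response.split()) / ref_len


# Hypothetical model responses, for illustration only.
responses = [
    "The figure appears to show a dose-response curve.",
    "Survival decreases as dose increases.",
    "It appears that the control group appears unchanged.",
]
hedges = count_hedges(responses)
ratio = length_ratio("Survival decreases as dose increases.", "Survival falls with dose.")
print(hedges, ratio)
```

A falling hedge count only indicates more assertive wording; it does not by itself show that the assertions are correct, which is why it is reported alongside judged scores.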

Datasets

LDRT corpus (42,673 articles scraped from Semantic Scholar)

50,882 image-text training pairs (derived)

1,574 image-text evaluation pairs (held-out)

Benchmarks

LDRT VQA evaluation set (this paper)

