Overview
Production Readiness: 0.4
Novelty Score: 0.5
Cost Impact Score: 0.6
Citation Count: 0
Why It Matters For Business
Domain-adapted VLMs give clearer, more factual multimodal answers in specialized scientific domains, lowering risk from hallucinations and improving downstream tasks such as literature review and figure interpretation.
Summary TLDR
The authors adapt LLaVA vision-language models to a focused biomedical domain: low-dose radiation therapy (LDRT). They collected 42,673 open-access articles, filtered the images and captions, and produced 50,882 image–text training pairs plus a 1,574-pair evaluation set. Using a two-stage pipeline (projector alignment, then instruction fine-tuning) with LoRA and memory optimizations, the fine-tuned models score about 1–2 points higher on VQA as rated by large LMs on a 0–10 judge scale, and hedging language drops sharply (occurrences of "appears" fall from 1,451 to 49). The models are stronger at complex reasoning and hallucinate less on the curated LDRT evaluation set, but evaluation relies on LLM-as-judge and L
Problem Statement
General-purpose vision-language models hallucinate and miss domain details when reading scientific figures. The paper asks: can we adapt LLaVA-style VLMs to biomedical literature (LDRT) to improve factuality and multimodal question answering?
Main Contribution
Built an LDRT-focused corpus: 42,673 articles → filtered set of ~150k image-caption pairs → produced 50,882 training and 1,574 evaluation pairs.
Two-stage adaptation of LLaVA: projector alignment (training the MLP projector), then instruction fine-tuning of the LLM, keeping the vision encoder frozen.
Efficient fine-tuning recipe: LoRA (r=128, α=256), gradient checkpointing, FlashAttention-2, and DeepSpeed ZeRO-3 to fit training on 4×A40 GPUs.
Evaluation with LLM-as-a-judge (Qwen2-72B and Llama-3.1-70B) and ROUGE/linguistic checks to measure factual consistency and hallucination.
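To make the LoRA settings above concrete, here is a minimal back-of-envelope sketch. The rank and alpha values come from the summary; the 5120×5120 layer shape is an assumption (a typical attention projection in a 13B LLaMA-family model), and `lora_stats` is a hypothetical helper, not part of the authors' code.

```python
# Back-of-envelope LoRA sizing for a single linear layer.
# r=128 and alpha=256 are taken from the summary; the layer shape
# (5120 x 5120) is an assumed example, not a value reported by the paper.

def lora_stats(d_in: int, d_out: int, r: int, alpha: int) -> dict:
    full_params = d_in * d_out        # weights touched by full fine-tuning
    lora_params = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)
    return {
        "scaling": alpha / r,         # factor applied to the low-rank update B @ A
        "full_params": full_params,
        "lora_params": lora_params,
        "trainable_fraction": lora_params / full_params,
    }

stats = lora_stats(d_in=5120, d_out=5120, r=128, alpha=256)
print(stats)
```

Under these assumptions the adapter trains about 5% of the weights in that layer, and the low-rank update is scaled by α/r = 2.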
Key Findings
Fine-tuned models score higher on judged VQA (0–10 scale).
Stronger improvement seen under a larger judge.
Marked reduction in hedging language linked to hallucination.
Dataset split: 50,882 training pairs and 1,574 held-out evaluation pairs.
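The hedging finding can be reproduced in spirit with a simple token count over model outputs. The sketch below is an assumption about how such a count might be done, not the authors' script: only "appears" is named in the summary, the other hedge words are hypothetical additions, and the answers are toy strings.

```python
import re

# Only "appears" is mentioned in the summary; the rest are illustrative guesses.
HEDGES = ("appears", "seems", "possibly", "likely")

def count_hedges(answers, hedges=HEDGES):
    """Count occurrences of each hedge word across a list of answer strings."""
    counts = {h: 0 for h in hedges}
    for text in answers:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in counts:
                counts[token] += 1
    return counts

# Toy outputs, invented for illustration only.
base_answers = [
    "The curve appears to rise and appears to plateau.",
    "It seems dose-dependent.",
]
tuned_answers = ["The curve rises and then plateaus."]

base_counts = count_hedges(base_answers)
tuned_counts = count_hedges(tuned_answers)
print(base_counts, tuned_counts)
```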
Results
LLM-as-a-judge overall mean (Qwen2-72B)
LLM-as-a-judge overall mean (Llama-3.1-70B)
Hedging token count ('appears')
ROUGE alignment
Who Should Care
What To Try In 7 Days
Run PDF→Markdown extraction (Marker) and figure extraction (pdf2figures) on a small internal corpus.
Generate or curate ~5–10k image–caption pairs and make a 5% holdout eval set.
Fine-tune an open LLaVA checkpoint with LoRA on one workstation; test outputs using an LLM judge like Qwen2 or Llama-3.
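The 5% holdout step can be sketched as a seeded shuffle and slice. This is a minimal illustration: `split_holdout` is a hypothetical helper, and the file names and captions are placeholders.

```python
import random

def split_holdout(pairs, holdout_frac=0.05, seed=0):
    """Shuffle deterministically and reserve a fraction of pairs for evaluation."""
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_eval = max(1, round(len(items) * holdout_frac))
    return items[n_eval:], items[:n_eval]

# Placeholder image-caption pairs standing in for a curated corpus.
pairs = [(f"fig_{i}.png", f"caption {i}") for i in range(5000)]
train_pairs, eval_pairs = split_holdout(pairs)
print(len(train_pairs), len(eval_pairs))
```

If several figures come from the same article, splitting by article rather than by pair may reduce leakage between the two sets; the sketch does not do this.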
Optimization Features
Infra Optimization
- DeepSpeed ZeRO-3 to shard parameters, gradients, and optimizer states across GPUs
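A ZeRO stage 3 setup is usually expressed as a DeepSpeed JSON config. The fragment below is a minimal sketch built as a Python dict; only the stage value reflects the summary, while the bf16 setting, batch size, and accumulation steps are assumed placeholders rather than the authors' values.

```python
import json

# Minimal DeepSpeed-style config. Only "stage": 3 is grounded in the summary;
# every other value here is an assumed placeholder.
ds_config = {
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```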
Model Optimization
- LoRA
System Optimization
- FlashAttention-2 for memory-efficient attention
Training Optimization
- Projector alignment, then instruction fine-tuning
- Gradient checkpointing to save activation memory
Inference Optimization
- Low temperature (0.2) and max token limits for stable generation
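To see why a low temperature such as 0.2 stabilizes generation, the sketch below applies temperature scaling to a toy logit vector. The logits are invented for illustration, and `softmax_with_temperature` is a hypothetical helper, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy next-token logits
p_default = softmax_with_temperature(logits, 1.0)
p_low = softmax_with_temperature(logits, 0.2)
print(p_default, p_low)
```

At temperature 0.2 nearly all probability mass moves to the top-scoring token, so sampled outputs vary much less from run to run.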
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation depends on LLM-as-a-judge; judge bias may inflate apparent gains.
- Synthetic QA pairs were generated by a large LLM, which can inject artifacts and domain bias.
- Models were trained for a single epoch on domain data; long-term or broader generalization is untested.
- No claim of clinical validation; not suitable for medical decision-making.
When Not To Use
- Do not deploy for clinical diagnosis or treatment recommendations.
- Avoid using as a single source of truth for regulatory or patient-facing outputs.
- Not validated for imaging modalities outside the scraped LDRT literature.
Failure Modes
- Residual hallucination on out-of-domain or unfamiliar figures.
- Verbosity or overconfident statements despite factual gaps.
- Evaluation blind spots due to judge LLM agreement or shared biases.
Core Entities
Models
- LLaVA v1.5-13B
- LLaVA v1.6-vicuna-13B
- CLIP ViT-L/14
- Qwen2-72B-Instruct (judge)
- Llama-3.1-70B-Instruct (judge)
Metrics
- LLM-as-a-judge score (0-10)
- ROUGE-1 / ROUGE-2
- Length ratio to ground truth
- Hedging token counts (e.g., 'appears')
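Two of the metrics listed above, ROUGE-1 and the length ratio, are simple enough to sketch directly. This is a simplified illustration with whitespace tokenization and no stemming, so its scores may differ from those of a standard ROUGE package; the example strings are invented.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate answer and the ground truth."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def length_ratio(candidate, reference):
    """Candidate length divided by ground-truth length, in tokens."""
    return len(tokenize(candidate)) / len(tokenize(reference))

reference = "dose response curve rises then plateaus"
candidate = "the dose response curve rises"
f1 = rouge1_f1(candidate, reference)
ratio = length_ratio(candidate, reference)
print(round(f1, 3), round(ratio, 3))
```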
Datasets
- LDRT corpus (42,673 articles scraped from Semantic Scholar)
- 50,882 image-text training pairs (derived)
- 1,574 image-text evaluation pairs (held-out)
Benchmarks
- LDRT VQA evaluation set (this paper)

