GlassLLaVA: a vision-language model that interprets SEM images of glass using paper text and GPT-4–generated Q&A

September 21, 2023 | 8 min

Overview

Decision Snapshot: Needs Validation

The system shows promising, reproducible gains from context on a focused glass SEM dataset, but the work is limited to one material class, uses GPT-4 for generation/evaluation, and lacks broad open release and wide human expert validation.

Citations: 3

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/6

Reproducibility

Status: No open assets linked

Open source: Partial

License: Model, data, and checkpoints intended for research use only; relies on OpenLLaMA

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 60%

Authors

Abdulelah S. Alshehri, Franklin L. Lee, Shihu Wang

Links

Abstract / PDF

Why It Matters For Business

Pairing image encoders with LLMs can automate interpretation of lab SEM images and speed defect triage, but the model needs context and domain-specific data to reach reliable accuracy.

Who Should Care

Summary TLDR

The authors build GlassLLaVA, a multimodal vision-language model adapted from LLaVA to interpret scanning electron microscopy (SEM) images of glass. They curated 72 papers (481 SEM images total: 404 train, 77 eval), used GPT-4 to generate 4,291 training and 757 evaluation question-answer pairs, and fine-tuned an OpenLLaMA-7B LLM with a CLIP ViT-L/14 encoder. GlassLLaVA scores improve strongly with added context: overall quality averages 67 for simple questions and 77 for complex ones; defect detection rises from 75% (no context) to 95% (high context). The dataset, evaluation prompts, and license are research-only.

Problem Statement

Interpreting SEM images needs expert domain knowledge and text context. Existing vision or language models lack combined visual+literature grounding for materials-science reasoning. The authors aim to build and evaluate a multimodal model that reads SEM images and related paper text to generate human-like interpretations.

Main Contribution

Curated a focused glass SEM dataset: 72 papers, 481 SEM images (404 train, 77 eval).

Used GPT-4 to generate 4,291 training and 757 evaluation question-answer pairs from paper text/images.

Key Findings

Context strongly improves answer quality.

Numbers: General: 68.84 (no context) → 92.56 (high context)

Practical Use: Provide richer context (material, processing, properties) to get more accurate model interpretations.

Evidence Ref: Results, Fig. 4; Context Assessment

Complex (context-rich) questions scored higher than simple ones.

Numbers: Avg quality: simple 67 vs complex 77 (0-100 scale)

Practical Use: When using the model, frame prompts with background and specifics rather than bare questions.

Evidence Ref: Results, Overall Quality Assessment, Fig. 3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Training and evaluation Q&A counts | 4,291 train Q&A; 757 eval Q&A | - | - | GPT-4 generated from 72 papers | Methods: Data Generation; Table 1 | Data Generation, Table 1 |
| Dataset size (images) | 404 train images; 77 eval images; total 481 | - | - | 72 papers (62 train / 10 eval) | Data Extraction: 62 papers (404 SEM images) train, 10 papers (77 images) eval | Data Extraction section |
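
The split above is reported per paper (62 train / 10 eval), which suggests images from one paper do not appear on both sides. How the authors selected the held-out papers is not stated; the following is only a minimal sketch of a paper-level split, assuming a random choice of eval papers. The function name `split_by_paper`, the `(paper_id, image_id)` record shape, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_paper(image_records, n_eval_papers, seed=0):
    """Split (paper_id, image_id) records so each paper lands in exactly one split."""
    by_paper = defaultdict(list)
    for paper_id, image_id in image_records:
        by_paper[paper_id].append(image_id)

    # Sort before shuffling so the result depends only on the seed.
    paper_ids = sorted(by_paper)
    random.Random(seed).shuffle(paper_ids)
    eval_papers = set(paper_ids[:n_eval_papers])

    train = [(p, i) for p in paper_ids if p not in eval_papers for i in by_paper[p]]
    evaluation = [(p, i) for p in paper_ids if p in eval_papers for i in by_paper[p]]
    return train, evaluation


# Toy data: 6 hypothetical papers with 3 images each (not the real dataset).
records = [(f"paper{p}", f"img{p}_{k}") for p in range(6) for k in range(3)]
train, evaluation = split_by_paper(records, n_eval_papers=2)
```

Splitting at the paper level rather than the image level keeps near-duplicate micrographs from the same study out of the evaluation set, which would otherwise inflate scores.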

What To Try In 7 Days

Run a small pilot: fine-tune a vision-language model on 200–500 labeled SEM images with paper captions.

Generate targeted Q&A using GPT-4 and validate 200 items with an expert to seed supervised training.

Add sample metadata (material, process) to prompts and measure defect-spotting change before/after context.
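
The before/after comparison in the last step can be sketched as follows. This is an illustrative outline, not the authors' evaluation code: `build_prompt`, `defect_accuracy`, the prompt layout, and the example labels are all assumptions, and the model call itself is left out.

```python
def build_prompt(question, metadata=None):
    """Prepend optional sample metadata (e.g. material, process) to a question."""
    if not metadata:
        return question
    context = "; ".join(f"{key}: {value}" for key, value in metadata.items())
    return f"Context: {context}\nQuestion: {question}"


def defect_accuracy(predictions, labels):
    """Fraction of binary defect predictions that match the expert labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one labeled example")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Made-up expert labels and model outputs, for illustration only.
labels = [True, True, False, True]
without_context = [True, False, False, False]
with_context = [True, True, False, True]

gain = defect_accuracy(with_context, labels) - defect_accuracy(without_context, labels)
```

In a real pilot, the two prediction lists would come from running the same images through the model twice, once with `build_prompt(question)` and once with `build_prompt(question, metadata)`, so that context is the only variable that changes.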

Agent Features

Tool Use
GPT-4 (Q&A generation and grading), t-SNE, K-Means
Architectures
LLaVA (extended), OpenLLaMA-7B, CLIP ViT-L/14

Optimization Features

Infra Optimization
8 A100 GPUs; batch size 4; learning rate 1e-5; 28 epochs
System Optimization
Trained on 8 nodes each with NVIDIA A100 (40GB)
Training Optimization
Fine-tuned from LLaVA COCO checkpoint; cosine LR scheduler
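
For readers unfamiliar with cosine scheduling, here is a small sketch of the standard cosine-annealing formula using the reported base learning rate of 1e-5. The summary does not say whether warmup was used or what the minimum learning rate was, so this version assumes no warmup and decay to zero; `cosine_lr` and its parameters are illustrative names.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-5, min_lr=0.0):
    """Cosine-annealed learning rate at `step` out of `total_steps`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps do not overshoot.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a hypothetical 100-step run.
schedule = [cosine_lr(s, 100) for s in (0, 50, 100)]
```

The rate starts at `base_lr`, passes through roughly half of it at the midpoint, and reaches `min_lr` at the final step.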

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Model, data, and checkpoints intended for research use only; relies on OpenLLaMA

Risks & Boundaries

Limitations

Dataset limited to glass SEM literature; not validated on other materials.

Training data likely biased toward images that include defects in the literature.

When Not To Use

Do not use for other material systems without retraining and new benchmarks.

Not ready for safety-critical or regulatory decisions without expert review.

Failure Modes

Hallucinated or overconfident descriptions when context is missing.

False negatives or 'needs more info' answers on low-context prompts.

Core Entities

Models

GlassLLaVA, LLaVA, OpenLLaMA-7B, CLIP ViT-L/14, GPT-4

Metrics

Overall Quality (0-100), Context Assessment (None/Basic/Moderate/High), Feature Identification (1-4), Defect Detection (binary true/false)

Datasets

Glass SEM paper set (72 papers, 481 SEM images; 404 train, 77 eval), LLaVA COCO (pretraining base)

Benchmarks

Benchmark answers extracted from source papers (paper-generated answers)