GlassLLaVA: a vision-language model that interprets SEM images of glass using paper text and GPT-4–generated Q&A

September 21, 2023 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

3

Authors

Abdulelah S. Alshehri, Franklin L. Lee, Shihu Wang

Why It Matters For Business

Pairing image encoders with LLMs can automate interpretation of lab SEM images and speed defect triage, but the model needs context and domain-specific data to reach reliable accuracy.

Summary TLDR

The authors build GlassLLaVA, a multimodal vision-language model adapted from LLaVA to interpret scanning electron microscopy (SEM) images of glass. They curated 72 papers (481 SEM images total: 404 train, 77 eval), used GPT-4 to generate 4,291 training and 757 evaluation question-answer pairs, and fine-tuned an OpenLLaMA-7B LLM paired with a CLIP ViT-L/14 vision encoder. GlassLLaVA's scores improve substantially with added context: overall quality averages 67 for simple questions and 77 for complex ones, and defect detection accuracy rises from 75% (no context) to 95% (high context). The model, data, and checkpoints are intended for research use only.

Problem Statement

Interpreting SEM images needs expert domain knowledge and text context. Existing vision or language models lack combined visual+literature grounding for materials-science reasoning. The authors aim to build and evaluate a multimodal model that reads SEM images and related paper text to generate human-like interpretations.

Main Contribution

Curated a focused glass SEM dataset: 72 papers, 481 SEM images (404 train, 77 eval).

Used GPT-4 to generate 4,291 training and 757 evaluation question-answer pairs from paper text/images.

Adapted LLaVA into GlassLLaVA: CLIP ViT-L/14 vision encoder + OpenLLaMA-7B LLM fine-tuned end-to-end.

Defined an evaluation suite: overall quality (0–100), context sensitivity, feature identification (1–4), and binary defect detection.

Showed that richer context substantially improves answer quality and defect detection accuracy.
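The dataset figures in the list above can be sanity-checked with a few lines of arithmetic. This is only a sketch built on the counts reported in this summary; the derived ratios (evaluation share, Q&A pairs per image) are computed here and are not figures stated by the authors.

```python
# Counts as reported in this summary.
train_images, eval_images, total_images = 404, 77, 481
train_qa, eval_qa = 4291, 757

# The two splits should add up to the reported total.
splits_consistent = (train_images + eval_images) == total_images

# Derived ratios (computed here, not reported by the authors).
eval_share_pct = 100.0 * eval_images / total_images
train_qa_per_image = train_qa / train_images
eval_qa_per_image = eval_qa / eval_images

print(f"splits consistent: {splits_consistent}")
print(f"eval share of images: {eval_share_pct:.1f}%")
print(f"Q&A per train image: {train_qa_per_image:.2f}")
print(f"Q&A per eval image: {eval_qa_per_image:.2f}")
```

Both splits carry roughly ten GPT-4 generated questions per image, so the evaluation set is built at about the same question density as the training set.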

Key Findings

Context strongly improves answer quality.

Numbers: General 68.84 (no context) → 92.56 (high context)

Complex (context-rich) questions scored higher than simple ones.

Numbers: average quality, simple 67 vs complex 77 (0–100 scale)

Defect detection accuracy rises with added context.

Numbers: detection accuracy 75% (no context) → 95% (high context)

Feature identification improves with context.

Numbers: morphology score 2.35 (no context) → 3.58 (high context) on a 1–4 scale

Data and Q&A generation used GPT-4 and a human oracle.

Numbers: 4,291 training Q&A and 757 evaluation Q&A generated with GPT-4; extraction validated by a human

Results

Training and evaluation Q&A counts

Value: 4,291 train Q&A; 757 eval Q&A

Dataset size (images)

Value: 404 train images; 77 eval images; total 481

Overall quality scores

Value: simple avg 67; complex avg 77 (scale 0–100)

Context sensitivity (example)

Value: General 68.84 → 92.56 (no context → high context)

Defect detection accuracy by context level

Value: 75% (none), 82.5% (basic), 92.5% (moderate), 95% (high)

Feature identification (morphology example)

Value: morphology 2.35 → 3.58 (no context → high context) on a 1–4 scale
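To make the context effect in the results above easier to compare, the short sketch below converts the reported defect detection accuracies into gains over the no-context baseline. The accuracy values come from this summary; the helper `gains_over_baseline` is a hypothetical name introduced for illustration, and the gains are derived here, not reported by the authors.

```python
# Defect detection accuracy (%) by context level, as reported in this summary.
accuracy = {"none": 75.0, "basic": 82.5, "moderate": 92.5, "high": 95.0}

def gains_over_baseline(scores, baseline="none"):
    """Return percentage-point and relative (%) gains over the baseline level."""
    base = scores[baseline]
    return {
        level: {
            "points": round(value - base, 2),
            "relative_pct": round(100.0 * (value - base) / base, 2),
        }
        for level, value in scores.items()
    }

gains = gains_over_baseline(accuracy)
for level, gain in gains.items():
    print(f"{level:>8}: +{gain['points']} points, +{gain['relative_pct']}% relative")
```

Most of the improvement arrives by the moderate level (17.5 of the 20 points), which suggests that even partial sample context recovers most of the benefit.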

Who Should Care

What To Try In 7 Days

Run a small pilot: fine-tune a vision-language model on 200–500 labeled SEM images with paper captions.

Generate targeted Q&A using GPT-4 and validate 200 items with an expert to seed supervised training.

Add sample metadata (material, process) to prompts and measure defect-spotting change before/after context.
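The before/after comparison in the last step could be set up along the following lines. This is a minimal sketch, not the authors' evaluation code: `build_prompt`, `defect_accuracy`, the prompt wording, and the sample records are all hypothetical, and `stub_model` is a placeholder that exists only so the example runs. In a real pilot it would be replaced by a call to the fine-tuned vision-language model.

```python
from typing import Callable, Dict, List

QUESTION = "Does this SEM image show a defect? Answer yes or no."

def build_prompt(question: str, metadata: Dict[str, str]) -> str:
    """Prefix the question with sample metadata when any is available."""
    if not metadata:
        return question
    context = "; ".join(f"{key}: {value}" for key, value in sorted(metadata.items()))
    return f"Sample context ({context}).\n{question}"

def defect_accuracy(samples: List[dict],
                    ask_model: Callable[[str, str], str],
                    use_context: bool) -> float:
    """Fraction of samples whose yes/no answer matches the expert label."""
    correct = 0
    for sample in samples:
        metadata = sample["metadata"] if use_context else {}
        answer = ask_model(sample["image_path"], build_prompt(QUESTION, metadata))
        predicted = answer.strip().lower().startswith("yes")
        if predicted == sample["has_defect"]:
            correct += 1
    return correct / len(samples)

def stub_model(image_path: str, prompt: str) -> str:
    """Placeholder model (assumption): says 'yes' only when context is present."""
    return "yes" if prompt.startswith("Sample context") else "no"

# Placeholder records; paths, metadata, and labels are made up for illustration.
samples = [
    {"image_path": "a.png", "metadata": {"material": "silica glass"}, "has_defect": True},
    {"image_path": "b.png", "metadata": {"process": "melt-quench"}, "has_defect": True},
]

without_context = defect_accuracy(samples, stub_model, use_context=False)
with_context = defect_accuracy(samples, stub_model, use_context=True)
print(f"accuracy without context: {without_context:.2f}, with context: {with_context:.2f}")
```

Keeping the question text fixed and varying only the metadata prefix isolates the effect of context, which is the quantity the pilot is meant to measure.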

Agent Features

Tool Use

  • GPT-4 (Q&A gen and grading)
  • t-SNE
  • K-Means

Architectures

  • LLaVA (extended)
  • OpenLLaMA-7B
  • CLIP ViT-L/14

Optimization Features

Infra Optimization

  • 8 A100 GPUs; batch size 4; learning rate 1e-5; 28 epochs

System Optimization

  • Trained on 8 nodes each with NVIDIA A100 (40GB)

Training Optimization

  • Fine-tuned from LLaVA COCO checkpoint
  • Cosine LR scheduler
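For readers unfamiliar with cosine scheduling, the sketch below shows the standard cosine decay formula applied to the reported peak learning rate (1e-5) over 28 epochs. It is an illustration only: the summary does not say whether warmup was used, what the minimum learning rate was, or whether the schedule stepped per batch or per epoch, so the per-epoch stepping, zero floor, and absence of warmup here are assumptions.

```python
import math

def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 1e-5, min_lr: float = 0.0) -> float:
    """Standard cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Assumption: one scheduler step per epoch, 28 epochs as reported.
EPOCHS = 28
schedule = [cosine_lr(epoch, EPOCHS) for epoch in range(EPOCHS + 1)]
for epoch in (0, 7, 14, 21, 28):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.3e}")
```

The learning rate starts at the peak, reaches half the peak at the midpoint, and decays smoothly toward the floor by the final epoch.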

Reproducibility

License

  • Model, data, and checkpoints intended for research use only; relies on OpenLLaMA

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Dataset limited to glass SEM literature; not validated on other materials.
  • Training data likely biased toward images that include defects in the literature.
  • GPT-4 used both to create and grade Q&A, introducing judge and generation bias.
  • Limited human expert scoring: a single human evaluator guided prompts and checks.
  • Model built from a 7B OpenLLaMA reproduction; larger LLMs may change results.

When Not To Use

  • Do not use for other material systems without retraining and new benchmarks.
  • Not ready for safety-critical or regulatory decisions without expert review.
  • Avoid relying on the model alone where low-context inputs are common.

Failure Modes

  • Hallucinated or overconfident descriptions when context is missing.
  • False negatives or 'needs more info' answers on low-context prompts.
  • Biased defect detection due to overrepresentation of defects in training literature.
  • Evaluation circularity when GPT-4 grades GPT-4–seeded data.

Core Entities

Models

  • GlassLLaVA
  • LLaVA
  • OpenLLaMA-7B
  • CLIP ViT-L/14
  • GPT-4

Metrics

  • Overall Quality (0-100)
  • Context Assessment (None/Basic/Moderate/High)
  • Feature Identification (1-4)
  • Defect Detection (binary true/false)
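One way to keep these four metrics together when logging per-question results is a small validated record type. The sketch below is a hypothetical structure, not something from the paper: the class name `EvalRecord` and its field names are assumptions, and only the value ranges are taken from the metric list above.

```python
from dataclasses import dataclass

CONTEXT_LEVELS = ("none", "basic", "moderate", "high")

@dataclass(frozen=True)
class EvalRecord:
    """One graded answer; ranges follow the metric list in this summary."""
    overall_quality: float          # 0-100
    context_level: str              # one of CONTEXT_LEVELS
    feature_identification: float   # 1-4
    defect_detected: bool           # binary true/false

    def __post_init__(self):
        if not 0 <= self.overall_quality <= 100:
            raise ValueError("overall_quality must be in [0, 100]")
        if self.context_level not in CONTEXT_LEVELS:
            raise ValueError(f"context_level must be one of {CONTEXT_LEVELS}")
        if not 1 <= self.feature_identification <= 4:
            raise ValueError("feature_identification must be in [1, 4]")

# Example record with made-up values, for illustration only.
record = EvalRecord(overall_quality=80.0, context_level="high",
                    feature_identification=3.0, defect_detected=True)
print(record)
```

Validating ranges at construction time catches grader outputs that fall outside the defined scales before they are averaged into reported scores.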

Datasets

  • Glass SEM paper set (72 papers, 481 SEM images; 404 train, 77 eval)
  • LLaVA COCO (pretraining base)

Benchmarks

  • Benchmark answers extracted from source papers (paper-generated answers)