Iteratively generate and verify domain instructions (MatSci-Instruct) to finetune LLaMA into HoneyBee, a materials-science LLM

October 12, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

5

Authors

Yu Song, Santiago Miret, Huan Zhang, Bang Liu

Links

Abstract / PDF

Why It Matters For Business

You can cheaply create domain-ready LLMs by synthesizing and verifying instruction data, avoiding costly domain pretraining while getting strong task performance.

Summary TLDR

The authors present MatSci-Instruct, a three-part pipeline that (1) uses a strong LLM (ChatGPT) as an Instructor to synthesize instruction-response pairs, (2) uses an independent Verifier (Claude) to score and filter those pairs on accuracy, relevance, completeness, and reasonableness, and (3) closes a feedback loop with an Evaluator (GPT-4) to iteratively refine the instructions. The curated instruction data (≈52k examples) is then used to progressively finetune LLaMA 7B/13B via LoRA, producing the HoneyBee models. HoneyBee improves at each iterative stage and outperforms baseline LLaMA/Alpaca and domain-specific BERT models on the MatSci-NLP benchmark and on internal instruction evaluations. Code and datasets are said to be released.
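The generate-then-verify step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Pair` class, the `filter_pairs` helper, the 0-to-1 score scale, the 0.8 threshold, and the stub verifier are all assumptions made for the example. In a real pipeline the verifier would be a call to an independent LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The four criteria named in the summary; the 0..1 scale is an assumption.
CRITERIA = ("accuracy", "relevance", "completeness", "reasonableness")


@dataclass
class Pair:
    """One synthesized instruction-response pair (hypothetical container)."""
    instruction: str
    response: str


def filter_pairs(
    pairs: List[Pair],
    verifier: Callable[[Pair], Dict[str, float]],
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
) -> List[Pair]:
    """Keep only pairs whose score meets the threshold on every criterion."""
    kept = []
    for pair in pairs:
        scores = verifier(pair)
        if all(scores[c] >= threshold for c in CRITERIA):
            kept.append(pair)
    return kept


def stub_verifier(pair: Pair) -> Dict[str, float]:
    """Toy stand-in for an LLM judge: an empty response fails completeness."""
    completeness = 1.0 if pair.response else 0.0
    return {
        "accuracy": 1.0,
        "relevance": 1.0,
        "completeness": completeness,
        "reasonableness": 1.0,
    }


pairs = [
    Pair("Define band gap.", "The energy gap between valence and conduction bands."),
    Pair("Define alloy.", ""),
]
kept = filter_pairs(pairs, stub_verifier)
print([p.instruction for p in kept])
```

Requiring every criterion to pass, rather than averaging, is one possible design choice: it prevents a fluent but inaccurate answer from being rescued by high scores on the other criteria.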

Problem Statement

Materials science lacks large, high-quality open text corpora and specialized modern LLMs. Synthetic instruction data can help, but generated data must be checked for scientific accuracy before finetuning models for real tasks.

Main Contribution

MatSci-Instruct: a domain-agnostic generation+verification pipeline for trustworthy instruction data using separate Instructor/Verifier/Evaluator LLMs.

A curated MatSci instruction dataset (~52k instructions) with per-example scores and filtering.

HoneyBee: progressively finetuned LLaMA-7B/13B models using LoRA and iterative data refinement.

Empirical evaluations showing progressive gains across stages and improved performance on MatSci-NLP and internal instruction tests.

Open release of code and datasets (repository link provided).

Key Findings

Automatically verified instruction scores correlate well with human experts.

Numbers: Spearman/Pearson correlations 0.6–0.8 vs humans (Fig. 4)

Progressive instruction refinement raises HoneyBee's instruction-eval accuracy to near or above instructor levels.

Numbers: HB-13b-Stage3 accuracy 98.11 vs Chat-GPT 92.55 (Table 2)

HoneyBee beats domain-specific BERTs under low-resource finetuning on MatSci-NLP.

Numbers: Low-resource overall macro-F1: HoneyBee-13b 0.80 vs MatBERT 0.722 (Table 3)

MatSci-Instruct yields a moderately large, diverse instruction corpus.

Numbers: ≈52,658 instructions; average input length 920.8 words (Table 1)
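To check how well an automatic verifier agrees with human experts, as in the correlation finding above, Pearson and Spearman coefficients can be computed from paired scores. The sketch below uses only the standard library. The score lists are made-up illustrative values, not data from the paper, and the function names are chosen for this example.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for five instruction-response pairs.
verifier_scores = [0.9, 0.4, 0.7, 0.8, 0.3]
human_scores = [0.8, 0.5, 0.6, 0.9, 0.2]
print(round(pearson(verifier_scores, human_scores), 3))
print(round(spearman(verifier_scores, human_scores), 3))
```

Spearman only compares orderings, so it is the more forgiving choice when the verifier and the human raters use their scales differently but rank examples similarly.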

Results

Accuracy

Value: HB-13b-Stage3 98.11

Baseline: Chat-GPT 92.55

Instruction-eval Completeness

Value: HB-7b-Stage3 95.78

Baseline: LLaMA-7b 90.36

Low-resource overall macro-F1 (MatSci-NLP)

Value: HoneyBee-13b 0.80

Baseline: MatBERT 0.722

Zero-shot overall macro-F1 (MatSci-NLP)

Value: HoneyBee-13b-Stage3 0.557

Baseline: GPT-4 0.543
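The MatSci-NLP results above are reported as macro-F1. For readers comparing against their own baselines, the sketch below shows how macro-F1 differs from micro-F1 on a single-label classification task. The labels and predictions are toy values chosen for illustration, and the function names are not from the paper or the benchmark code.

```python
from typing import List, Sequence


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class, treating that class as the positive label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no correct positive prediction: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count equally."""
    labels: List[str] = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


def micro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """For single-label tasks, micro-F1 reduces to plain accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: a model that always predicts the majority class "a".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]
print(round(macro_f1(y_true, y_pred), 3), micro_f1(y_true, y_pred))
```

In this toy case micro-F1 looks respectable because the majority class dominates, while macro-F1 is pulled down by the class the model never predicts. That sensitivity to rare classes is why macro-F1 is a common choice for imbalanced benchmarks.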

What To Try In 7 Days

Run the MatSci-Instruct repo and inspect a sample of the 52k instructions.

Fine-tune a LLaMA-7B with LoRA on a small verified subset and compare against a BERT baseline on a target task.

Add an automatic verifier (Claude or other) to filter generated training pairs and measure alignment with a few human labels.
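Before running the LoRA experiment suggested above, it can help to estimate how few parameters LoRA actually trains. LoRA freezes a weight matrix of shape d_out × d_in and learns two low-rank factors of shapes d_out × r and r × d_in. The sketch below does that arithmetic for one square projection matrix. The hidden size 4096 matches LLaMA-7B; the rank r = 8 is an assumption for illustration, since this summary does not state the rank the authors used.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two LoRA factors: (d_out x rank) + (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen dense weight matrix."""
    return d_out * d_in


hidden_size = 4096  # LLaMA-7B hidden dimension
rank = 8            # assumed LoRA rank, not taken from the paper

lora = lora_trainable_params(hidden_size, hidden_size, rank)
full = full_params(hidden_size, hidden_size)
fraction = lora / full

print(lora, full, f"{fraction:.4%}")
```

Under these assumptions each adapted matrix trains well under one percent of the weights it modifies, which is what makes finetuning on a small number of GPUs practical.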

Agent Features

Planning

  • progressive refinement-feedback loop

Tool Use

  • ChatGPT as Instructor
  • Claude as Verifier
  • GPT-4 as Evaluator

Frameworks

  • MatSci-Instruct
  • LoRA

Architectures

  • LLaMA

Collaboration

  • multi-LLM pipeline (instructor/verifier/evaluator)

Optimization Features

System Optimization

  • training on A100 GPUs (2x for 7B, 4x for 13B)

Training Optimization

  • LoRA
  • early stopping on validation score S_val_best
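Early stopping on a best validation score can be sketched as below. This is a generic patience-based version, not the authors' exact criterion: the `patience` parameter, the function name, and the example scores are assumptions, and the list of scores stands in for a real train-and-validate loop.

```python
from typing import Sequence, Tuple


def early_stopping(val_scores: Sequence[float], patience: int = 2) -> Tuple[int, float]:
    """Return (best_epoch, best_score), stopping after `patience` epochs
    in a row without a new best validation score."""
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_gain = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # New best: remember it (a real loop would save a checkpoint here).
            best_score, best_epoch = score, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # stop; later epochs are never evaluated
    return best_epoch, best_score


# Hypothetical per-epoch validation scores.
scores = [0.70, 0.75, 0.74, 0.73, 0.80]
best = early_stopping(scores, patience=2)
print(best)
```

Note that in this example training stops before the final, higher score is ever seen. A larger patience trades extra compute for a lower chance of stopping on a temporary dip.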

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Unclear generalization beyond MatSci-NLP and MatSci-Instruct data.
  • Pipeline depends on commercial LLMs (ChatGPT/Claude/GPT-4) which may be costly or change over time.
  • Verification relies on LLM judges that can have bias or blind spots despite correlation with humans.
  • Paper does not state release of HoneyBee model weights, limiting full reproducibility of deployments.

When Not To Use

  • When you need rigorously validated experimental protocols (e.g., synthesis recipes) without expert oversight.
  • If you cannot access or afford the Instructor/Verifier/Evaluator LLMs.
  • For tasks requiring reliable, provable scientific claims without human review.

Failure Modes

  • Verifier LLMs may share systematic biases with Instructor, letting plausible but wrong examples pass.
  • Overfitting to synthetic instruction styles can reduce real-world robustness.
  • High-scoring outputs may still hallucinate factual details not present in training data.

Core Entities

Models

  • LLaMA-7b
  • LLaMA-13b
  • HoneyBee-7b
  • HoneyBee-13b
  • Alpaca-7b
  • Alpaca-13b
  • Chat-GPT
  • Claude
  • GPT-4
  • MatBERT
  • MatSciBERT

Metrics

  • Accuracy
  • Completeness
  • Reasonableness
  • Macro-F1
  • Micro-F1
  • Spearman correlation
  • Pearson correlation

Datasets

  • MatSci-Instruct (52k instructions)
  • MatSci-NLP benchmark

Benchmarks

  • MatSci-NLP