Iteratively generate and verify domain instructions (MatSci-Instruct) to finetune LLaMA into HoneyBee, a materials-science LLM

October 12, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The approach is practical and reproducible and shows strong in-domain gains, but it depends on external commercial LLMs, and generalization beyond the tested benchmarks is not yet demonstrated.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yu Song, Santiago Miret, Huan Zhang, Bang Liu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can cheaply create domain-ready LLMs by synthesizing and verifying instruction data, avoiding costly domain pretraining while getting strong task performance.

Summary TLDR

The authors present MatSci-Instruct, a pipeline in which (1) a strong LLM acting as Instructor (ChatGPT) synthesizes instruction-response pairs, (2) an independent Verifier (Claude) scores and filters those pairs on accuracy, relevance, completeness, and reasonableness, and (3) an Evaluator (GPT-4) closes a feedback loop that iteratively refines the instructions. They use the curated instruction data (≈52k examples) to progressively finetune LLaMA 7B/13B via LoRA, producing the HoneyBee models. HoneyBee improves across the iterative stages and outperforms baseline LLaMA/Alpaca and domain BERTs on the MatSci-NLP benchmark and on internal instruction evaluations. Code and datasets are said to be available.
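The verify-and-filter step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Pair` type, the `Verifier` signature, the threshold value, and the `toy_verifier` stand-in (which replaces a real Verifier LLM call) are all assumptions made for the example. Only the four criterion names come from the summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The four verification criteria named in the summary.
CRITERIA = ("accuracy", "relevance", "completeness", "reasonableness")


@dataclass
class Pair:
    """One synthetic instruction-response pair (hypothetical container)."""
    instruction: str
    response: str


# Hypothetical signature: a verifier maps a pair to one score per criterion.
Verifier = Callable[[Pair], Dict[str, float]]


def filter_pairs(pairs: List[Pair], verify: Verifier, threshold: float) -> List[Pair]:
    """Keep only pairs whose score meets the threshold on every criterion."""
    kept = []
    for pair in pairs:
        scores = verify(pair)
        if all(scores[c] >= threshold for c in CRITERIA):
            kept.append(pair)
    return kept


def toy_verifier(pair: Pair) -> Dict[str, float]:
    """Stand-in for a Verifier LLM call: gives low scores to very short responses."""
    score = 1.0 if len(pair.response.split()) >= 3 else 0.2
    return {c: score for c in CRITERIA}


if __name__ == "__main__":
    candidates = [
        Pair("Define band gap.", "Energy difference between the valence and conduction bands."),
        Pair("Define band gap.", "Unknown."),
    ]
    kept = filter_pairs(candidates, toy_verifier, threshold=0.5)
    print(f"kept {len(kept)} of {len(candidates)} pairs")
```

Requiring every criterion to clear the threshold is one possible filtering rule; an averaged score would be an equally plausible choice under these assumptions.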

Problem Statement

Materials science lacks large, high-quality open text corpora and specialized modern LLMs. Synthetic instruction data can help, but generated data must be checked for scientific accuracy before finetuning models for real tasks.

Main Contribution

MatSci-Instruct: a domain-agnostic generation+verification pipeline for trustworthy instruction data using separate Instructor/Verifier/Evaluator LLMs.

A curated MatSci instruction dataset (~52k instructions) with per-example scores and filtering.

Key Findings

Automatically verified instruction scores correlate well with human experts.

Numbers: Spearman/Pearson correlations of 0.6 to 0.8 vs. humans (Fig. 4)

Practical Use: You can use an LLM verifier to pre-filter synthetic scientific instructions with reasonable alignment to experts, reducing manual review.

Evidence Ref: Section 4.1, Fig. 4
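To check verifier-to-expert alignment on your own sample, Pearson and Spearman correlations can be computed with the standard library alone. The sketch below is illustrative: the score lists are made-up placeholders, not data from the paper, and the function names are chosen for this example.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed samples."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Placeholder scores for illustration only (not from the paper).
    verifier_scores = [4.0, 3.5, 5.0, 2.0, 4.5]
    human_scores = [4.0, 3.0, 4.5, 2.5, 5.0]
    print("pearson: ", round(pearson(verifier_scores, human_scores), 3))
    print("spearman:", round(spearman(verifier_scores, human_scores), 3))
```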

Progressive instruction refinement raises HoneyBee's instruction-eval accuracy to near or above instructor levels.

Numbers: HB-13b-Stage3 accuracy 98.11 vs. Chat-GPT 92.55 (Table 2)

Practical Use: Iterative generate-verify-finetune can produce a domain model that matches or outperforms the instruction source on the same evaluation.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | HB-13b-Stage3 98.11 | Chat-GPT 92.55 | +5.56 | MatSci-Instruct eval (GPT-4 scoring, Table 2) | Table 2 reports stagewise accuracy improvements to 98.11 for HB-13b-Stage3. | Table 2
Instruction-eval Completeness | HB-7b-Stage3 95.78 | LLaMA-7b 90.36 | +5.42 | MatSci-Instruct eval (GPT-4 scoring, Table 2) | Table 2 shows completeness for HB-7b improving stagewise to 95.78 at Stage 3. | Table 2
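The Delta column is the absolute difference (in points) between the reported value and its baseline. A short check, using only the numbers listed in the results above, makes that explicit; the `delta` helper is a name chosen for this example.

```python
# (metric, value, baseline) as listed in the results above.
rows = [
    ("Accuracy", 98.11, 92.55),
    ("Instruction-eval Completeness", 95.78, 90.36),
]


def delta(value: float, baseline: float) -> float:
    """Absolute improvement in points, rounded to two decimals."""
    return round(value - baseline, 2)


if __name__ == "__main__":
    for metric, value, baseline in rows:
        print(f"{metric}: {delta(value, baseline):+.2f}")
```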

What To Try In 7 Days

Run the MatSci-Instruct repo and inspect a sample of the 52k instructions.

Fine-tune a LLaMA-7B with LoRA on a small verified subset and compare against a BERT baseline on a target task.

Add an automatic verifier (Claude or other) to filter generated training pairs and measure alignment with a few human labels.
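For the LoRA finetuning suggested in the list above, it helps to see why the method is cheap: LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors of shapes d_out × r and r × d_in instead. The sketch below counts trainable parameters for one such matrix. The rank of 8 is an assumed value for illustration and is not taken from the paper; 4096 is the hidden size of LLaMA-7B.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors: (d_out x rank) plus (rank x d_in)."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """LoRA parameters relative to the frozen full matrix they adapt."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)


if __name__ == "__main__":
    hidden = 4096  # LLaMA-7B hidden size
    rank = 8       # assumed rank, for illustration only
    print("LoRA params per matrix:", lora_trainable_params(hidden, hidden, rank))
    print(f"fraction of full matrix: {trainable_fraction(hidden, hidden, rank):.4%}")
```

Under these assumptions, each adapted square projection trains well under 1% of the parameters a full update would touch, which is what makes a one-week trial on a small verified subset plausible.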

Agent Features

Planning
progressive refinement-feedback loop
Tool Use
ChatGPT as Instructor, Claude as Verifier, GPT-4 as Evaluator
Frameworks
MatSci-Instruct, LoRA
Architectures
LLaMA
Collaboration
multi-LLM pipeline (instructor/verifier/evaluator)

Optimization Features

System Optimization
training on A100 GPUs (2x for 7B, 4x for 13B)
Training Optimization
LoRA, early stopping on validation score S_val_best
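Early stopping on a best validation score can be sketched as below. The summary only names the tracked quantity (S_val_best); the patience-based stopping rule, the function name, and the example scores are assumptions for illustration, not details confirmed by the source.

```python
from typing import Iterable, Tuple


def early_stopping_epoch(val_scores: Iterable[float], patience: int) -> Tuple[int, float]:
    """Return (best_epoch, s_val_best), stopping after `patience` epochs without improvement.

    Higher validation scores are treated as better.
    """
    s_val_best = float("-inf")
    best_epoch = -1
    stale = 0  # consecutive epochs without a new best score
    for epoch, score in enumerate(val_scores):
        if score > s_val_best:
            s_val_best, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, s_val_best


if __name__ == "__main__":
    # Placeholder validation scores, one per epoch (not from the paper).
    scores = [0.70, 0.75, 0.74, 0.73, 0.90]
    print(early_stopping_epoch(scores, patience=2))
```

In a real training loop the checkpoint saved at `best_epoch` would be the one kept; a small patience can stop before a later improvement, as the last score in the placeholder list illustrates.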

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Unclear generalization beyond MatSci-NLP and MatSci-Instruct data.

Pipeline depends on commercial LLMs (ChatGPT/Claude/GPT-4) which may be costly or change over time.

When Not To Use

When you need rigorously validated experimental protocols (e.g., synthesis recipes) without expert oversight.

If you cannot access or afford the Instructor/Verifier/Evaluator LLMs.

Failure Modes

Verifier LLMs may share systematic biases with Instructor, letting plausible but wrong examples pass.

Overfitting to synthetic instruction styles can reduce real-world robustness.

Core Entities

Models

LLaMA-7b, LLaMA-13b, HoneyBee-7b, HoneyBee-13b, Alpaca-7b, Alpaca-13b, Chat-GPT, Claude, GPT-4, MatBERT, MatSciBERT

Metrics

Accuracy, Completeness, Reasonableness, Macro-F1, Micro-F1, Spearman correlation, Pearson correlation

Datasets

MatSci-Instruct (52k instructions), MatSci-NLP benchmark

Benchmarks

MatSci-NLP