Hierarchical ReAct agents ground LLMs to Materials Project data and run language-driven simulations with near-zero hallucination

Overview

Decision Snapshot: Ready For Pilot

The method uses established RAG and ReAct tooling but assembles it into a hierarchical, self-correcting agent graph; experiments across multiple properties and live workflows support its practical utility.

Citations: 21

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 55%

Authors

Yuan Chiang, Elvis Hsieh, Chia-Hong Chou, Janosh Riebesell

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Grounding LLMs to authoritative databases and tools reduces dangerous hallucinations and lets teams automate reproducible workflows (data fetch → simulation → analysis) without model fine-tuning, cutting verification time and accelerating materials R&D.

Who Should Care

CTO

Product Manager

ML Engineer

Data Scientist

Summary TLDR

LLaMP is a retrieval-augmented generation (RAG) framework that layers a supervisor ReAct agent over specialized assistant ReAct agents, letting LLMs (without fine-tuning) query the Materials Project and the literature and run atomistic simulations. On benchmarks for bulk modulus, formation energy, bandgap, and magnetism, LLaMP improves answer consistency and reduces the large errors seen with vanilla LLMs. It also supports crystal editing and language-driven molecular dynamics using ML force fields. Code and a demo are provided.

Problem Statement

LLMs hallucinate and lack up-to-date domain memory. Fine-tuning is brittle and not easily traceable. Scientists need reliable, verifiable LLM answers tied to high-fidelity materials databases and executable simulation tools.

Main Contribution

A hierarchical ReAct agent architecture (supervisor + specialized assistants) that delegates API/tool calls to reduce context load and enable self-correction.

A self-consistency metric (SCoR) combining uncertainty and confidence to measure reproducibility of numeric LLM answers.
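The summary says SCoR combines uncertainty and confidence, and the metrics list names precision (standard deviation) and confidence (valid-response ratio), but the exact formula is not given here. The sketch below computes the two named ingredients from repeated runs of the same query and then combines them with an illustrative formula. That combination, the function name `consistency_summary`, and the `None` convention for unusable runs are all assumptions for illustration, not the paper's definition.

```python
import math
import statistics
from typing import Optional, Sequence


def consistency_summary(responses: Sequence[Optional[float]]) -> dict:
    """Summarize repeated numeric answers to one query.

    `None` marks a run that produced no usable number.
    """
    valid = [r for r in responses if r is not None]
    # Confidence: fraction of runs that returned a usable number.
    confidence = len(valid) / len(responses) if responses else 0.0
    if not valid:
        return {"confidence": 0.0, "precision": None, "score": 0.0}

    mean = statistics.fmean(valid)
    # Precision: spread (population standard deviation) across runs.
    precision = statistics.pstdev(valid)

    # ASSUMPTION: illustrative combination only, not the paper's SCoR.
    # The score decays with relative spread and scales with confidence.
    scale = abs(mean) if mean != 0 else 1.0
    score = confidence * math.exp(-precision / scale)
    return {"confidence": confidence, "precision": precision, "score": score}


# Five identical answers: zero spread, every run valid.
print(consistency_summary([1.1, 1.1, 1.1, 1.1, 1.1]))
# One of two runs failed to return a number.
print(consistency_summary([1.0, None]))
```

With any formula of this shape, a score of 1.00 requires both that every run returns a number and that all runs agree, which matches how the bandgap result in Key Findings is described.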

Key Findings

LLaMP reduces bulk-modulus prediction error compared to web-augmented GPT-4 and other baselines.

Numbers: Bulk modulus MAE = 14.57 GPa (LLaMP) vs ~41 GPa (GPT-4/GPT-4+Serp) on the evaluated set

Practical Use: Use LLaMP RAG when you need reliable elastic properties; it cuts the large systematic errors of vanilla LLMs on the evaluated materials.

Evidence Ref: Table 1; Figure 2

LLaMP yields near-perfect consistency for some numeric queries.

Numbers: Bandgap (common compounds) SCoR = 1.00 and MAE = 0.00 eV (LLaMP)

Practical Use: For well-covered properties, LLaMP gives repeatable answers, making it suitable for pipelines that require deterministic numeric outputs.

Evidence Ref: Table 1 (Electronic Bandgap - Common)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Bulk modulus MAE (GPa)	14.574 (LLaMP, GPT-4 backend)	41.225 (GPT-4)	-26.65 GPa	sampled metals / materials in Table 1	Table 1 reports MAE averaged over five runs and sampled materials	Table 1
Formation energy MAE (eV/atom)	0.009 (LLaMP)	1.680 (GPT-4)	-1.671 eV/atom	sampled compounds in Table 1	Table 1 averages over five runs and sampled materials	Table 1
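The Delta column is the method's MAE minus the baseline's MAE, so a negative value means LLaMP has the lower error. The short sketch below reproduces that arithmetic from the table values and shows how an MAE is computed. The helper names and the two-material example data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Average absolute difference between predictions and reference values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def mae_delta(method_mae: float, baseline_mae: float) -> float:
    """Negative result: the method has a lower error than the baseline."""
    return method_mae - baseline_mae


# MAE values copied from the Results table above.
bulk_delta = mae_delta(14.574, 41.225)      # GPa
formation_delta = mae_delta(0.009, 1.680)   # eV/atom

print(f"bulk modulus delta: {bulk_delta:.3f} GPa")
print(f"formation energy delta: {formation_delta:.3f} eV/atom")

# Hypothetical data (not from the paper): two predictions, each 10 GPa off.
print(mean_absolute_error([100.0, 150.0], [110.0, 140.0]))
```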

What To Try In 7 Days

Clone the repo and run the live demo to see retrieval and simulation examples.

Try a single property query (e.g., formation energy) and compare LLaMP vs vanilla GPT-4 outputs.

Hook LLaMP assistant agents to your internal materials DB and test a simple RAG query pipeline.

Agent Features

Memory

episodic memory retrieved from assistant agents

vector DB suggested for caching QA pairs

Planning

hierarchical planning

task decomposition by supervisor

Tool Use

function calling to APIs

Python REPL execution

atomate2 simulation workflows

Frameworks

LangChain

ReAct

Is Agentic

Yes

Architectures

hierarchical ReAct agents (supervisor + assistants)

Collaboration

supervisor routes subtasks to specialized assistants
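The routing pattern can be sketched in plain Python without any agent framework. In this sketch a keyword match stands in for the supervisor LLM's routing decision, and each assistant's `handle` function stands in for a full ReAct loop with its own tools and API schema. The assistant names, keywords, and stub handlers are hypothetical and are not taken from the LLaMP code base.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Assistant:
    name: str
    keywords: List[str]
    # Stand-in for a ReAct loop that owns its own tools and API schema.
    handle: Callable[[str], str]


class Supervisor:
    """Routes each subtask to one specialized assistant."""

    def __init__(self, assistants: List[Assistant]) -> None:
        self.assistants = assistants

    def route(self, subtask: str) -> Assistant:
        # ASSUMPTION: keyword matching replaces the LLM's routing decision.
        text = subtask.lower()
        for assistant in self.assistants:
            if any(keyword in text for keyword in assistant.keywords):
                return assistant
        raise LookupError(f"no assistant can handle: {subtask!r}")

    def run(self, subtasks: List[str]) -> List[str]:
        results = []
        for subtask in subtasks:
            assistant = self.route(subtask)
            results.append(f"{assistant.name}: {assistant.handle(subtask)}")
        return results


# Hypothetical assistants with stub handlers instead of real API calls.
supervisor = Supervisor([
    Assistant("elasticity", ["bulk modulus", "elastic"],
              lambda q: "(stub) would query elastic data"),
    Assistant("thermo", ["formation energy"],
              lambda q: "(stub) would query thermodynamic data"),
    Assistant("electronic", ["bandgap"],
              lambda q: "(stub) would query electronic structure data"),
])

for line in supervisor.run(["Find the bulk modulus of Cu",
                            "Find the bandgap of Si"]):
    print(line)
```

Because only the selected assistant needs to see its own tool schema, the supervisor's context stays small, which is the token-saving idea described under Optimization Features.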

Optimization Features

Token Efficiency

schema offloading reduces unnecessary tokens

System Optimization

modular assistants for extensibility and reduced per-agent cognitive load

Inference Optimization

offload API schemas to assistant agents to reduce context window use

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/chiang-yuan/llamp

Data URLs

https://materialsproject.org

https://arxiv.org

Risks & Boundaries

Limitations

Performance depends on Materials Project coverage and DFT approximations (GGA underestimates bandgaps).

Relies on backbone LLM function-calling quality; some models mis-handle API schemas or sort sign conventions.

When Not To Use

When your required data is absent or unreliable in Materials Project.

If you require integrated experimental data sources not supported by current assistant agents.

Failure Modes

Agent mis-parses API schema leading to wrong query arguments (e.g., sort sign errors).

Hallucination when RAG tools are not available or not used.
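One way to guard against the first failure mode is to validate a sort argument before the query is sent. The sketch below assumes a convention in which a leading "-" means descending order; that convention, the field names in `ALLOWED_FIELDS`, and both helper functions are assumptions for illustration and should be checked against the actual API schema in use.

```python
from typing import FrozenSet, Tuple

# Hypothetical field names; replace with the fields the real API accepts.
ALLOWED_FIELDS: FrozenSet[str] = frozenset(
    {"band_gap", "formation_energy_per_atom", "bulk_modulus"}
)


def parse_sort_field(arg: str, allowed: FrozenSet[str] = ALLOWED_FIELDS) -> Tuple[str, bool]:
    """Return (field, descending) or raise ValueError for a malformed argument.

    ASSUMPTION: a leading "-" means descending; no prefix means ascending.
    """
    arg = arg.strip()
    descending = arg.startswith("-")
    field = arg[1:] if descending else arg
    if field.startswith(("-", "+")) or field not in allowed:
        raise ValueError(f"invalid sort argument: {arg!r}")
    return field, descending


def check_sort_intent(arg: str, want_descending: bool) -> str:
    """Reject an argument whose sign contradicts the intended order."""
    _, descending = parse_sort_field(arg)
    if descending != want_descending:
        raise ValueError(
            f"{arg!r} sorts {'descending' if descending else 'ascending'}, "
            f"but the request asked for the opposite order"
        )
    return arg


# "Largest bandgap first" must use the descending form.
print(check_sort_intent("-band_gap", want_descending=True))
```

A check like this turns a silent wrong-order query into an explicit error that the agent can observe and correct on its next step.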

Core Entities

Models

GPT-4

GPT-3.5

Llama3-8b

Gemini-1.5-Flash

Claude-3.5-Sonnet

Metrics

Self-consistency of Response (SCoR)

Coefficient of Precision (CoP)

Precision (std dev)

Confidence (valid responses ratio)

Mean Absolute Error (MAE)

Accuracy

Datasets

Materials Project (MP)

Inorganic Crystal Structure Database (ICSD)

arXiv

Wikipedia

Benchmarks

bulk modulus prediction

formation energy prediction

electronic bandgap (common and multi-element)

magnetic ordering and magnetization

