WELLA: fine-tuned LLM agents that generate dynamic workload estimates for multi‑operator nuclear control rooms

Overview

Decision Snapshot: Needs Validation

Results show strong accuracy on simulator labels, but evidence is limited to specific HTGR scenarios and synthetic cognitive trajectories; broader validation is needed before production use.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Xingyu Xiao, Peng Chen, Qianqian Jia, Jiejuan Tong, Jingang Liang, Haitao Wang

Links

Abstract / PDF

Why It Matters For Business

Automating realistic workload data reduces expert labor and enables faster, cheaper safety testing and training for multi-operator control rooms.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, Data Scientist

Summary TLDR

This paper introduces WELLA, a workflow that fine-tunes Qwen2.5-7B to simulate operator cognition and predict workload in multi-operator nuclear control-room scenarios. The authors collect NASA-TLX and SART labels from an HTGR simulator (28 startup, 11 shutdown, and 30 accident cases), generate virtual cognitive trajectories with Claude, then apply supervised fine-tuning (SFT) to the model with Llama-factory. WELLA yields strong accuracy on simulator-derived labels (e.g., R2 up to 0.9628 for RO3) and beats baseline LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet) on MAE and RMSE in these scenario tests. Limitations: the data covers only a few scenarios, operator-specific traits are not modeled, and the synthetic trajectories may inherit bias from the generator.

Problem Statement

Human reliability analysis (HRA) lacks dynamic, fine-grained workload data. Existing approaches are static, expert-driven, or labor-intensive. The paper seeks an automated, scenario-based method to generate dynamic workload and situational-awareness labels for multi-operator control rooms.

Main Contribution

A pipeline (WELLA) that uses LLMs and multi-agent role play to generate dynamic workload and situational-awareness data.

Fine-tuning Qwen2.5-7B with supervised data (NASA-TLX and SART) using Llama-factory to create a domain model for workload prediction.
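To make the supervised fine-tuning step concrete, here is a minimal sketch of how one labeled case might be turned into a training record. It assumes an Alpaca-style layout (instruction, input, output), which Llama-factory accepts; the prompt wording, the `build_sft_record` helper, and the example values are hypothetical and do not come from the paper.

```python
import json


def build_sft_record(role, scenario, trajectory, tlx_score):
    """Return one Alpaca-style record: instruction, input, output.

    The prompt wording is an assumption for illustration only.
    """
    return {
        "instruction": (
            f"You are operator {role} in a multi-operator control room. "
            "Estimate your overall NASA-TLX workload on a 0-100 scale."
        ),
        "input": f"Scenario: {scenario}\nCognitive trajectory: {trajectory}",
        # The label is stored as text because SFT targets are strings.
        "output": f"{tlx_score:.1f}",
    }


# Made-up example values, not data from the paper.
record = build_sft_record(
    role="RO3",
    scenario="startup",
    trajectory="Checks loop parameters, then confirms the next procedure step.",
    tlx_score=42.5,
)
print(json.dumps(record, indent=2))
```

A list of such records, written to a JSON file, is the usual input shape for this kind of fine-tuning run; the exact dataset registration steps depend on the Llama-factory version in use.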

Key Findings

WELLA predicts per-role workload with very high fit for RO3.

Numbers: RO3 R2=0.9628, RMSE=3.5327, MAE=1.92

Practical Use: If you need accurate workload estimates in similar reactor-role simulations, fine-tuning a domain LLM like WELLA can cut prediction errors substantially versus off-the-shelf LLMs.

Evidence Ref: Table 3

WELLA outperforms commercial LLMs on aggregate simulator data.

Numbers: ALL data: WELLA R2=0.3822, MAE=4.5161 vs GPT-4 MAE=30.1935

Practical Use: Domain-tuned LLMs give much lower absolute errors on local simulator labels; prefer SFTed models when you have labeled simulator surveys.

Evidence Ref: Table 6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
RO3 R2	0.9628	GPT-4 R2 = -2.0091	+2.9719	Role RO3	Table 3: WELLA R2=0.9628 vs GPT-4 R2=-2.0091	Table 3
ALL MAE	4.5161	GPT-4 MAE = 30.1935	-25.6774	Combined dataset (ALL)	Table 6: WELLA MAE=4.5161 vs GPT-4 MAE=30.1935	Table 6
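The negative baseline R2 in the table can look odd at first. The sketch below computes MAE, RMSE, and R2 from their standard definitions and shows that R2 drops below zero whenever predictions are worse than always guessing the mean label. The label and prediction lists are made-up illustrations, not data from the paper; only the two delta checks at the end use numbers from the table.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R2 from their standard definitions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,  # negative when ss_res > ss_tot
    }


labels = [40.0, 50.0, 60.0]                       # illustrative only
close = regression_metrics(labels, [42.0, 49.0, 61.0])
far = regression_metrics(labels, [80.0, 80.0, 80.0])
print(close)
print(far)

# Deltas reported in the table: metric value minus baseline value.
delta_r2 = round(0.9628 - (-2.0091), 4)
delta_mae = round(4.5161 - 30.1935, 4)
print(delta_r2, delta_mae)
```

In the illustrative `far` case every prediction overshoots by 20 to 40 points, so the residual error exceeds the variance of the labels and R2 is negative, which is the same pattern the GPT-4 baseline shows for RO3.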

What To Try In 7 Days

Collect a small set of simulator TLX/SART labels for a target scenario.

Fine-tune a public LLM (e.g., Qwen2.5-7B) with Llama-factory on that data and test MAE/R2 against simple baselines.

Design per-role prompts and run a few multi-agent role-play simulations to inspect predicted workload trajectories.
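As a starting point for the per-role prompt step, the following sketch builds one prompt per operator role and parses a numeric workload from a model reply. The role labels come from the summary (RO1/RO2/RO3/CO/SO); the prompt template, `build_role_prompts`, and `parse_workload` are hypothetical helpers, and the call to an actual model is deliberately left out.

```python
import re

ROLES = ("RO1", "RO2", "RO3", "CO", "SO")

# Hypothetical prompt wording; adapt it to the target procedures.
TEMPLATE = (
    "You are {role} in a multi-operator control room. "
    "Event: {event} "
    "Reply with your current workload as a single number from 0 to 100."
)

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def build_role_prompts(event, roles=ROLES):
    """Return a dict mapping each role to its prompt for one event."""
    return {role: TEMPLATE.format(role=role, event=event) for role in roles}


def parse_workload(reply):
    """Return the first number in the reply if it lies in 0..100, else None."""
    match = _NUMBER.search(reply)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 100.0 else None


prompts = build_role_prompts("Coolant flow alarm during startup.")
for role, prompt in prompts.items():
    print(role, "->", prompt)

print(parse_workload("Workload: 37.5"))
```

Because the parser takes the first number it finds, a reply that repeats a role label such as "RO3" before the score would be misread; asking the model for a bare number, as the template does, keeps parsing simple.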

Agent Features

Memory

Virtual cognitive trajectory (short-term role reasoning)

Planning

Scenario-driven role simulation

Tool Use

Llama-factory, Claude (for trajectory generation)

Frameworks

Llama-factory

Is Agentic

Yes

Architectures

Multi-agent LLM roles, SFT

Collaboration

Multi-role coordination (RO1/RO2/RO3/CO/SO)

Optimization Features

Training Optimization

SFT

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Training and evaluation limited to HTGR simulator scenarios and modest case counts.

Virtual cognitive trajectories were generated by another LLM (Claude), which may introduce bias.

When Not To Use

Do not use as a live control-room decision tool without human oversight and further validation.

Not suitable for domains with no similar simulator data or different operator workflows.

Failure Modes

Hallucinated or implausible cognitive steps from generated trajectories.

Poor predictions for roles or scenarios that were under-described (noted for SO).

Core Entities

Models

Qwen2.5-7B, WELLA (fine-tuned Qwen2.5-7B)

Metrics

R2, RMSE, MAE, EV

Datasets

HTGR simulator operator data (NASA-TLX, SART); Virtual cognitive trajectory library (LLM-generated)
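For readers unfamiliar with the two survey instruments, this sketch shows how a single label is commonly derived from each. It assumes the unweighted "raw TLX" variant (mean of the six subscales) and the three-dimension SART combination (understanding minus the gap between demand and supply). The summary does not say which scoring variants the authors used, and the ratings below are made up.

```python
TLX_SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)


def raw_tlx(ratings):
    """Unweighted NASA-TLX: mean of the six subscale ratings (0-100 each)."""
    return sum(ratings[name] for name in TLX_SUBSCALES) / len(TLX_SUBSCALES)


def sart_score(understanding, demand, supply):
    """Three-dimension SART: understanding - (demand - supply)."""
    return understanding - (demand - supply)


# Made-up ratings for illustration, not data from the paper.
example_ratings = {
    "mental_demand": 60,
    "physical_demand": 20,
    "temporal_demand": 50,
    "performance": 30,
    "effort": 55,
    "frustration": 25,
}
tlx = raw_tlx(example_ratings)
sart = sart_score(understanding=5, demand=4, supply=6)
print(tlx, sart)
```

If the weighted NASA-TLX procedure is used instead, each subscale is multiplied by a weight from pairwise comparisons before averaging, so the labels would differ from this unweighted version.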

