WELLA: fine-tuned LLM agents that generate dynamic workload estimates for multi‑operator nuclear control rooms

January 16, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

Results show strong accuracy on simulator labels, but evidence is limited to specific HTGR scenarios and synthetic cognitive trajectories; broader validation is needed before production use.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Xingyu Xiao, Peng Chen, Qianqian Jia, Jiejuan Tong, Jingang Liang, Haitao Wang

Why It Matters For Business

Automating realistic workload data reduces expert labor and enables faster, cheaper safety testing and training for multi-operator control rooms.

Summary TLDR

This paper introduces WELLA, a workflow that fine-tunes Qwen2.5-7B to simulate operator cognition and predict workload in multi-operator nuclear control-room scenarios. The authors collect NASA-TLX and SART labels from an HTGR simulator (28 startup, 11 shutdown, and 30 accident cases), generate virtual cognitive trajectories with Claude, then apply supervised fine-tuning (SFT) with Llama-factory. WELLA fits the simulator-derived labels well (e.g., R2 up to 0.9628 for RO3) and beats baseline LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet) on MAE and RMSE in these scenario tests. Limitations: the data covers only a few scenarios, operator-specific traits are not modeled, and the synthetic trajectories may inherit generator bias.

Problem Statement

Human reliability analysis (HRA) lacks dynamic, fine-grained workload data. Existing approaches are static, expert-driven, or labor-intensive. The paper seeks an automated, scenario-based method to generate dynamic workload and situational-awareness labels for multi-agent operations.

Main Contribution

A pipeline (WELLA) that uses LLMs and multi-agent role play to generate dynamic workload and situational-awareness data.

Supervised fine-tuning of Qwen2.5-7B on NASA-TLX and SART labels, using Llama-factory, to create a domain model for workload prediction.
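As a rough illustration of what the NASA-TLX and SART labels represent, the sketch below computes a raw (unweighted) NASA-TLX score as the mean of the six standard subscales, and a SART score using the common three-dimensional combination SA = understanding - (demand - supply). This summary does not say whether the paper uses raw or weighted TLX, or which SART variant, so both choices are assumptions, and the ratings are made up for illustration.

```python
from statistics import mean

# The six standard NASA-TLX subscales (rated 0-100 in this sketch).
TLX_SUBSCALES = (
    "mental_demand", "physical_demand", "temporal_demand",
    "performance", "effort", "frustration",
)


def raw_tlx(ratings: dict[str, float]) -> float:
    """Raw (unweighted) NASA-TLX: the mean of the six subscale ratings."""
    missing = [name for name in TLX_SUBSCALES if name not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    return mean(ratings[name] for name in TLX_SUBSCALES)


def sart_score(demand: float, supply: float, understanding: float) -> float:
    """3-D SART combination: SA = understanding - (demand - supply)."""
    return understanding - (demand - supply)


# Hypothetical ratings for one operator at one scenario step (not from the paper).
example_tlx = {
    "mental_demand": 70, "physical_demand": 20, "temporal_demand": 60,
    "performance": 30, "effort": 65, "frustration": 25,
}
tlx = raw_tlx(example_tlx)
sa = sart_score(demand=12, supply=15, understanding=14)
print(f"raw TLX = {tlx}, SART = {sa}")
```

Higher TLX means higher perceived workload; higher SART means better situational awareness, so the two labels move in different directions as a scenario gets harder.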

Key Findings

WELLA predicts per-role workload with very high fit for RO3.

Numbers: RO3 R2=0.9628, RMSE=3.5327, MAE=1.92

Practical Use: If you need accurate workload estimates in similar reactor-role simulations, fine-tuning a domain LLM like WELLA can cut prediction errors substantially versus off-the-shelf LLMs.

Evidence Ref: Table 3

WELLA outperforms commercial LLMs on aggregate simulator data.

Numbers: ALL data: WELLA R2=0.3822, MAE=4.5161 vs GPT-4 MAE=30.1935

Practical Use: Domain-tuned LLMs give much lower absolute errors on local simulator labels; prefer SFTed models when you have labeled simulator surveys.

Evidence Ref: Table 6
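For readers less familiar with the metrics quoted in these findings, here is a minimal sketch of MAE, RMSE, and R2 using their standard definitions. The toy scores are invented and are not the paper's data; the point is that R2 drops below zero whenever a model does worse than always predicting the mean label.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy workload scores (illustrative only).
labels = [40.0, 50.0, 60.0, 70.0]
close_preds = [42.0, 49.0, 61.0, 68.0]   # tracks the labels closely
far_preds = [80.0, 80.0, 80.0, 80.0]     # constant and biased high

for name, preds in (("close", close_preds), ("far", far_preds)):
    print(name, mae(labels, preds), rmse(labels, preds), r2(labels, preds))
```

The constant, biased predictor ends up with a negative R2, which is how a baseline can score below zero on a given split.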

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
RO3 R2 | 0.9628 | GPT-4 R2 = -2.0091 | +2.9719 | Scenario RO3 | Table 3: WELLA R2=0.9628 vs GPT-4 R2=-2.0091 | Table 3
ALL MAE | 4.5161 | GPT-4 MAE = 30.1935 | -25.6774 | Combined dataset (ALL) | Table 6: WELLA MAE=4.5161 vs GPT-4 MAE=30.1935 | Table 6
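The Delta column is simply the WELLA value minus the baseline value. The short check below recomputes both deltas from the numbers in the rows above and also derives the relative MAE reduction, which the table does not state directly. The dictionary layout is just a convenience for this sketch.

```python
# Values copied from the Results rows above.
results = [
    {"metric": "RO3 R2", "wella": 0.9628, "baseline": -2.0091, "reported_delta": 2.9719},
    {"metric": "ALL MAE", "wella": 4.5161, "baseline": 30.1935, "reported_delta": -25.6774},
]


def delta(row):
    """Delta as used in the table: WELLA value minus baseline value."""
    return row["wella"] - row["baseline"]


# Compare with a tolerance of half a unit in the fourth decimal place.
checks = {
    row["metric"]: abs(delta(row) - row["reported_delta"]) < 5e-5
    for row in results
}

# Relative reduction in MAE versus GPT-4 on the combined dataset.
mae_row = results[1]
mae_reduction = 1.0 - mae_row["wella"] / mae_row["baseline"]

print(checks)
print(f"relative MAE reduction: {mae_reduction:.1%}")
```

Both reported deltas are consistent with the listed values, and the MAE gap corresponds to a reduction of roughly 85% relative to GPT-4 on this dataset.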

What To Try In 7 Days

Collect a small set of simulator TLX/SART labels for a target scenario.

Fine-tune a public LLM (e.g., Qwen2.5-7B) with Llama-factory on that data and test MAE/R2 against simple baselines.

Design per-role prompts and run a few multi-agent role-play simulations to inspect predicted workload trajectories.
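To make the first two steps concrete, the sketch below turns a few labeled simulator rows into Alpaca-style records (`instruction`, `input`, `output`), a format Llama-factory documents for supervised fine-tuning. The prompt wording, the row layout, and the example values are all hypothetical and are not taken from the paper; the resulting file would also need to be registered in the tool's dataset configuration before training.

```python
import json


def to_sft_record(role, scenario, step_description, tlx_score):
    """Build one Alpaca-style training record (prompt wording is hypothetical)."""
    return {
        "instruction": (
            f"You are operator {role} in an HTGR control room. "
            "Estimate your NASA-TLX workload (0-100) for the current step."
        ),
        "input": f"Scenario: {scenario}\nStep: {step_description}",
        "output": str(tlx_score),
    }


# Hypothetical labeled rows: (role, scenario, step description, TLX label).
rows = [
    ("RO1", "startup", "Placeholder description of a startup step", 42.5),
    ("RO3", "accident", "Placeholder description of an accident step", 71.0),
]

records = [to_sft_record(*row) for row in rows]

# Serialize to a JSON array; write this string to a file for training.
payload = json.dumps(records, indent=2, ensure_ascii=False)
print(payload)
```

Keeping the label as a plain numeric string makes it easy to parse model outputs back into numbers when computing MAE and R2 against held-out labels.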

Agent Features

Memory: Virtual cognitive trajectory (short-term role reasoning)
Planning: Scenario-driven role simulation
Tool Use: Llama-factory, Claude (for trajectory generation)
Frameworks: Llama-factory
Is Agentic: Yes
Architectures: Multi-agent LLM roles, SFT
Collaboration: Multi-role coordination (RO1/RO2/RO3/CO/SO)
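One way to picture the multi-role setup is a loop that asks each role for a workload estimate at every scenario step and collects a per-role trajectory. The sketch below is only a skeleton under stated assumptions: `stub_estimate` is a hypothetical stand-in for a call to a fine-tuned model, the step texts are invented, and the paper's actual orchestration may differ.

```python
from typing import Callable

# Role labels as listed in this summary.
ROLES = ("RO1", "RO2", "RO3", "CO", "SO")


def run_role_play(steps: list[str], estimate: Callable[[str, str], float]) -> dict[str, list[float]]:
    """Return {role: [workload estimate per step]} for a scenario."""
    trajectories: dict[str, list[float]] = {role: [] for role in ROLES}
    for step in steps:
        for role in ROLES:
            trajectories[role].append(estimate(role, step))
    return trajectories


def stub_estimate(role: str, step: str) -> float:
    """Hypothetical placeholder for a model call; returns fixed illustrative values."""
    base = {"RO1": 40.0, "RO2": 45.0, "RO3": 50.0, "CO": 55.0, "SO": 35.0}[role]
    bump = 20.0 if "alarm" in step.lower() else 0.0
    return base + bump


# Invented scenario steps, for illustration only.
steps = [
    "Raise power to target level",
    "Respond to coolant flow alarm",
    "Confirm stable conditions",
]

trajectories = run_role_play(steps, stub_estimate)
for role, scores in trajectories.items():
    print(role, scores)
```

Passing the estimator in as a callable keeps the loop testable with a stub and lets a real model client be swapped in later without changing the orchestration code.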

Optimization Features

Training Optimization: SFT

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Training and evaluation limited to HTGR simulator scenarios and modest case counts.

Virtual cognitive trajectories were generated by another LLM (Claude), which may introduce bias.

When Not To Use

Do not use as a live control-room decision tool without human oversight and further validation.

Not suitable for domains with no similar simulator data or different operator workflows.

Failure Modes

Hallucinated or implausible cognitive steps from generated trajectories.

Poor predictions for roles or scenarios that were under-described (noted for SO).
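A simple guard against the second failure mode is to break the error down by role and flag any role whose MAE exceeds a chosen threshold. The sketch below does this with invented samples; the threshold of 5 points is an arbitrary assumption, not a value from the paper.

```python
from collections import defaultdict


def mae_by_role(samples):
    """samples: iterable of (role, label, prediction) -> {role: MAE}."""
    errors = defaultdict(list)
    for role, label, pred in samples:
        errors[role].append(abs(label - pred))
    return {role: sum(errs) / len(errs) for role, errs in errors.items()}


def flag_roles(per_role_mae, threshold):
    """Return the roles whose MAE is above the threshold, sorted by name."""
    return sorted(role for role, value in per_role_mae.items() if value > threshold)


# Invented (role, label, prediction) triples, for illustration only.
samples = [
    ("RO3", 50, 52), ("RO3", 60, 59),
    ("SO", 40, 55), ("SO", 45, 30),
]

per_role = mae_by_role(samples)
weak_roles = flag_roles(per_role, threshold=5.0)
print(per_role, weak_roles)
```

An aggregate MAE can hide a weak role, so a per-role breakdown like this is a cheap check to run before trusting predictions for every operator position.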

Core Entities

Models

Qwen2.5-7B, WELLA (fine-tuned Qwen2.5-7B)

Metrics

R2, RMSE, MAE, EV
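Assuming EV here denotes the explained variance score (this summary does not define it), the sketch below shows how it differs from R2: EV measures the variance of the residuals and therefore ignores a constant bias, whereas R2 penalizes it. The toy numbers are invented.

```python
def _mean(values):
    return sum(values) / len(values)


def _variance(values):
    """Population variance."""
    m = _mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """EV = 1 - Var(y_true - y_pred) / Var(y_true)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - _variance(residuals) / _variance(y_true)


def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    m = _mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Predictions that follow the trend perfectly but sit 10 points too high.
labels = [40.0, 50.0, 60.0, 70.0]
biased_preds = [50.0, 60.0, 70.0, 80.0]

ev = explained_variance(labels, biased_preds)
r2_value = r2(labels, biased_preds)
print(f"EV = {ev}, R2 = {r2_value}")
```

A large gap between EV and R2 on the same split would suggest a systematic offset in the predictions rather than noisy ones.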

Datasets

HTGR simulator operator data (NASA-TLX, SART), Virtual cognitive trajectory library (LLM-generated)