WELLA: fine-tuned LLM agents that generate dynamic workload estimates for multi‑operator nuclear control rooms

January 16, 2025, 6 min read

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Xingyu Xiao, Peng Chen, Qianqian Jia, Jiejuan Tong, Jingang Liang, Haitao Wang

Why It Matters For Business

Automating realistic workload data reduces expert labor and enables faster, cheaper safety testing and training for multi-operator control rooms.

Summary TLDR

This paper introduces WELLA, a workflow that fine-tunes Qwen2.5-7B to simulate operator cognition and predict workload in multi-operator nuclear control-room scenarios. The authors collect NASA-TLX and SART labels from an HTGR simulator (28 startup, 11 shutdown, and 30 accident cases), generate virtual cognitive trajectories with Claude, and then run supervised fine-tuning (SFT) with Llama-factory. On simulator-derived labels, WELLA fits individual roles well (R2 up to 0.9628 for RO3), although the fit on the pooled data is much lower (R2 = 0.3822). It also beats the baseline LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet) on MAE and RMSE in these scenario tests. Limitations: the data cover only a few scenarios, operator-specific traits are not modeled, and the synthetic trajectories may inherit bias from the model that generated them.

Problem Statement

Human reliability analysis (HRA) lacks dynamic, fine-grained workload data. Existing approaches are static, expert-driven, or labor-intensive. The paper seeks an automated, scenario-based method for generating dynamic workload and situational-awareness labels for multi-operator control rooms.

Main Contribution

A pipeline (WELLA) that uses LLMs and multi-agent role play to generate dynamic workload and situational-awareness data.

Fine-tuning Qwen2.5-7B with supervised data (NASA-TLX and SART) using Llama-factory to create a domain model for workload prediction.

A method to generate virtual cognitive trajectories with an LLM (Claude) and use them as SFT training data for simulation agents.

Empirical comparison showing WELLA outperforms general-purpose LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet) on simulator-collected workload labels.
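
To make the data-preparation side of these contributions concrete, here is a minimal sketch of how one training example might be packed for SFT. It assumes the Alpaca-style record layout (`instruction`, `input`, `output`) that Llama-factory can read. The prompt wording, the scenario text, the trajectory text, the 0-100 score scale, and the helper name `to_sft_record` are all hypothetical and are not taken from the paper.

```python
import json

def to_sft_record(role, scenario, trajectory, tlx_score):
    """Build one Alpaca-style SFT record (instruction / input / output)."""
    return {
        "instruction": (
            f"You are operator {role} in a multi-operator control room. "
            "Reason through the scenario step by step, then report your "
            "workload as 'Workload: <score>' (assumed 0-100 scale)."
        ),
        "input": scenario,
        # The target is the generated cognitive trajectory followed by the
        # workload label collected on the simulator.
        "output": f"{trajectory}\nWorkload: {tlx_score}",
    }

# Made-up example values, for illustration only.
sample = to_sft_record(
    role="RO1",
    scenario="Hypothetical startup step: confirm coolant flow before raising power.",
    trajectory="I read the flow indicator, compare it with the procedure, and report to the crew.",
    tlx_score=42.5,
)

# Serialized form of a one-record dataset file.
payload = json.dumps([sample], ensure_ascii=False, indent=2)
print(payload)
```

Keeping the trajectory and the numeric label in the same `output` string is one possible design; a pipeline could equally train the reasoning and the score as separate tasks.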

Key Findings

WELLA predicts per-role workload with very high fit for RO3.

Numbers: RO3 R2=0.9628, RMSE=3.5327, MAE=1.92

WELLA outperforms commercial LLMs on aggregate simulator data.

Numbers: ALL data: WELLA R2=0.3822, MAE=4.5161 vs GPT-4 MAE=30.1935

Training uses real operator surveys and limited scenario counts.

Numbers: Scenario instances: Startup 28, Shutdown 11, Accident 30

Implementation details: small-batch SFT on 2 A800 GPUs.

Numbers: 2×A800 GPUs, batch size=2, lr=1e-5, epochs=8
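
The reported hyperparameters could be written down roughly as follows. This is a sketch, not the authors' configuration: the key names follow Llama-factory's published example configs, while the checkpoint id, the choice of full fine-tuning, the dataset name, the output path, and the reading of "batch size=2" as a per-device value are assumptions the summary does not confirm.

```python
import json

N_GPUS = 2  # reported: 2 x A800

sft_config = {
    "model_name_or_path": "Qwen/Qwen2.5-7B",  # assumption: base checkpoint
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "full",         # assumption: full vs LoRA is not stated
    "dataset": "wella_workload",       # hypothetical dataset name
    "template": "qwen",
    "per_device_train_batch_size": 2,  # assumption: reported batch size is per device
    "gradient_accumulation_steps": 1,  # assumption: not reported
    "learning_rate": 1e-5,             # reported
    "num_train_epochs": 8,             # reported
    "output_dir": "saves/wella-sft",   # hypothetical path
}

# Effective batch size under the per-device assumption above.
effective_batch = (
    N_GPUS
    * sft_config["per_device_train_batch_size"]
    * sft_config["gradient_accumulation_steps"]
)

config_json = json.dumps(sft_config, indent=2)
print(config_json)
print("effective batch size:", effective_batch)
```

The same keys can be written as a YAML file and passed to `llamafactory-cli train`. With so few scenario instances, a small effective batch keeps the number of optimizer steps per epoch from collapsing.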

Results

RO3 R2

Value: 0.9628

Baseline: GPT-4 R2 = -2.0091

ALL MAE

Value: 4.5161

Baseline: GPT-4 MAE = 30.1935

RO1 R2

Value: 0.9012

Baseline: GPT-4 R2 = -0.7107
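
The negative baseline R2 values are easier to read next to the metric definitions. The sketch below implements MAE, RMSE, and R2 in plain Python and evaluates them on made-up numbers (not the paper's data). An R2 below zero means the predictions are worse than always guessing the mean label.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up workload labels and two made-up prediction sets.
labels = [40.0, 50.0, 60.0]
close = [42.0, 49.0, 61.0]   # small errors
biased = [70.0, 80.0, 90.0]  # right trend, but 30 points too high

print(mae(labels, close), rmse(labels, close), r2(labels, close))
print(mae(labels, biased), rmse(labels, biased), r2(labels, biased))
```

In this toy case the biased predictor tracks the trend perfectly yet scores R2 = -12.5, because its squared error is far larger than the spread of the labels. A poorly calibrated general-purpose model can end up with a large MAE and a negative R2 in the same way.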

Who Should Care

What To Try In 7 Days

Collect a small set of simulator TLX/SART labels for a target scenario.

Fine-tune a public LLM (e.g., Qwen2.5-7B) with Llama-factory on that data and test MAE/R2 against simple baselines.

Design per-role prompts and run a few multi-agent role-play simulations to inspect predicted workload trajectories.
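
The role-play step can be prototyped before any model is trained. The sketch below loops over the roles named in this summary (RO1/RO2/RO3/CO/SO), builds one prompt per role and scenario step, and parses a numeric score from each reply. The prompt wording, the `Workload: <number>` reply convention, and the `fake_generate` stand-in are hypothetical. In practice `generate` would call the fine-tuned model.

```python
import re

ROLES = ["RO1", "RO2", "RO3", "CO", "SO"]

def build_prompt(role, step_description):
    """Compose a per-role prompt for one scenario step (wording is illustrative)."""
    return (
        f"Role: {role}\n"
        f"Scenario step: {step_description}\n"
        "Describe your reasoning, then end with 'Workload: <number>'."
    )

def parse_workload(text):
    """Return the first 'Workload: <number>' value as a float, or None if absent."""
    match = re.search(r"Workload:\s*(-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

def simulate(steps, generate):
    """Return {role: [workload per step]} using a text-generation callable."""
    trajectories = {role: [] for role in ROLES}
    for step in steps:
        for role in ROLES:
            reply = generate(build_prompt(role, step))
            trajectories[role].append(parse_workload(reply))
    return trajectories

def fake_generate(prompt):
    # Stand-in for a model call: always returns the same made-up reply.
    return "I check the alarm panel first.\nWorkload: 35"

result = simulate(["step A", "step B"], fake_generate)
print(result)
```

Replies without a parsable score come back as `None`, which makes malformed generations easy to count before computing MAE or R2.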

Agent Features

Memory

  • Virtual cognitive trajectory (short-term role reasoning)

Planning

  • Scenario-driven role simulation

Tool Use

  • Llama-factory
  • Claude (for trajectory generation)

Frameworks

  • Llama-factory

Is Agentic

true

Architectures

  • Multi-agent LLM roles
  • SFT

Collaboration

  • Multi-role coordination (RO1/RO2/RO3/CO/SO)

Optimization Features

Training Optimization

  • SFT

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training and evaluation limited to HTGR simulator scenarios and modest case counts.
  • Virtual cognitive trajectories were generated by another LLM (Claude), which may introduce bias.
  • Operator-specific traits and real-world deployment variability are not modeled.
  • Baseline model configurations and access details for commercial models are not fully specified.

When Not To Use

  • Do not use as a live control-room decision tool without human oversight and further validation.
  • Not suitable for domains with no similar simulator data or different operator workflows.
  • Avoid relying on WELLA alone for safety-critical certification or regulatory decisions.

Failure Modes

  • Hallucinated or implausible cognitive steps from generated trajectories.
  • Poor predictions for roles or scenarios that were under-described (noted for SO).
  • Overfitting to simulator-specific phrasing or scenario templates.

Core Entities

Models

  • Qwen2.5-7B
  • WELLA (fine-tuned Qwen2.5-7B)

Metrics

  • R2
  • RMSE
  • MAE
  • EV
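
Assuming EV stands for the explained variance score (the summary does not expand the abbreviation), it differs from R2 in one useful way: it ignores a constant offset in the predictions. The sketch below uses made-up numbers to show a predictor with a fixed bias scoring a perfect EV while its R2 is negative.

```python
def variance(xs):
    """Population variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true); insensitive to a constant bias."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - variance(residuals) / variance(y_true)

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; penalizes a constant bias."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

labels = [40.0, 50.0, 60.0]   # made-up labels
shifted = [50.0, 60.0, 70.0]  # every prediction is 10 points too high

ev_score = explained_variance(labels, shifted)
r2_score = r2(labels, shifted)
print(ev_score, r2_score)
```

A large gap between EV and R2 for a model therefore points to systematic over- or under-estimation rather than random error.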

Datasets

  • HTGR simulator operator data (NASA-TLX, SART)
  • Virtual cognitive trajectory library (LLM-generated)
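
For readers unfamiliar with the two questionnaires, the sketch below shows how their scores are commonly computed. It assumes the unweighted "raw TLX" variant (mean of the six subscale ratings) and the three-dimensional SART formula SA = U - (D - S). The summary does not say which variants the authors used, and the ratings are made up.

```python
TLX_SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)

def raw_tlx(ratings):
    """Unweighted NASA-TLX: mean of the six subscale ratings (0-100 each)."""
    return sum(ratings[name] for name in TLX_SUBSCALES) / len(TLX_SUBSCALES)

def sart_score(demand, supply, understanding):
    """3D SART: understanding minus (attentional demand minus attentional supply)."""
    return understanding - (demand - supply)

# Made-up ratings for one operator after one scenario.
ratings = {
    "mental_demand": 60,
    "physical_demand": 20,
    "temporal_demand": 70,
    "performance": 30,
    "effort": 50,
    "frustration": 40,
}

tlx = raw_tlx(ratings)
sa = sart_score(demand=12, supply=15, understanding=14)
print(tlx, sa)
```

The weighted NASA-TLX variant instead multiplies each rating by a weight derived from pairwise comparisons of the subscales, so the label definition should be fixed before comparing MAE across studies.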