Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
Automating realistic workload data reduces expert labor and enables faster, cheaper safety testing and training for multi-operator control rooms.
Summary TLDR
This paper introduces WELLA, a workflow that fine-tunes Qwen2.5-7B to simulate operator cognition and predict workload in multi-operator nuclear control-room scenarios. The authors collect NASA-TLX and SART labels from an HTGR simulator (28 startup, 11 shutdown, and 30 accident cases), generate virtual cognitive trajectories with Claude, then run supervised fine-tuning (SFT) on the model with Llama-factory. WELLA fits the simulator-derived labels closely (e.g., R2 up to 0.9628 for RO3) and beats baseline LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet) on MAE and RMSE in these scenario tests. Limitations: the data cover only a few scenarios, operator-specific traits are not modeled, and the synthetic trajectories may inherit bias from the generator.
Problem Statement
Human reliability analysis (HRA) lacks dynamic, fine-grained workload data. Existing approaches are static, expert-driven, or labor-intensive. The paper seeks an automated, scenario-based method to generate dynamic workload and situational-awareness labels for multi-agent operations.
Main Contribution
A pipeline (WELLA) that uses LLMs and multi-agent role play to generate dynamic workload and situational-awareness data.
Supervised fine-tuning of Qwen2.5-7B on NASA-TLX and SART labels with Llama-factory, yielding a domain model for workload prediction.
A method to generate virtual cognitive trajectories with an LLM (Claude) and use them as SFT training data for simulation agents.
Empirical comparison showing WELLA outperforms general-purpose LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet) on simulator-collected workload labels.
Key Findings
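WELLA predicts per-role workload with a very high fit for RO3 (R2 up to 0.9628).
WELLA outperforms commercial LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet) on MAE and RMSE over the aggregate simulator data.
Training uses real operator surveys (NASA-TLX and SART) from a limited set of scenarios: 28 startup, 11 shutdown, and 30 accident cases.
Implementation: small-batch SFT on 2 A800 GPUs.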
Results
RO3 R2
ALL MAE
RO1 R2
Who Should Care
What To Try In 7 Days
Collect a small set of simulator TLX/SART labels for a target scenario.
Fine-tune a public LLM (e.g., Qwen2.5-7B) with Llama-factory on that data and test MAE/R2 against simple baselines.
Design per-role prompts and run a few multi-agent role-play simulations to inspect predicted workload trajectories.
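The fine-tuning and per-role prompt steps above can be sketched as a small data-preparation script. This is a minimal illustration, not the paper's pipeline: the prompt wording, the `build_record` and `to_json` helpers, and the example values are all hypothetical, and the instruction/input/output (Alpaca-style) layout is assumed as one common SFT format. Check the fine-tuning toolkit's dataset documentation for the exact fields it expects.

```python
import json

# Role labels as listed in the source; everything else here is illustrative.
ROLES = ("RO1", "RO2", "RO3", "CO", "SO")


def build_record(role, scenario, trajectory, tlx_score):
    """Build one Alpaca-style SFT record for a single operator role.

    The prompt text is a hypothetical template, not the paper's prompt.
    """
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    return {
        "instruction": (
            f"You are {role} in a multi-operator control room. "
            "Estimate your NASA-TLX workload for the scenario step below."
        ),
        "input": f"Scenario: {scenario}\nReasoning so far: {trajectory}",
        "output": f"{tlx_score:.1f}",
    }


def to_json(records):
    """Serialize records as a JSON array, a layout many SFT tools accept."""
    return json.dumps(records, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    record = build_record(
        role="RO3",
        scenario="startup, step 4",
        trajectory="Monitoring feedwater flow before raising power.",
        tlx_score=42.0,
    )
    print(to_json([record]))
```

Keeping one record per role and scenario step makes it easy to inspect predicted workload trajectories role by role after fine-tuning.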
Agent Features
Memory
- Virtual cognitive trajectory (short-term role reasoning)
Planning
- Scenario-driven role simulation
Tool Use
- Llama-factory
- Claude (for trajectory generation)
Frameworks
- Llama-factory
Is Agentic
true
Architectures
- Multi-agent LLM roles
- SFT
Collaboration
- Multi-role coordination (RO1/RO2/RO3/CO/SO)
Optimization Features
Training Optimization
- SFT
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training and evaluation limited to HTGR simulator scenarios and modest case counts.
- Virtual cognitive trajectories were generated by another LLM (Claude), which may introduce bias.
- Operator-specific traits and real-world deployment variability are not modeled.
- Baseline model configurations and access details for commercial models are not fully specified.
When Not To Use
- Do not use as a live control-room decision tool without human oversight and further validation.
- Not suitable for domains with no similar simulator data or different operator workflows.
- Avoid relying on WELLA alone for safety-critical certification or regulatory decisions.
Failure Modes
- Hallucinated or implausible cognitive steps from generated trajectories.
- Poor predictions for roles or scenarios that were under-described (noted for SO).
- Overfitting to simulator-specific phrasing or scenario templates.
Core Entities
Models
- Qwen2.5-7B
- WELLA (fine-tuned Qwen2.5-7B)
Metrics
- R2
- RMSE
- MAE
- EV
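For reference, the four metrics listed above can be computed with the standard definitions below. This is a plain-Python sketch under two assumptions: that "EV" denotes explained variance, and that the paper uses the usual textbook formulas (neither is confirmed in this summary). The sample numbers are invented, and the mean-value baseline is one simple reference point for comparison.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def _variance(values):
    """Population variance."""
    mean_v = sum(values) / len(values)
    return sum((v - mean_v) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """Explained variance: 1 - Var(residuals) / Var(y_true)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - _variance(residuals) / _variance(y_true)


if __name__ == "__main__":
    # Invented workload labels and predictions, for illustration only.
    y_true = [1.0, 2.0, 3.0, 4.0]
    y_pred = [1.0, 2.0, 3.0, 5.0]
    # Simple baseline: always predict the mean label (R2 is 0 by construction).
    baseline = [sum(y_true) / len(y_true)] * len(y_true)
    for name, pred in (("model", y_pred), ("mean baseline", baseline)):
        print(name, mae(y_true, pred), rmse(y_true, pred), r2(y_true, pred))
```

R2 and explained variance differ only when the residuals have a non-zero mean, so a gap between the two points to a systematic offset in the predictions.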
Datasets
- HTGR simulator operator data (NASA-TLX, SART)
- Virtual cognitive trajectory library (LLM-generated)
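As a reminder of how questionnaire labels of this kind are typically reduced to a single number, the sketch below scores one NASA-TLX response and one SART response. It assumes the unweighted "raw TLX" variant (mean of the six subscales) and the three-dimensional SART formula SA = U - (D - S). This summary does not say which scoring variants the paper used, and the function names and example ratings are hypothetical.

```python
# The six standard NASA-TLX subscales.
TLX_SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)


def raw_tlx(ratings):
    """Raw (unweighted) NASA-TLX: mean of the six subscale ratings.

    Assumption: the weighted variant with pairwise comparisons is not used.
    """
    missing = [name for name in TLX_SUBSCALES if name not in ratings]
    if missing:
        raise KeyError(f"missing subscales: {missing}")
    return sum(ratings[name] for name in TLX_SUBSCALES) / len(TLX_SUBSCALES)


def sart_3d(demand, supply, understanding):
    """Three-dimensional SART score: SA = U - (D - S)."""
    return understanding - (demand - supply)


if __name__ == "__main__":
    # Invented ratings, for illustration only.
    example = {
        "mental_demand": 60,
        "physical_demand": 20,
        "temporal_demand": 40,
        "performance": 30,
        "effort": 50,
        "frustration": 40,
    }
    print(raw_tlx(example))
    print(sart_3d(demand=4, supply=5, understanding=6))
```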

