Overview
Results show strong accuracy on simulator labels, but evidence is limited to specific HTGR scenarios and synthetic cognitive trajectories; broader validation is needed before production use.
Citations: 1
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
Automating the generation of realistic workload data reduces expert labor and enables faster, cheaper safety testing and training for multi-operator control rooms.
Who Should Care
Summary TLDR
This paper introduces WELLA, a workflow that fine-tunes Qwen2.5-7B to simulate operator cognition and predict workload in multi-operator nuclear control-room scenarios. The authors collect NASA-TLX and SART labels from an HTGR simulator (28 startup, 11 shutdown, and 30 accident cases), generate virtual cognitive trajectories with Claude, then apply supervised fine-tuning (SFT) with Llama-factory. WELLA fits the simulator-derived labels closely (e.g., R2 up to 0.9628 for RO3) and beats baseline LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet) on MAE and RMSE in these scenario tests. Limitations: the data covers only a few scenarios, operator-specific traits are not modeled, and the synthetic trajectories may inherit bias from the generator model.
Problem Statement
Human reliability analysis (HRA) lacks dynamic, fine-grained workload data. Existing approaches are static, expert-driven, or labor-intensive. The paper seeks an automated, scenario-based method to generate dynamic workload and situational-awareness labels for multi-operator settings.
Main Contribution
A pipeline (WELLA) that uses LLMs and multi-agent role play to generate dynamic workload and situational-awareness data.
Fine-tuning Qwen2.5-7B with supervised data (NASA-TLX and SART) using Llama-factory to create a domain model for workload prediction.
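As a rough illustration of what one supervised training example for this kind of pipeline might look like, the sketch below builds a record in the Alpaca-style `instruction` / `input` / `output` layout, one of the dataset layouts Llama-factory accepts. The function name `make_sft_record`, the prompt wording, the JSON output schema, and every value shown are assumptions for illustration; the paper's actual prompts and label encoding are not reproduced here.

```python
import json


def make_sft_record(role, scenario, trajectory, tlx, sart):
    """Build one Alpaca-style SFT record (hypothetical schema, not the paper's)."""
    return {
        # Instruction: what the role-playing model is asked to do.
        "instruction": (
            f"You are operator {role} in a multi-operator control room. "
            "Estimate your NASA-TLX workload and SART situational-awareness scores."
        ),
        # Input: scenario context plus the generated cognitive trajectory.
        "input": f"Scenario: {scenario}\nCognitive trajectory: {trajectory}",
        # Output: the simulator-derived labels the model should learn to predict.
        "output": json.dumps({"nasa_tlx": tlx, "sart": sart}),
    }


# Placeholder values only; these are not data from the paper.
record = make_sft_record(
    role="RO3",
    scenario="startup",
    trajectory="(placeholder trajectory text)",
    tlx=42.0,
    sart=21.0,
)
print(json.dumps(record, indent=2))
```

Keeping the labels as a small JSON object in `output` is one possible design choice; it makes predicted scores easy to parse back into numbers at evaluation time.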
Key Findings
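WELLA predicts per-role workload with a very high fit for RO3 (R2 = 0.9628 vs. -2.0091 for GPT-4; Table 3).
WELLA outperforms commercial LLMs on aggregate simulator data (MAE = 4.5161 vs. 30.1935 for GPT-4 on the combined dataset; Table 6).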
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| RO3 R2 | 0.9628 | GPT-4 R2 = -2.0091 | +2.9719 | Role RO3 | Table 3: WELLA R2=0.9628 vs GPT-4 R2=-2.0091 | Table 3 |
| ALL MAE | 4.5161 | GPT-4 MAE = 30.1935 | -25.6774 | Combined dataset (ALL) | Table 6: WELLA MAE=4.5161 vs GPT-4 MAE=30.1935 | Table 6 |
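A negative baseline R2 such as -2.0091 can look odd at first. R2 compares a model's squared error with that of always predicting the mean label, so it drops below zero whenever the predictions are worse than that constant guess. The sketch below uses the standard definitions of MAE, RMSE, and R2 on made-up numbers (not data from the paper) and then recomputes the two deltas reported in the table.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative workload scores only (assumed 0-100 scale).
y_true = [40.0, 50.0, 60.0]
close_preds = [41.0, 49.0, 61.0]  # tracks the labels closely
far_preds = [70.0, 20.0, 90.0]    # worse than guessing the mean

print(r2(y_true, close_preds))  # close to 1
print(r2(y_true, far_preds))    # negative

# Deltas from the table: WELLA value minus GPT-4 baseline value.
delta_r2 = round(0.9628 - (-2.0091), 4)
delta_mae = round(4.5161 - 30.1935, 4)
print(delta_r2, delta_mae)
```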
What To Try In 7 Days
Collect a small set of simulator TLX/SART labels for a target scenario.
Fine-tune a public LLM (e.g., Qwen2.5-7B) with Llama-factory on that data and test MAE/R2 against simple baselines.
Design per-role prompts and run a few multi-agent role-play simulations to inspect predicted workload trajectories.
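For the first two steps, a minimal sketch of the bookkeeping might look like the following. It assumes the unweighted "raw TLX" variant (the mean of the six NASA-TLX subscale ratings), which may differ from the scoring used in the paper, and it uses a predict-the-training-mean baseline as the "simple baseline". All numbers, including `model_preds`, are hypothetical placeholders standing in for real simulator labels and fine-tuned model outputs.

```python
from statistics import mean

# The six standard NASA-TLX subscales.
TLX_SUBSCALES = (
    "mental", "physical", "temporal", "performance", "effort", "frustration",
)


def raw_tlx(ratings):
    """Raw (unweighted) TLX: mean of the six subscale ratings (assumed 0-100)."""
    return mean(ratings[name] for name in TLX_SUBSCALES)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# One hypothetical questionnaire response turned into a single label.
example_label = raw_tlx({
    "mental": 70, "physical": 10, "temporal": 60,
    "performance": 30, "effort": 50, "frustration": 20,
})

# Hypothetical train/test labels and fine-tuned model predictions.
train_scores = [35.0, 45.0, 55.0]
test_scores = [40.0, 60.0]
model_preds = [42.0, 57.0]

# Simple baseline: always predict the mean of the training labels.
baseline_preds = [mean(train_scores)] * len(test_scores)

baseline_mae = mae(test_scores, baseline_preds)
model_mae = mae(test_scores, model_preds)
print(example_label, baseline_mae, model_mae)
```

If the fine-tuned model cannot beat this constant baseline on held-out cases, more labels or better prompts are probably needed before comparing against larger LLMs.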
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Training Optimization
Reproducibility
Risks & Boundaries
Limitations
Training and evaluation limited to HTGR simulator scenarios and modest case counts.
Virtual cognitive trajectories were generated by another LLM (Claude), which may introduce bias.
When Not To Use
Do not use as a live control-room decision tool without human oversight and further validation.
Not suitable for domains with no similar simulator data or different operator workflows.
Failure Modes
Hallucinated or implausible cognitive steps from generated trajectories.
Poor predictions for roles or scenarios that were under-described (noted for SO).

