HyCE: run validated HPC commands inside RAG so an LLM answers user-specific cluster questions

Overview

Decision Snapshot: Needs Validation

HyCE is a practical, incremental method that improves RAG by adding live command outputs; evidence is from a synthetic 100-pair evaluation and one on-prem cluster test.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yusuke Miyashita, Patrick Kin Man Tung, Johan Barthélemy

Links

Abstract / PDF / Code

Why It Matters For Business

HyCE reduces user confusion and support load by letting an LLM provide live, user-specific cluster answers without expensive model fine-tuning.

Who Should Care

CTO, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper introduces HyCE (Hypothetical Command Embeddings), an extension to Retrieval-Augmented Generation (RAG) that embeds command descriptions, executes validated shell commands, and feeds their outputs to an LLM so answers reflect the user's real HPC environment. On synthetic HPC Q&A, HyCE raises the automatic RAG evaluation score from 77.67% to 82.33% (+4.66 percentage points) and improves semantic matching between user queries and command descriptions. It includes layered security (command whitelists, containers, restricted privileges). The code is open-sourced for prototype deployment.

Problem Statement

HPC users need precise, real-time answers about their specific cluster (available GPUs, job status, software). Standard RAG pulls static docs but cannot access live, user-specific command outputs. Fine-tuning models is costly. The result: LLMs give vague or incorrect answers for practical HPC queries.

Main Contribution

HyCE: embed descriptive command texts, retrieve matching commands, execute vetted commands, and include outputs in RAG context.

An automated evaluation pipeline where an LLM generates and filters synthetic HPC Q&A and serves as a judge for RAG answers.
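The HyCE flow listed above (embed command descriptions, retrieve the best match, run only vetted commands, append the output to the RAG context) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, the Slurm-style entries in `COMMANDS` assume a Slurm cluster, and `retrieve_command`, `build_context`, `min_score`, and the injectable `runner` are hypothetical names.

```python
import math
import re
import shlex
import subprocess
from collections import Counter

# Hypothetical vetted commands: description (the text that gets embedded) -> command line.
COMMANDS = {
    "show the jobs currently queued or running for this user": "squeue --me",
    "show disk quota and storage usage for this user": "quota -s",
    "show partitions and node availability on the cluster": "sinfo",
}

def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_command(query, min_score=0.2):
    """Match the query against command descriptions, not the raw command strings."""
    q = embed(query)
    score, desc = max((cosine(q, embed(d)), d) for d in COMMANDS)
    return (desc, COMMANDS[desc]) if score >= min_score else None

def default_runner(command):
    # No shell is involved: the command is split into argv, so shell operators are inert.
    result = subprocess.run(shlex.split(command), capture_output=True, text=True, timeout=10)
    return result.stdout

def build_context(query, doc_chunks, runner=default_runner):
    """Return the RAG context, with live command output appended when a command matches."""
    context = list(doc_chunks)
    match = retrieve_command(query)
    if match:
        desc, command = match
        context.append(f"Output of `{command}` ({desc}):\n{runner(command)}")
    return "\n\n".join(context)
```

Passing a stub as `runner` lets the retrieval and context assembly be exercised on a machine without a scheduler; only commands already present in `COMMANDS` can ever reach the runner.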

Key Findings

Adding HyCE to a baseline RAG raised the automatic evaluation score.

Numbers: 77.67% → 82.33% (Δ +4.66%)

Practical Use: If you add HyCE, expect modest but measurable improvement in RAG answers on evaluated HPC Q&A.

Evidence Ref: Table 2

Further pipeline improvements (better retrieval/re-rank + CoT) increased scores to 86%.

Numbers: 82.33% → 86% (Δ +3.67% from HyCE baseline)

Practical Use: HyCE is complementary: improving retrieval and prompts stacks additional gains on top of HyCE.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Automatic RAG Eval Score (baseline)	77.67%	—	—	synthetic_hpc_qas (100 pairs)	Table 2 reports baseline RAG score	Table 2
Automatic RAG Eval Score (+HyCE)	82.33%	77.67%	+4.66%	synthetic_hpc_qas (100 pairs)	Table 2 shows HyCE improves baseline by 4.66%	Table 2
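The deltas reported here and under Key Findings are differences between two percentages, so they are percentage points rather than relative gains. A short sketch of the arithmetic, using only the three scores quoted in this summary (the helper name `deltas` is illustrative):

```python
def deltas(before, after):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = round(after - before, 2)
    relative = round((after - before) / before * 100, 2)
    return absolute, relative

baseline, with_hyce, full_pipeline = 77.67, 82.33, 86.00

print("HyCE over baseline RAG:", deltas(baseline, with_hyce))
print("Retrieval/re-rank + CoT over HyCE:", deltas(with_hyce, full_pipeline))
print("Cumulative over baseline RAG:", deltas(baseline, full_pipeline))
```

In relative terms the HyCE step is about a 6% improvement over the baseline score, which is worth keeping separate from the 4.66-point absolute figure when comparing against other reports.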

What To Try In 7 Days

Prototype HyCE on a mirror environment: index command descriptions and a small doc set.

Whitelist a small set of safe commands and containerize execution.

Generate ~100 synthetic Q&A with the provided prompts and run the automated eval to measure baseline vs HyCE.
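For the baseline-versus-HyCE measurement in the last step, a scoring harness might look like the sketch below. It assumes the judge returns two binary verdicts per answer (correctness and faithfulness, as listed under Metrics) and that the score is the plain mean of all verdicts; the exact aggregation used in the paper is not stated in this summary, so treat that as an assumption. `answer_fn` and `judge_fn` are hypothetical callables wrapping the RAG system and the LLM judge.

```python
from statistics import mean

def eval_score(qa_pairs, answer_fn, judge_fn):
    """Percentage of positive binary verdicts across all Q&A pairs (assumed aggregation)."""
    verdicts = []
    for question, reference in qa_pairs:
        answer = answer_fn(question)
        # judge_fn stands in for an LLM-as-judge call returning (correct, faithful).
        correct, faithful = judge_fn(question, reference, answer)
        verdicts += [int(correct), int(faithful)]
    return 100 * mean(verdicts)

def compare(qa_pairs, baseline_fn, hyce_fn, judge_fn):
    """Score both systems on the same Q&A set and report the gap in percentage points."""
    base = eval_score(qa_pairs, baseline_fn, judge_fn)
    hyce = eval_score(qa_pairs, hyce_fn, judge_fn)
    return {"baseline": base, "hyce": hyce, "delta_points": hyce - base}
```

Scoring both systems against the same pairs and the same judge keeps the delta attributable to the pipeline change rather than to a different question sample.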

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Yusuke710/llm_rag_eval_hpc

Risks & Boundaries

Limitations

Evaluation relies on synthetic Q&A derived from provided chunks; does not measure generalization beyond those chunks.

Potential hallucinations for queries outside indexed docs or command outputs.

When Not To Use

When you cannot safely whitelist and validate commands.

When user queries frequently require knowledge outside your documentation or command outputs.

Failure Modes

Hallucinations for out-of-chunk questions.

Incorrect or unsafe actions if the command whitelist is incomplete or mis-specified.
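One way to reduce the whitelist failure mode is to validate both the executable and its arguments before anything runs. The sketch below is an illustration under assumptions, not the paper's security layer: the `ALLOWED` table, its Slurm flags, and the `validate` helper are hypothetical, and container isolation and privilege restriction would still sit around it.

```python
import shlex

# Hypothetical allowlist: executable -> the only flags it may be called with.
ALLOWED = {
    "squeue": {"--me", "--long"},
    "sinfo": {"--summarize"},
}

# Characters that enable chaining, substitution, or redirection in a shell.
FORBIDDEN_CHARS = set(";|&$`<>\\\n")

def validate(command_line):
    """Return argv if the command passes every check, otherwise raise ValueError."""
    if FORBIDDEN_CHARS & set(command_line):
        raise ValueError("shell metacharacters are not allowed")
    argv = shlex.split(command_line)  # raises ValueError on unbalanced quotes
    if not argv or argv[0] not in ALLOWED:
        raise ValueError("executable is not on the allowlist")
    extra = [arg for arg in argv[1:] if arg not in ALLOWED[argv[0]]]
    if extra:
        raise ValueError(f"arguments not permitted: {extra}")
    return argv
```

Checking arguments as well as the executable matters because an allowed binary called with an unexpected flag (for example, querying another user's jobs) can still leak information; rejecting by default keeps a mis-specified entry from silently widening access.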

Core Entities

Models

nvidia/llama-3.2-nv-embedqa-1b-v1

sentence-transformers/multi-qa-MiniLM-L6-cos-v1

nvidia/llama-3.2-nv-rerankqa-1b-v1

cross-encoder/ms-marco-MiniLM-L-12-v2

meta/llama-3.1-405b-instruct

gpt-4o-2024-08-06

Metrics

Automatic RAG Eval Score (%)

Semantic similarity (cross-encoder scores)

Binary Correctness and Faithfulness scores

Datasets

synthetic_hpc_qas (100 pairs: 90 docs, 10 commands)
