HyCE: run validated HPC commands inside RAG so an LLM answers user-specific cluster questions

December 9, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

HyCE is a practical, incremental method that improves RAG by adding live command outputs; evidence is from a synthetic 100-pair evaluation and one on-prem cluster test.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yusuke Miyashita, Patrick Kin Man Tung, Johan Barthélemy

Why It Matters For Business

HyCE reduces user confusion and support load by letting an LLM provide live, user-specific cluster answers without expensive model fine-tuning.

Summary TLDR

This paper introduces HyCE (Hypothetical Command Embeddings), an extension to Retrieval-Augmented Generation (RAG) that embeds command descriptions, executes validated shell commands, and feeds their outputs to an LLM so that answers reflect the user's real HPC environment. On synthetic HPC Q&A, HyCE raises the automatic RAG evaluation score from 77.67% to 82.33% (+4.66 percentage points). It also improves semantic matching between user queries and command descriptions, and it includes layered security (command whitelists, containers, restricted privileges). The code is open-sourced for prototype deployment.

Problem Statement

HPC users need precise, real-time answers about their specific cluster (available GPUs, job status, software). Standard RAG pulls static docs but cannot access live, user-specific command outputs. Fine-tuning models is costly. The result: LLMs give vague or incorrect answers for practical HPC queries.

Main Contribution

HyCE: embed descriptive command texts, retrieve matching commands, execute vetted commands, and include outputs in RAG context.

An automated evaluation pipeline where an LLM generates and filters synthetic HPC Q&A and serves as a judge for RAG answers.
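The first contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the command table assumes a Slurm-style scheduler, the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the names `COMMANDS`, `retrieve_command`, `run_command`, and `build_context` are hypothetical.

```python
import math
import re
import shlex
import subprocess
from collections import Counter

# Hypothetical whitelist: command description -> fixed command line.
# These entries are illustrative assumptions, not taken from the paper.
COMMANDS = {
    "show the jobs currently queued or running for this user": "squeue --me",
    "list available GPU nodes and partition state": "sinfo --Node --long",
    "report disk usage of mounted file systems": "df -h",
}

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_command(query, threshold=0.2):
    """Match the query against command descriptions, not raw command strings."""
    q = embed(query)
    score, description = max((cosine(q, embed(d)), d) for d in COMMANDS)
    return (description, COMMANDS[description]) if score >= threshold else None

def run_command(command, timeout=10):
    """Run a whitelisted command without a shell and return its stdout."""
    if command not in COMMANDS.values():
        raise PermissionError(f"not whitelisted: {command}")
    result = subprocess.run(shlex.split(command), capture_output=True, text=True, timeout=timeout)
    return result.stdout

def build_context(query, doc_chunks, runner=run_command):
    """Merge retrieved doc chunks with live command output for the LLM prompt."""
    parts = list(doc_chunks)
    match = retrieve_command(query)
    if match:
        description, command = match
        parts.append(f"[{description}]\n$ {command}\n{runner(command)}")
    return "\n\n".join(parts)
```

The `runner` parameter is injectable so the retrieval and context-building logic can be exercised on a machine without a scheduler, for example `build_context("which of my jobs are running", ["doc chunk"], runner=lambda c: "JOBID 1\n")`.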

Key Findings

Adding HyCE to a baseline RAG raised the automatic evaluation score.

Numbers: 77.67% to 82.33% (+4.66%)

Practical Use: If you add HyCE, expect modest but measurable improvement in RAG answers on evaluated HPC Q&A.

Evidence Ref: Table 2

Further pipeline improvements (better retrieval and re-ranking, plus chain-of-thought prompting) raised the score from 82.33% to 86%.

Numbers: 82.33% to 86% (+3.67% from HyCE baseline)

Practical Use: HyCE is complementary: improving retrieval and prompts stacks additional gains on top of HyCE.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Automatic RAG Eval Score (baseline) | 77.67% | - | - | synthetic_hpc_qas (100 pairs) | Table 2 reports baseline RAG score | Table 2
Automatic RAG Eval Score (+HyCE) | 82.33% | 77.67% | +4.66% | synthetic_hpc_qas (100 pairs) | Table 2 shows HyCE improves baseline by 4.66% | Table 2
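The reported deltas are differences between two percentages, so they are gains in percentage points rather than relative improvements. The short sketch below separates the two readings using only the scores quoted above; the helper name `deltas` is hypothetical.

```python
def deltas(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)

# Baseline RAG vs. RAG + HyCE
print(deltas(77.67, 82.33))
# RAG + HyCE vs. the further-improved pipeline
print(deltas(82.33, 86.0))
```

Under this reading, the +4.66 point gain from HyCE corresponds to roughly a 6% relative improvement over the baseline score.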

What To Try In 7 Days

Prototype HyCE on a mirror environment: index command descriptions and a small doc set.

Whitelist a small set of safe commands and containerize execution.

Generate ~100 synthetic Q&A with the provided prompts and run the automated eval to measure baseline vs HyCE.
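The third step can be sketched as a small harness that scores two answer functions against the same Q&A pairs. This is an assumption-laden outline, not the paper's pipeline: `answer_fn` and `judge_fn` are hypothetical callables that would wrap your RAG system and an LLM judge returning a binary correctness verdict.

```python
from typing import Callable, Iterable

QAPair = tuple[str, str]  # (question, reference answer)
AnswerFn = Callable[[str], str]
JudgeFn = Callable[[str, str, str], bool]  # (question, reference, candidate) -> correct?

def evaluate(pairs: Iterable[QAPair], answer_fn: AnswerFn, judge_fn: JudgeFn) -> float:
    """Percentage of questions whose answer the judge marks as correct."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no Q&A pairs to evaluate")
    correct = sum(1 for question, reference in pairs
                  if judge_fn(question, reference, answer_fn(question)))
    return 100.0 * correct / len(pairs)

def compare(pairs: Iterable[QAPair], baseline_fn: AnswerFn,
            hyce_fn: AnswerFn, judge_fn: JudgeFn) -> dict:
    """Score both systems on the same pairs and report the gain in percentage points."""
    pairs = list(pairs)
    baseline = evaluate(pairs, baseline_fn, judge_fn)
    hyce = evaluate(pairs, hyce_fn, judge_fn)
    return {"baseline": baseline, "hyce": hyce, "delta_pp": hyce - baseline}
```

Keeping the judge behind a plain callable makes it easy to start with a cheap stand-in (for example, substring matching) and swap in an LLM judge once the rest of the loop works.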

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation relies on synthetic Q&A derived from provided chunks; does not measure generalization beyond those chunks.

Potential hallucinations for queries outside indexed docs or command outputs.

When Not To Use

When you cannot safely whitelist and validate commands.

When user queries frequently require knowledge outside your documentation or command outputs.

Failure Modes

Hallucinations for out-of-chunk questions.

Incorrect or unsafe actions if the command whitelist is incomplete or mis-specified.
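One way to limit the second failure mode is to validate at the argument level and deny by default, so that anything the policy does not explicitly describe is rejected. The sketch below is illustrative only: the `ALLOWED` policy assumes Slurm-style commands and is not the paper's whitelist, and the `validate` helper is hypothetical.

```python
import re
import shlex

# Hypothetical policy: executable -> pattern that every argument must fully match.
ALLOWED = {
    "squeue": re.compile(r"--me|--states=[A-Z,]+"),
    "sinfo": re.compile(r"--Node|--long|--partition=[\w-]+"),
}

def validate(command: str) -> list[str]:
    """Parse a command line and reject anything outside the allow-list policy.

    Returns an argv list suitable for subprocess.run(argv) with no shell involved.
    """
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        raise PermissionError(f"executable not allowed: {command!r}")
    pattern = ALLOWED[argv[0]]
    for arg in argv[1:]:
        # fullmatch means trailing metacharacters such as ';' cause a rejection
        if not pattern.fullmatch(arg):
            raise PermissionError(f"argument not allowed: {arg!r}")
    return argv
```

Because the returned list is meant to be executed without a shell, an input such as `squeue --me; rm -rf /` fails validation on the `--me;` token instead of being interpreted as two commands.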

Core Entities

Models

nvidia/llama-3.2-nv-embedqa-1b-v1

sentence-transformers/multi-qa-MiniLM-L6-cos-v1

nvidia/llama-3.2-nv-rerankqa-1b-v1

cross-encoder/ms-marco-MiniLM-L-12-v2

meta/llama-3.1-405b-instruct

gpt-4o-2024-08-06

Metrics

Automatic RAG Eval Score (%)

Semantic similarity (cross-encoder scores)

Binary Correctness and Faithfulness scores

Datasets

synthetic_hpc_qas (100 pairs: 90 docs, 10 commands)