HyCE: run validated HPC commands inside RAG so an LLM answers user-specific cluster questions

December 9, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Yusuke Miyashita, Patrick Kin Man Tung, Johan Barthélemy

Links

Abstract / PDF

Why It Matters For Business

HyCE reduces user confusion and support load by letting an LLM provide live, user-specific cluster answers without expensive model fine-tuning.

Summary TLDR

This paper introduces HyCE (Hypothetical Command Embeddings), an extension to Retrieval-Augmented Generation (RAG) that embeds command descriptions, executes validated shell commands, and feeds their outputs to an LLM so answers reflect the user's real HPC environment. On synthetic HPC Q&A, HyCE raises the automatic RAG evaluation score from 77.67% to 82.33% (+4.66 percentage points). Matching queries against command descriptions rather than raw commands also improves semantic matching, and the design includes layered security (command whitelists, containers, restricted privileges). The code is open-sourced for prototype deployment.

Problem Statement

HPC users need precise, real-time answers about their specific cluster (available GPUs, job status, software). Standard RAG pulls static docs but cannot access live, user-specific command outputs. Fine-tuning models is costly. The result: LLMs give vague or incorrect answers for practical HPC queries.

Main Contribution

HyCE: embed descriptive command texts, retrieve matching commands, execute vetted commands, and include outputs in RAG context.

An automated evaluation pipeline where an LLM generates and filters synthetic HPC Q&A and serves as a judge for RAG answers.

Empirical test on an on-prem HPC cluster (Katana) showing HyCE improves automatic eval scores and semantic matching.

Security design: command whitelists, user-level privileges, containerization, and local-hosting options.

Open-source release of the pipeline and prompts to enable reproduction and prototyping.
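The retrieve, execute, and augment flow listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, the `COMMANDS` registry and its descriptions are hypothetical, and command execution is injected as a `run` callable so the sketch never touches a real cluster.

```python
import math
import re
from collections import Counter

# Hypothetical registry (assumption): description text -> vetted command.
# HyCE indexes the descriptions, not the raw command strings.
COMMANDS = {
    "List the GPU models available on the current node": "nvidia-smi -L",
    "List the batch jobs currently in the queue": "qstat",
    "Show disk usage of mounted file systems": "df -h",
}


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_command(query, registry=COMMANDS):
    """Return the (description, command) whose description best matches the query."""
    q = embed(query)
    description = max(registry, key=lambda d: cosine(q, embed(d)))
    return description, registry[description]


def build_context(query, run, doc_chunks=()):
    """Append the retrieved command's live output to the retrieved doc chunks."""
    description, command = retrieve_command(query)
    output = run(command)  # 'run' should only execute vetted commands
    parts = list(doc_chunks)
    parts.append(f"Command: {command}\nDescription: {description}\nOutput:\n{output}")
    return "\n\n".join(parts)
```

Passing a stub such as `run=lambda cmd: "GPU 0: Tesla V100"` is enough to inspect the resulting context before wiring in real execution.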

Key Findings

Adding HyCE to a baseline RAG raised the automatic evaluation score.

Numbers: 77.67% → 82.33% (Δ +4.66 percentage points)

Further pipeline improvements (better retrieval and re-ranking, plus chain-of-thought prompting) increased the score to 86%.

Numbers: 82.33% → 86% (Δ +3.67 percentage points over the HyCE-only score)

Matching queries to command descriptions raises semantic similarity versus matching to raw commands.

Numbers: stsb-roberta-large similarity, 0.0906 (raw commands) → 0.5285 (command descriptions)

HyCE enables concrete user-specific answers (example: reports 'V100 and A100' GPUs available).

Numbers: N/A (qualitative example)

Results

Automatic RAG Eval Score (baseline)

Value: 77.67%

Automatic RAG Eval Score (+HyCE)

Value: 82.33%

Baseline: 77.67%

Automatic RAG Eval Score (+HyCE + better retrieval & CoT)

Value: 86%

Baseline: 82.33%
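The gains reported here are differences between two percentages, so they are best read as percentage points rather than relative improvements. The short sketch below (the helper name `compare` is ours, not from the paper) computes both views from the reported scores: the HyCE step is +4.66 points, or about 6.0% relative, and the follow-up step is +3.67 points, or about 4.5% relative.

```python
def compare(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = improved - baseline
    relative = 100.0 * points / baseline
    return round(points, 2), round(relative, 1)


print(compare(77.67, 82.33))  # baseline RAG vs. +HyCE
print(compare(82.33, 86.0))   # +HyCE vs. +HyCE with better retrieval and CoT
```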

Who Should Care

What To Try In 7 Days

Prototype HyCE on a mirror environment: index command descriptions and a small doc set.

Whitelist a small set of safe commands and containerize execution.

Generate ~100 synthetic Q&A with the provided prompts and run the automated eval to measure baseline vs HyCE.
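For the whitelisting step, one possible starting point is to compare the fully tokenized command against an exact allowlist and to run it without a shell. The sketch below is an assumption about how this could look, not the paper's code: `ALLOWED_ARGVS`, `parse_if_allowed`, and `run_validated` are hypothetical names, and the optional `prefix` argument is where a container runtime invocation could be placed.

```python
import shlex
import subprocess

# Hypothetical allowlist (assumption): each entry is one exact argument vector.
ALLOWED_ARGVS = {
    ("nvidia-smi", "-L"),
    ("qstat",),
    ("df", "-h"),
}


def parse_if_allowed(command_line, allowed=ALLOWED_ARGVS):
    """Return the argv list if it exactly matches an allowlisted entry, else None."""
    try:
        argv = tuple(shlex.split(command_line))
    except ValueError:  # e.g. unbalanced quotes
        return None
    return list(argv) if argv in allowed else None


def run_validated(command_line, prefix=(), timeout=10):
    """Run an allowlisted command without a shell and return its stdout."""
    argv = parse_if_allowed(command_line)
    if argv is None:
        raise PermissionError(f"command not allowlisted: {command_line!r}")
    completed = subprocess.run(
        [*prefix, *argv],  # no shell=True, so ';' and '&&' are never interpreted
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return completed.stdout
```

Exact argv matching is deliberately strict: `"qstat; rm -rf /"` tokenizes to a vector that is not in the allowlist and is rejected before anything runs. Commands that take user-supplied arguments would need a separate per-argument validation rule.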

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation relies on synthetic Q&A derived from provided chunks; does not measure generalization beyond those chunks.
  • Potential hallucinations for queries outside indexed docs or command outputs.
  • Hosting the model locally for privacy requires substantial compute, which reduces the HPC resources left for users.

When Not To Use

  • When you cannot safely whitelist and validate commands.
  • When user queries frequently require knowledge outside your documentation or command outputs.
  • When you lack compute to host models locally and must avoid sending sensitive data externally.

Failure Modes

  • Hallucinations for out-of-chunk questions.
  • Incorrect or unsafe actions if the command whitelist is incomplete or mis-specified.
  • Stale documentation or command outputs leading to wrong advice.

Core Entities

Models

  • nvidia/llama-3.2-nv-embedqa-1b-v1
  • sentence-transformers/multi-qa-MiniLM-L6-cos-v1
  • nvidia/llama-3.2-nv-rerankqa-1b-v1
  • cross-encoder/ms-marco-MiniLM-L-12-v2
  • meta/llama-3.1-405b-instruct
  • gpt-4o-2024-08-06

Metrics

  • Automatic RAG Eval Score (%)
  • Semantic similarity (cross-encoder scores)
  • Binary Correctness and Faithfulness scores
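This summary does not state how the binary correctness and faithfulness judgments are combined into the percentage score. As one plausible aggregation, and purely as an assumption, the sketch below reports the share of passed binary checks across all judged answers; the `Judgment` type and `eval_score` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One LLM-judge verdict for a single RAG answer (hypothetical structure)."""
    correct: bool   # does the answer match the reference answer?
    faithful: bool  # is the answer supported by the retrieved context?


def eval_score(judgments):
    """Percentage of passed binary checks over all answers (assumed aggregation)."""
    if not judgments:
        raise ValueError("no judgments to aggregate")
    passed = sum(int(j.correct) + int(j.faithful) for j in judgments)
    return 100.0 * passed / (2 * len(judgments))
```

For example, two answers where one fails only the faithfulness check pass three of four checks, giving a score of 75.0.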

Datasets

  • synthetic_hpc_qas (100 pairs: 90 from documentation, 10 from commands)
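A dataset with this composition could be assembled with a generate-then-filter loop like the one below. It is a sketch under stated assumptions: `generate` and `keep` are placeholders for the LLM calls that write and filter Q&A pairs (their signatures are ours, not the paper's), and the record fields are illustrative.

```python
def build_dataset(doc_chunks, command_chunks, generate, keep,
                  n_docs=90, n_commands=10):
    """Collect filtered Q&A pairs per source until each quota is filled.

    generate(chunk) -> (question, answer)   # assumed LLM generation call
    keep(qa, chunk) -> bool                 # assumed LLM filtering call
    """
    def collect(chunks, source, limit):
        pairs = []
        for chunk in chunks:
            if len(pairs) == limit:
                break
            qa = generate(chunk)
            if keep(qa, chunk):
                pairs.append({
                    "source": source,
                    "chunk": chunk,
                    "question": qa[0],
                    "answer": qa[1],
                })
        return pairs

    return (collect(doc_chunks, "docs", n_docs)
            + collect(command_chunks, "commands", n_commands))
```

Keeping the source chunk with each pair lets the judge later check faithfulness against the text the question was derived from.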