Overview
Production Readiness: 0.6
Novelty Score: 0.6
Cost Impact Score: 0.5
Citation Count: 1
Why It Matters For Business
HyCE reduces user confusion and support load by letting an LLM provide live, user-specific cluster answers without expensive model fine-tuning.
Summary TLDR
This paper introduces HyCE (Hypothetical Command Embeddings), an extension to Retrieval-Augmented Generation (RAG) that embeds command descriptions, executes validated shell commands, and feeds their outputs to an LLM so that answers reflect the user's real HPC environment. HyCE raises the automatic RAG evaluation score from 77.67% to 82.33% (+4.66 percentage points) on synthetic HPC Q&A, improves semantic matching by comparing queries to command descriptions, and includes layered security (command whitelists, containers, restricted privileges). The code is open-sourced for prototype deployment.
Problem Statement
HPC users need precise, real-time answers about their specific cluster (available GPUs, job status, software). Standard RAG pulls static docs but cannot access live, user-specific command outputs. Fine-tuning models is costly. The result: LLMs give vague or incorrect answers for practical HPC queries.
Main Contribution
HyCE: embed descriptive command texts, retrieve matching commands, execute vetted commands, and include outputs in RAG context.
An automated evaluation pipeline where an LLM generates and filters synthetic HPC Q&A and serves as a judge for RAG answers.
Empirical test on an on-prem HPC cluster (Katana) showing HyCE improves automatic eval scores and semantic matching.
Security design: command whitelists, user-level privileges, containerization, and local-hosting options.
Open-source release of the pipeline and prompts to enable reproduction and prototyping.
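The retrieve, execute, and augment steps listed above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words `embed` function stands in for a real embedding model, the `COMMANDS` catalogue and its placeholder commands are hypothetical, and every function name is invented for this example.

```python
import math
import re
import subprocess
import sys
from collections import Counter

# Hypothetical catalogue mapping a natural-language description to a vetted
# command (argv list). The commands are placeholders that only print text;
# a real deployment would list site-approved cluster commands instead.
COMMANDS = {
    "List the GPU types available on the cluster":
        [sys.executable, "-c", "print('placeholder GPU listing')"],
    "Show the status of my queued and running jobs":
        [sys.executable, "-c", "print('placeholder job table')"],
}


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_description(query, catalogue):
    """Match the query against command descriptions, not raw command strings."""
    query_vec = embed(query)
    return max(catalogue, key=lambda desc: cosine(query_vec, embed(desc)))


def run_vetted(argv, timeout=10):
    """Run a catalogue command without a shell and return its stdout."""
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=False
    )
    return result.stdout.strip()


def build_context(query, catalogue, doc_chunks, runner=run_vetted):
    """Append the live command output to the retrieved document chunks."""
    description = retrieve_description(query, catalogue)
    output = runner(catalogue[description])
    parts = list(doc_chunks)
    parts.append(f"Command purpose: {description}\nCommand output:\n{output}")
    return "\n\n".join(parts)
```

The string returned by `build_context` would then be placed in the LLM prompt alongside the user's question. The `runner` parameter is injectable so that execution can be swapped for a containerized or mocked backend.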
Key Findings
Adding HyCE to a baseline RAG raised the automatic evaluation score from 77.67% to 82.33%.
Further pipeline improvements (better retrieval and re-ranking, plus chain-of-thought prompting) raised the score to 86%.
Matching queries to command descriptions raises semantic similarity versus matching to raw commands.
HyCE enables concrete, user-specific answers (for example, reporting that 'V100 and A100' GPUs are available).
Results
Automatic RAG Eval Score (baseline): 77.67%
Automatic RAG Eval Score (+HyCE): 82.33%
Automatic RAG Eval Score (+HyCE + better retrieval & CoT): 86%
Who Should Care
What To Try In 7 Days
Prototype HyCE on a mirror environment: index command descriptions and a small doc set.
Whitelist a small set of safe commands and containerize execution.
Generate ~100 synthetic Q&A with the provided prompts and run the automated eval to measure baseline vs HyCE.
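The baseline-versus-HyCE comparison in the last step could be scored along these lines. This is a sketch under stated assumptions: the summary does not say how individual judge verdicts are aggregated, so the score here is simply the percentage of answers the judge accepts, and `answer_fn` and `judge_fn` are hypothetical callables standing in for the RAG pipeline and the LLM judge.

```python
def percent_passed(verdicts):
    """Percentage of True verdicts (assumed aggregation, not the paper's formula)."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return 100.0 * sum(verdicts) / len(verdicts)


def evaluate(qa_pairs, answer_fn, judge_fn):
    """Score a pipeline on (question, reference_answer) pairs.

    answer_fn(question) -> str is the pipeline under test (hypothetical).
    judge_fn(question, reference, answer) -> bool is the judge (hypothetical).
    """
    verdicts = [
        judge_fn(question, reference, answer_fn(question))
        for question, reference in qa_pairs
    ]
    return percent_passed(verdicts)
```

Running `evaluate` once with the baseline pipeline and once with the HyCE-augmented pipeline on the same Q&A pairs gives two directly comparable percentages.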
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation relies on synthetic Q&A derived from provided chunks; does not measure generalization beyond those chunks.
- Potential hallucinations for queries outside indexed docs or command outputs.
- Hosting models locally for privacy requires substantial compute, which takes resources away from HPC workloads.
When Not To Use
- When you cannot safely whitelist and validate commands.
- When user queries frequently require knowledge outside your documentation or command outputs.
- When you lack compute to host models locally and must avoid sending sensitive data externally.
Failure Modes
- Hallucinations for out-of-chunk questions.
- Incorrect or unsafe actions if the command whitelist is incomplete or mis-specified.
- Stale documentation or command outputs leading to wrong advice.
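One way to narrow the whitelist failure mode is to accept only exact, pre-approved argument vectors and to never pass text through a shell. The sketch below illustrates that idea; the `ALLOWED` set and the `validate_command` name are assumptions made for this example, not the paper's whitelist.

```python
import shlex

# Hypothetical allowlist of exact argument vectors. Anything not listed
# verbatim is rejected, including approved commands with extra arguments.
ALLOWED = {
    ("hostname",),
    ("uptime",),
    ("nvidia-smi", "-L"),
}


def validate_command(command_line):
    """Return the argv list if it is allowlisted, else raise PermissionError."""
    try:
        argv = shlex.split(command_line)
    except ValueError as exc:  # e.g. unbalanced quotes
        raise PermissionError(f"unparseable command: {command_line!r}") from exc
    if tuple(argv) not in ALLOWED:
        raise PermissionError(f"command not allowlisted: {command_line!r}")
    return argv
```

Because the returned list is meant to be executed without a shell, characters such as `;` or `|` are never interpreted; they simply make the argument vector fail the exact-match check.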
Core Entities
Models
- nvidia/llama-3.2-nv-embedqa-1b-v1
- sentence-transformers/multi-qa-MiniLM-L6-cos-v1
- nvidia/llama-3.2-nv-rerankqa-1b-v1
- cross-encoder/ms-marco-MiniLM-L-12-v2
- meta/llama-3.1-405b-instruct
- gpt-4o-2024-08-06
Metrics
- Automatic RAG Eval Score (%)
- Semantic similarity (cross-encoder scores)
- Binary Correctness and Faithfulness scores
Datasets
- synthetic_hpc_qas (100 Q&A pairs: 90 from docs, 10 from commands)

