Overview
HyCE is a practical, incremental method that improves RAG by adding live command outputs; evidence is from a synthetic 100-pair evaluation and one on-prem cluster test.
Citations: 1
Evidence Strength: 0.60
Confidence: 0.70
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
HyCE reduces user confusion and support load by letting an LLM provide live, user-specific cluster answers without expensive model fine-tuning.
Summary TLDR
This paper introduces HyCE (Hypothetical Command Embeddings), an extension to Retrieval-Augmented Generation (RAG) that embeds command descriptions, executes validated shell commands, and feeds their outputs to an LLM so that answers reflect the user's real HPC environment. On synthetic HPC Q&A, HyCE raises the automatic RAG evaluation score from 77.67% to 82.33% (+4.66 percentage points). It also improves semantic matching of user questions to command descriptions and includes layered security (command whitelists, containers, restricted privileges). The code is open-sourced for prototype deployment.
Problem Statement
HPC users need precise, real-time answers about their specific cluster (available GPUs, job status, software). Standard RAG pulls static docs but cannot access live, user-specific command outputs. Fine-tuning models is costly. The result: LLMs give vague or incorrect answers for practical HPC queries.
Main Contribution
HyCE: embed descriptive command texts, retrieve matching commands, execute vetted commands, and include outputs in RAG context.
An automated evaluation pipeline where an LLM generates and filters synthetic HPC Q&A and serves as a judge for RAG answers.
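The first contribution can be sketched as a small retrieval step. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, the command catalogue (including the Slurm commands) is a hypothetical example, and `run` is an injected callable so that execution policy stays outside the retrieval logic.

```python
import math
import re
from collections import Counter

# Hypothetical catalogue: command description -> argv.
# The specific commands are illustrative assumptions, not taken from the paper.
COMMANDS = {
    "show the jobs currently queued or running for this user": ["squeue", "--me"],
    "list cluster partitions and the state of their nodes": ["sinfo"],
    "show disk usage of the home directory": ["du", "-sh", "."],
}

def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_command(question, commands=COMMANDS):
    """Return (description, argv, score) for the best-matching description."""
    q = embed(question)
    best = max(commands, key=lambda desc: cosine(q, embed(desc)))
    return best, commands[best], cosine(q, embed(best))

def build_context(question, doc_chunks, run):
    """Append the live output of the retrieved command to the static doc chunks.

    `run` is a caller-supplied function (argv -> str); it is assumed to
    enforce the whitelist and sandboxing before executing anything.
    """
    _, argv, _ = retrieve_command(question)
    output = run(argv)
    return "\n\n".join(list(doc_chunks) + [f"$ {' '.join(argv)}\n{output}"])
```

The question is matched against the command descriptions rather than the raw command strings, which is the part of the design the summary attributes to HyCE.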
Key Findings
Adding HyCE to a baseline RAG raised the automatic evaluation score from 77.67% to 82.33%.
Further pipeline improvements (better retrieval and re-ranking, plus chain-of-thought prompting) raised the score to 86%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Automatic RAG Eval Score (baseline) | 77.67% | — | — | synthetic_hpc_qas (100 pairs) | Table 2 reports baseline RAG score | Table 2 |
| Automatic RAG Eval Score (+HyCE) | 82.33% | 77.67% | +4.66 pp | synthetic_hpc_qas (100 pairs) | Table 2 shows HyCE improves on the baseline by 4.66 percentage points | Table 2 |
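The delta in the table is an absolute difference between two percentages. As a quick check on the arithmetic, the snippet below recomputes it from the two reported scores and also derives the relative improvement, which is not reported in the summary and is shown here only for comparison.

```python
baseline = 77.67  # baseline RAG score in percent (Table 2)
hyce = 82.33      # score with HyCE in percent (Table 2)

# Absolute difference, in percentage points.
delta_pp = round(hyce - baseline, 2)

# Relative improvement over the baseline, in percent (derived, not reported).
relative_pct = round(100 * (hyce - baseline) / baseline, 2)

print(f"absolute: +{delta_pp} pp, relative: +{relative_pct}%")
```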
What To Try In 7 Days
Prototype HyCE on a mirror environment: index command descriptions and a small doc set.
Whitelist a small set of safe commands and containerize execution.
Generate ~100 synthetic Q&A with the provided prompts and run the automated eval to measure baseline vs HyCE.
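The baseline-versus-HyCE comparison in the last step could be wired up roughly as follows. This is a sketch under assumptions: `answer_fn` and `judge_fn` are hypothetical callables standing in for the RAG pipeline and the LLM judge, and the judge is assumed to return a score between 0 and 1 per answer.

```python
from statistics import mean

def evaluate(qa_pairs, answer_fn, judge_fn):
    """Mean judge score in percent over (question, reference) pairs.

    judge_fn(question, reference, answer) is assumed to return a float in [0, 1].
    """
    scores = [judge_fn(q, ref, answer_fn(q)) for q, ref in qa_pairs]
    return 100 * mean(scores)

def compare(qa_pairs, baseline_fn, hyce_fn, judge_fn):
    """Score both pipelines on the same pairs and report the difference."""
    base = evaluate(qa_pairs, baseline_fn, judge_fn)
    hyce = evaluate(qa_pairs, hyce_fn, judge_fn)
    return {"baseline": base, "hyce": hyce, "delta_pp": hyce - base}
```

Running both pipelines against the same Q&A pairs and the same judge keeps the difference attributable to the pipeline change rather than to the evaluation setup.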
Risks & Boundaries
Limitations
The evaluation relies on synthetic Q&A derived from the provided document chunks, so it does not measure generalization beyond those chunks.
Potential hallucinations for queries outside indexed docs or command outputs.
When Not To Use
When you cannot safely whitelist and validate commands.
When user queries frequently require knowledge outside your documentation or command outputs.
Failure Modes
Hallucinations for out-of-chunk questions.
Incorrect or unsafe actions if the command whitelist is incomplete or mis-specified.
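One way to narrow the whitelist failure mode is to allow only exact, pre-approved argument vectors rather than command names or patterns. The sketch below assumes a hypothetical allowlist of Slurm commands and is not the paper's validation layer; container isolation and restricted privileges would still be needed around it.

```python
import shlex
import subprocess

# Hypothetical allowlist of exact argument vectors; anything else is rejected.
ALLOWED_COMMANDS = {
    ("squeue", "--me"),
    ("sinfo",),
    ("sinfo", "--summarize"),
}

def validate(command_line):
    """Return argv if the full command line is on the allowlist, else raise ValueError."""
    argv = tuple(shlex.split(command_line))
    if argv not in ALLOWED_COMMANDS:
        raise ValueError(f"command not on allowlist: {command_line!r}")
    return list(argv)

def run_allowed(command_line, timeout_s=10):
    """Validate, then execute without a shell so metacharacters are never interpreted."""
    argv = validate(command_line)
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s, check=False
    )
    return result.stdout
```

Matching whole argument vectors means an input such as `sinfo; rm -rf /` fails validation instead of being partially accepted, at the cost of having to enumerate every permitted variant.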

