Open-source, low-cost platform that secures RAG chatbots for small businesses using k3s clusters and layered prompt-defences

January 21, 20267 min

Overview

Production Readiness

0.78

Novelty Score

0.48

Cost Impact Score

0.8

Citation Count

0

Authors

Jiazhu Xie, Bowen Li, Heyu Fu, Chong Gao, Ziqi Xu, Fengling Han

Links

Abstract / PDF

Why It Matters For Business

Small businesses can run secure, low-cost RAG chatbots on commodity hardware while keeping strong tenant isolation and practical defenses against prompt injection.

Summary TLDR

This paper presents an open-source, multi-tenant platform for small businesses to deploy retrieval-augmented chatbots on low-cost, distributed k3s clusters. Security is handled at the platform level with container isolation, PII screening, guard prompts, and a pre-generation detector (GenTel-Shield). In an e-commerce case study, guard prompts alone give near-100% recall; GenTel-Shield achieves high precision (99.51%) and moderate recall (81.6%); combined defenses reach ~100% recall and ~99.8% F1. The k3s private cloud matched or reduced latency versus bare-metal for evaluated LLMs.

Problem Statement

Small businesses lack budget and engineering staff to run cloud GPU fleets and to harden RAG chatbots against prompt injection and data leakage; we need a low-cost, deployable platform that enforces tenant isolation and practical prompt-injection defenses without retraining models.

Main Contribution

An open-source, multi-tenant platform built on lightweight k3s clusters and an encrypted overlay network for small-business LLM deployments.

A layered, platform-level prompt-injection mitigation combining system-level guard prompts and the GenTel-Shield detector that avoids model retraining.

A real-world e-commerce case study showing security effectiveness and inference latency comparisons between bare-metal and k3s private-cloud deployments.

Key Findings

Guard prompts block prompt-injection attacks almost perfectly in the case study.

NumbersRecall 99.6–100%, F1 ~100% (Table 1)

GenTel-Shield provides model-agnostic detection with high precision but misses some attacks.

NumbersPrecision 99.51%, Recall 81.6%, F1 ~89.7% (Table 1)

Combining guard prompts and GenTel-Shield gives the most robust defense.

NumbersRecall 100%, F1 ~99.8% across models (Table 1)

k3s-based private cloud lowered end-to-end inference latency versus bare-metal in the case study.

NumbersLatency reduced by ~28%–60% depending on model (Table 2)

Base LLM safety alone is insufficient to stop prompt-injection attacks.

NumbersPure LLM recall 0.4–1.2% with perfect precision (Table 1)

Results

Guard Prompts recall

Value99.6–100%

BaselinePure LLM

GenTel-Shield precision / recall / F1

Value99.51% / 81.6% / ~89.7%

BaselinePure LLM

Combined defenses (Guard + GenTel-Shield) F1

Value~99.8% (F1)

BaselinePure LLM

End-to-end latency (GPT-4.1-mini)

ValueBare-metal 338.90s → Private cloud 243.62s

BaselineBare-metal

End-to-end latency (GPT-4.1)

ValueBare-metal 447.60s → Private cloud 242.98s

BaselineBare-metal

End-to-end latency (Ministral-3B)

ValueBare-metal 645.98s → Private cloud 246.22s

BaselineBare-metal

Who Should Care

What To Try In 7 Days

Run a k3s demo cluster on spare machines to test private-cloud deployment.

Add simple system-level guard prompts to existing LLM prompts and test with known injection samples.

Integrate a model-agnostic pre-generation detector (e.g., GenTel-Shield) and measure false positives and missed attacks.

Optimization Features

Infra Optimization

  • k3s lightweight Kubernetes
  • Pooling commodity hardware to reduce cost

System Optimization

  • Encrypted overlay networking
  • Multi-tenant container isolation

Inference Optimization

  • GPU-aware scheduling
  • Distributed inference across heterogeneous nodes
  • Containerised workload placement

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Guard prompts require manual, scenario-specific tuning and may not generalise to new domains.
  • GenTel-Shield misses some attacks (recall ~81.6%) when used alone.
  • Case study is a single-tenant e-commerce deployment; results may vary for other industries and scales.
  • Operational setup assumes some on-site hardware and network reliability; misconfiguration can widen blast radius.

When Not To Use

  • High-stakes autonomous decision systems where full formal verification is required.
  • Unrestricted creative generation tasks where prompt constraints would block desired outputs.
  • Contexts with strict legal requirements that prohibit local data pooling or shared multi-tenant infrastructure.

Failure Modes

  • Detector false negatives allow obfuscated injection to reach the model.
  • Guard prompts can be bypassed by novel or domain-specific obfuscation without prompt updates.
  • Cross-tenant leakage if container isolation or access controls are misconfigured.
  • Network instability can cause latency spikes that hurt user experience.

Core Entities

Models

  • GPT-4.1
  • GPT-4.1-mini
  • Ministral-3B
  • GenTel-Shield (detector)

Metrics

  • Precision
  • Recall
  • F1
  • End-to-end inference latency (s)

Datasets

  • Customer Support queries (ATS)
  • GenTel-Safe prompt-injection attack dataset (adapted)

Benchmarks

  • GenTel-Safe