Overview
Results are strong on the reported internal test set and public benchmarks, but reliance on private training data and unknown performance on novel attack styles lower certainty for unseen real-world inputs.
Citations0
Evidence Strength0.85
Confidence0.85
Risk Signals9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 80%
Novelty: 50%
Why It Matters For Business
Sentinel materially reduces successful prompt injections on evaluated benchmarks, enabling safer prompt-driven products and lowering the risk of harmful or leaking responses.
Who Should Care
Summary TLDR
Sentinel is a binary classifier built by fine-tuning answerdotai/ModernBERT-large to detect prompt-injection and jailbreak prompts. Trained on a large mix of open and private datasets (≈70% benign / 30% jailbreak) and evaluated on a held-out internal test set and four public benchmarks, Sentinel scores 0.987 accuracy and 0.980 F1 on the internal test set and an average F1 of 0.938 on public benchmarks. It outperforms a strong DeBERTa-v3 baseline by wide margins on evaluated sets. Limitations: private training data reduces reproducibility and highly novel attacks may evade detection.
Problem Statement
LLMs can be tricked by malicious inputs that hide instructions (prompt injection). Existing detectors often overfit to known attacks and fail on new or diverse jailbreaks. The paper aims to build a robust detector that generalizes across varied injection styles.
Main Contribution
A production-ready detector (qualifire/prompt-injection-sentinel) fine-tuned from ModernBERT-large for binary jailbreak detection.
A curated multi-source training corpus combining several open datasets and a private synthetic set, with a 70/30 benign-to-jailbreak composition.
Key Findings
High internal detection accuracy and F1.
Strong cross-benchmark generalization on public datasets.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| AvgAcc (internal) | 0.987 (Sentinel) | 0.848 (DeBERTa baseline) | +0.139 | Internal held-out test set (Table 1) | Sentinel AvgAcc 0.987 vs 0.848 for baseline | Table 1 |
| F1 (internal) | 0.980 (Sentinel) | 0.728 (DeBERTa baseline) | +0.252 | Internal held-out test set (Table 1) | Sentinel F1 0.980 vs 0.728 for baseline | Table 1 |
What To Try In 7 Days
Run qualifire/prompt-injection-sentinel on your live prompt stream to measure current injection rates.
Compare Sentinel's outputs to your existing filter (or protectai baseline) and track false positives vs false negatives.
Log mistaken cases and add new examples to a retraining queue to reduce blind spots.
Optimization Features
Inference Optimization
Reproducibility
Code URLs
Data URLs
Risks & Boundaries
Limitations
May miss highly novel attack vectors not represented in training data.
Private dataset components prevent exact reproduction of training data and results.
When Not To Use
When you need provable or formally verified defenses against adversarial inputs.
When your threat model expects entirely new attack families not in the training mix.
Failure Modes
False positives: unusual formatting, assertive or security-like phrasing in benign prompts.
False negatives: subtle adversarial phrasing that differs from known patterns.

