Overview
Production Readiness
0.8
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Sentinel materially reduces successful prompt injections on evaluated benchmarks, enabling safer prompt-driven products and lowering the risk of harmful or leaking responses.
Summary TLDR
Sentinel is a binary classifier built by fine-tuning answerdotai/ModernBERT-large to detect prompt-injection and jailbreak prompts. Trained on a large mix of open and private datasets (≈70% benign / 30% jailbreak) and evaluated on a held-out internal test set and four public benchmarks, Sentinel scores 0.987 accuracy and 0.980 F1 on the internal test set and an average F1 of 0.938 on public benchmarks. It outperforms a strong DeBERTa-v3 baseline by wide margins on evaluated sets. Limitations: private training data reduces reproducibility and highly novel attacks may evade detection.
Problem Statement
LLMs can be tricked by malicious inputs that hide instructions (prompt injection). Existing detectors often overfit to known attacks and fail on new or diverse jailbreaks. The paper aims to build a robust detector that generalizes across varied injection styles.
Main Contribution
- A production-ready detector (qualifire/prompt-injection-sentinel) fine-tuned from ModernBERT-large for binary jailbreak detection.
- A curated multi-source training corpus combining several open datasets and a private synthetic set, with a 70/30 benign-to-jailbreak composition.
- A comparative evaluation showing large gains over a strong DeBERTa-v3 baseline on an internal held-out set and four public benchmarks.
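The 70/30 benign-to-jailbreak composition can be illustrated with a small sampling routine. This is a minimal sketch, not the authors' data pipeline: `build_mix`, its parameters, and the subsampling strategy are assumptions chosen only to show how two pools might be combined at a fixed ratio.

```python
import random


def build_mix(benign, jailbreak, benign_share=0.7, seed=0):
    """Subsample two pools so the result is roughly `benign_share` benign.

    Hypothetical helper: returns a shuffled list of (text, label) pairs,
    with label 0 for benign and 1 for jailbreak.
    """
    rng = random.Random(seed)
    # Largest total size that both pools can support at the target ratio.
    total = int(min(len(benign) / benign_share,
                    len(jailbreak) / (1.0 - benign_share)))
    n_benign = min(round(total * benign_share), len(benign))
    n_jailbreak = min(total - n_benign, len(jailbreak))
    rows = [(text, 0) for text in rng.sample(benign, n_benign)]
    rows += [(text, 1) for text in rng.sample(jailbreak, n_jailbreak)]
    rng.shuffle(rows)
    return rows
```

Subsampling the larger pool keeps the ratio fixed without duplicating examples; oversampling the smaller pool would be an alternative with different trade-offs.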
Key Findings
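- High internal detection performance: 0.987 accuracy and 0.980 F1 on the held-out test set.
- Strong cross-benchmark generalization: average F1 of 0.938 across four public benchmarks.
- Low inference latency on tested hardware.
- The training mix (≈70% benign / 30% jailbreak) and split are documented, but the corpus includes private data.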
Results
Accuracy (internal): 0.987
F1 (internal): 0.980
Avg F1 (public benchmarks): 0.938
Latency per inference
Who Should Care
What To Try In 7 Days
- Run qualifire/prompt-injection-sentinel on your live prompt stream to measure current injection rates.
- Compare Sentinel's outputs to your existing filter (or the protectai/deberta-v3-base-prompt-injection-v2 baseline) and track false positives and false negatives separately.
- Log misclassified cases and add them to a retraining queue to reduce blind spots.
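The three steps above could be wired together roughly as follows. This is a sketch under stated assumptions: `make_hf_classifier` and `compare_detectors` are hypothetical helpers, the `positive_label` string is a guess that should be checked against each model's label names, and the comparison needs prompts that have already been labeled by hand.

```python
from collections import Counter


def make_hf_classifier(model_id, positive_label="jailbreak"):
    """Wrap a Hugging Face text-classification pipeline as prompt -> bool.

    `positive_label` is an assumption; inspect the model's config for the
    label names it actually emits.
    """
    from transformers import pipeline  # lazy import; downloads the model on first use

    pipe = pipeline("text-classification", model=model_id, truncation=True)

    def classify(prompt):
        return pipe(prompt)[0]["label"].lower() == positive_label.lower()

    return classify


def compare_detectors(samples, candidate, baseline):
    """Count errors for two detectors and queue the candidate's mistakes.

    samples: iterable of (prompt, is_attack) pairs with human labels.
    """
    counts = Counter()
    retrain_queue = []
    for prompt, is_attack in samples:
        for name, detector in (("candidate", candidate), ("baseline", baseline)):
            flagged = detector(prompt)
            if flagged and not is_attack:
                counts[f"{name}_fp"] += 1
            elif is_attack and not flagged:
                counts[f"{name}_fn"] += 1
            if name == "candidate" and flagged != is_attack:
                retrain_queue.append({"prompt": prompt, "label": is_attack})
    return counts, retrain_queue
```

In practice `candidate` might be `make_hf_classifier("qualifire/prompt-injection-sentinel")`, with the baseline built the same way using whatever positive label that model emits; any callable that maps a prompt to a boolean fits.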
Optimization Features
Inference Optimization
- Uses ModernBERT features like FlashAttention and unpadding for faster inference
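To make the unpadding idea concrete, here is a toy illustration in plain Python. It is not ModernBERT's implementation: `unpad`, `repad`, and the `pad_id` convention are assumptions used only to show how a padded batch can be flattened into one sequence plus cumulative sequence lengths, so that no compute is spent on pad tokens.

```python
def unpad(batch, pad_id=0):
    """Flatten a padded batch into one token list plus cumulative lengths."""
    flat, cu_seqlens = [], [0]
    for row in batch:
        flat.extend(token for token in row if token != pad_id)
        cu_seqlens.append(len(flat))  # end offset of this sequence
    return flat, cu_seqlens


def repad(flat, cu_seqlens, pad_id=0):
    """Rebuild a rectangular batch from the flattened form."""
    rows = [flat[start:end] for start, end in zip(cu_seqlens, cu_seqlens[1:])]
    width = max((len(row) for row in rows), default=0)
    return [row + [pad_id] * (width - len(row)) for row in rows]
```

For a batch of three sequences padded to length 5 with true lengths 3, 2, and 5, the flattened form holds 10 tokens instead of 15, which is where the saving comes from.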
Reproducibility
Code Urls
Data Urls
- https://huggingface.co/datasets/alespalla/chatbot-instruction-prompts
- https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1
- https://huggingface.co/datasets/OpenSafetyLab/Salad-Data
- https://huggingface.co/datasets/deepset/prompt-injections
- https://huggingface.co/datasets/reshabhs/SPML-Chatbot-Prompt-Injection
- https://huggingface.co/datasets/jackhhao/jailbreak-classification
- https://huggingface.co/datasets/lmsys/toxic-chat
- https://huggingface.co/datasets/VMware/open-instruct
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- May miss highly novel attack vectors not represented in training data.
- Private dataset components prevent exact reproduction of training data and results.
- Error analysis found no clear recurring mistake patterns, making targeted fixes harder.
When Not To Use
- When you need provable or formally verified defenses against adversarial inputs.
- When your threat model expects entirely new attack families not in the training mix.
- When full reproducibility of training data is a strict requirement.
Failure Modes
- False positives: benign prompts with unusual formatting or with assertive or security-like phrasing.
- False negatives: subtle adversarial phrasing that differs from known patterns.
- Bias toward attack types seen in the training corpus.
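If the detector exposes a score rather than only a label, part of the trade-off between false positives and false negatives comes down to where the decision threshold sits. The sketch below assumes such a score in [0, 1] and a labeled validation set; `pick_threshold` and `rates_at` are hypothetical helpers, not something the source describes.

```python
def pick_threshold(scores, labels, max_fpr=0.01):
    """Lowest threshold whose benign false-positive rate stays <= max_fpr.

    A prompt is flagged when its score is strictly above the threshold.
    """
    benign = sorted((s for s, y in zip(scores, labels) if not y), reverse=True)
    allowed = int(max_fpr * len(benign))  # benign prompts we tolerate flagging
    if allowed >= len(benign):
        return 0.0
    return benign[allowed]


def rates_at(scores, labels, threshold):
    """Return (false_positive_rate, recall) at the given threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
    n_benign = sum(1 for y in labels if not y)
    n_attack = sum(1 for y in labels if y)
    fpr = fp / n_benign if n_benign else 0.0
    recall = tp / n_attack if n_attack else 0.0
    return fpr, recall
```

Tightening `max_fpr` reduces false alarms on benign prompts at the cost of recall on attacks, so the value should be chosen against the product's tolerance for each kind of error.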
Core Entities
Models
- answerdotai/ModernBERT-large
- qualifire/prompt-injection-sentinel
- protectai/deberta-v3-base-prompt-injection-v2
Metrics
- Accuracy
- F1-score
- Recall
- Precision
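For reference, the four metrics listed above can be computed from binary predictions as follows. This is a minimal sketch using the standard definitions, with the jailbreak class treated as positive; `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

On an imbalanced mix such as 70% benign, accuracy alone can look strong even when recall on attacks is weak, which is why F1 is reported alongside it.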
Datasets
- OpenSafetyLab/Salad-Data
- alespalla/chatbot-instruction-prompts
- microsoft/orca-agentinstruct-1M-v1
- verazuo/jailbreak-llms
- lmsys/toxic-chat
- VMware/open-instruct
- reshabhs/SPML-Chatbot-Prompt-Injection
- qualifire-synthetics (private)
- qualifire/Qualifire-prompt-injection-benchmark
Benchmarks
- Internal held-out test set (10%)
- allenai/wildjailbreak
- jackhhao/jailbreak-classification
- deepset/prompt-injections
- qualifire/Qualifire-prompt-injection-benchmark
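
A 10% held-out test set like the internal one listed above could be carved out as sketched below. Stratifying by label is an assumption here (the source does not say how the split was drawn), and `stratified_holdout` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_holdout(rows, test_share=0.1, seed=0):
    """Split (text, label) rows into train/test, preserving the label ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * test_share)  # rows of this label held out
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test
```

Keeping the label ratio the same in both splits means the held-out metrics are measured on the same class balance the model was trained on.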

