Sentinel — a ModernBERT detector that flags prompt injections with ~98% F1 on internal tests

June 5, 20256 min

Overview

Decision SnapshotReady For Pilot

Results are strong on the reported internal test set and public benchmarks, but reliance on private training data and unknown performance on novel attack styles lower certainty for unseen real-world inputs.

Citations0

Evidence Strength0.85

Confidence0.85

Risk Signals9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 80%

Novelty: 50%

Authors

Dror Ivry, Oran Nahum

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Sentinel materially reduces successful prompt injections on evaluated benchmarks, enabling safer prompt-driven products and lowering the risk of harmful or leaking responses.

Who Should Care

Summary TLDR

Sentinel is a binary classifier built by fine-tuning answerdotai/ModernBERT-large to detect prompt-injection and jailbreak prompts. Trained on a large mix of open and private datasets (≈70% benign / 30% jailbreak) and evaluated on a held-out internal test set and four public benchmarks, Sentinel scores 0.987 accuracy and 0.980 F1 on the internal test set and an average F1 of 0.938 on public benchmarks. It outperforms a strong DeBERTa-v3 baseline by wide margins on evaluated sets. Limitations: private training data reduces reproducibility and highly novel attacks may evade detection.

Problem Statement

LLMs can be tricked by malicious inputs that hide instructions (prompt injection). Existing detectors often overfit to known attacks and fail on new or diverse jailbreaks. The paper aims to build a robust detector that generalizes across varied injection styles.

Main Contribution

A production-ready detector (qualifire/prompt-injection-sentinel) fine-tuned from ModernBERT-large for binary jailbreak detection.

A curated multi-source training corpus combining several open datasets and a private synthetic set, with a 70/30 benign-to-jailbreak composition.

Key Findings

High internal detection accuracy and F1.

NumbersAvgAcc 0.987, F1 0.980 on internal held-out test

Practical UseUse Sentinel to improve prompt-injection detection on similarly distributed inputs and datasets.

Evidence RefTable 1 (Internal Held-Out Test Set)

Strong cross-benchmark generalization on public datasets.

NumbersAvg public F1 0.938 vs baseline 0.709≈0.229)

Practical UseExpect better detection across multiple standard jailbreak benchmarks compared to the evaluated DeBERTa baseline.

Evidence RefTable 2 (Public Benchmarks)

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
AvgAcc (internal)0.987 (Sentinel)0.848 (DeBERTa baseline)+0.139Internal held-out test set (Table 1)Sentinel AvgAcc 0.987 vs 0.848 for baselineTable 1
F1 (internal)0.980 (Sentinel)0.728 (DeBERTa baseline)+0.252Internal held-out test set (Table 1)Sentinel F1 0.980 vs 0.728 for baselineTable 1

What To Try In 7 Days

Run qualifire/prompt-injection-sentinel on your live prompt stream to measure current injection rates.

Compare Sentinel's outputs to your existing filter (or protectai baseline) and track false positives vs false negatives.

Log mistaken cases and add new examples to a retraining queue to reduce blind spots.

Optimization Features

Inference Optimization
Uses ModernBERT features like FlashAttention and unpadding for faster inference

Reproducibility

Risks & Boundaries

Limitations

May miss highly novel attack vectors not represented in training data.

Private dataset components prevent exact reproduction of training data and results.

When Not To Use

When you need provable or formally verified defenses against adversarial inputs.

When your threat model expects entirely new attack families not in the training mix.

Failure Modes

False positives: unusual formatting, assertive or security-like phrasing in benign prompts.

False negatives: subtle adversarial phrasing that differs from known patterns.

Core Entities

Models

answerdotai/ModernBERT-largequalifire/prompt-injection-sentinelprotectai/deberta-v3-base-prompt-injection-v2

Metrics

AccuracyF1-scoreRecallPrecision

Datasets

OpenSafetyLab/Salad-Dataalespalla/chatbot-instruction-promptsmicrosoft/orca-agentinstruct-1M-v1verazuo/jailbreak-llmslmsys/toxic-chatVMware/open-instructreshabhs/SPML-Chatbot-Prompt-Injectionqualifire-synthetics (private)qualifire/Qualifire-prompt-injection-benchmark

Benchmarks

Internal held-out test set (10%)allenai/wildjailbreakjackhhao/jailbreak-classificationdeepset/prompt-injectionsqualifire/Qualifire-prompt-injection-benchmark