Sentinel — a ModernBERT detector that flags prompt injections with ~98% F1 on internal tests

Overview

Decision Snapshot: Ready For Pilot

Results are strong on the reported internal test set and public benchmarks, but reliance on private training data and unknown performance on novel attack styles lower certainty for unseen real-world inputs.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 80%

Novelty: 50%

Authors

Dror Ivry, Oran Nahum


Why It Matters For Business

Sentinel materially reduces successful prompt injections on evaluated benchmarks, enabling safer prompt-driven products and lowering the risk of harmful or leaking responses.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

Sentinel is a binary classifier built by fine-tuning answerdotai/ModernBERT-large to detect prompt-injection and jailbreak prompts. It is trained on a large mix of open and private datasets (≈70% benign / 30% jailbreak) and evaluated on a held-out internal test set and four public benchmarks. Sentinel scores 0.987 accuracy and 0.980 F1 on the internal test set and an average F1 of 0.938 on the public benchmarks. It outperforms a strong DeBERTa-v3 baseline on every evaluated set (internal F1 0.980 vs 0.728; average public F1 0.938 vs 0.709). Limitations: private training data reduces reproducibility, and highly novel attacks may evade detection.

Problem Statement

LLMs can be tricked by malicious inputs that hide instructions (prompt injection). Existing detectors often overfit to known attacks and fail on new or diverse jailbreaks. The paper aims to build a robust detector that generalizes across varied injection styles.

Main Contribution

A production-ready detector (qualifire/prompt-injection-sentinel) fine-tuned from ModernBERT-large for binary jailbreak detection.

A curated multi-source training corpus combining several open datasets and a private synthetic set, with a 70/30 benign-to-jailbreak composition.

Key Findings

High internal detection accuracy and F1.

Numbers: AvgAcc 0.987, F1 0.980 on internal held-out test

Practical Use: Use Sentinel to improve prompt-injection detection on similarly distributed inputs and datasets.

Evidence Ref: Table 1 (Internal Held-Out Test Set)

Strong cross-benchmark generalization on public datasets.

Numbers: Avg public F1 0.938 vs baseline 0.709 (Δ≈0.229)

Practical Use: Expect better detection across multiple standard jailbreak benchmarks compared to the evaluated DeBERTa baseline.

Evidence Ref: Table 2 (Public Benchmarks)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
AvgAcc (internal)	0.987 (Sentinel)	0.848 (DeBERTa baseline)	+0.139	Internal held-out test set (Table 1)	Sentinel AvgAcc 0.987 vs 0.848 for baseline	Table 1
F1 (internal)	0.980 (Sentinel)	0.728 (DeBERTa baseline)	+0.252	Internal held-out test set (Table 1)	Sentinel F1 0.980 vs 0.728 for baseline	Table 1
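The deltas reported above are plain differences between Sentinel and the DeBERTa baseline. A quick arithmetic check, using only the figures quoted in this summary (the dictionary keys and the `delta` helper are illustrative names, not from the paper):

```python
# Figures copied from the Results table and the Key Findings section.
# Each entry is (Sentinel score, DeBERTa baseline score).
reported = {
    "AvgAcc (internal)": (0.987, 0.848),
    "F1 (internal)": (0.980, 0.728),
    "Avg F1 (public)": (0.938, 0.709),
}


def delta(sentinel: float, baseline: float) -> float:
    """Absolute improvement, rounded to three decimals like the table."""
    return round(sentinel - baseline, 3)


deltas = {name: delta(s, b) for name, (s, b) in reported.items()}

for name, d in deltas.items():
    print(f"{name}: +{d:.3f}")
```

The three values come out as +0.139, +0.252 and +0.229, which agrees with the table and with the Δ≈0.229 quoted for the public benchmarks.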

What To Try In 7 Days

Run qualifire/prompt-injection-sentinel on your live prompt stream to measure current injection rates.

Compare Sentinel's outputs to your existing filter (or the protectai/deberta-v3-base-prompt-injection-v2 baseline) and track false positives vs false negatives.

Log mistaken cases and add new examples to a retraining queue to reduce blind spots.
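The three steps above could be sketched roughly as follows. This is a minimal sketch, not the authors' evaluation code: it assumes the model loads through the standard Hugging Face `text-classification` pipeline and that the positive class is labelled `jailbreak` (an assumption; check `id2label` on the model card). The `evaluate` helper and the `mistakes` queue are hypothetical names introduced for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# ASSUMPTION: name of the positive class; verify against the model's id2label.
POSITIVE_LABEL = "jailbreak"


def load_sentinel() -> Callable[[str], str]:
    """Wrap the Hugging Face pipeline as a simple text -> label function."""
    # Imported lazily so the rest of the module works without transformers.
    from transformers import pipeline

    clf = pipeline("text-classification", model="qualifire/prompt-injection-sentinel")
    return lambda text: clf(text, truncation=True)[0]["label"]


def evaluate(
    classify: Callable[[str], str],
    labelled: Iterable[Tuple[str, bool]],
) -> Tuple[Dict[str, float], List[Tuple[str, str]]]:
    """Tally TP/FP/FN/TN against ground truth and collect mistaken cases."""
    counts: Counter = Counter()
    mistakes: List[Tuple[str, str]] = []  # candidate retraining queue
    for text, is_attack in labelled:
        flagged = classify(text).lower() == POSITIVE_LABEL
        if flagged and is_attack:
            counts["tp"] += 1
        elif flagged and not is_attack:
            counts["fp"] += 1
            mistakes.append((text, "false_positive"))
        elif not flagged and is_attack:
            counts["fn"] += 1
            mistakes.append((text, "false_negative"))
        else:
            counts["tn"] += 1

    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    metrics = {
        "tp": tp, "fp": fp, "fn": fn, "tn": counts["tn"],
        "precision": precision, "recall": recall, "f1": f1,
    }
    return metrics, mistakes


# Example usage (downloads the model on first call):
#   metrics, mistakes = evaluate(load_sentinel(), my_labelled_prompts)
```

Because `evaluate` takes any text-to-label callable, the same function can score an existing filter on the same labelled sample, which keeps the false-positive and false-negative counts directly comparable.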

Optimization Features

Inference Optimization

Uses ModernBERT features such as FlashAttention and unpadding for faster inference.
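One way to opt into the faster attention path when loading the model is sketched below. It assumes the checkpoint loads with `AutoModelForSequenceClassification` and that the `attn_implementation` argument of `from_pretrained` in the transformers library is honoured for this model; the summary does not state how the authors configured inference. `pick_attn_implementation` and `load_model` are hypothetical helpers.

```python
import importlib.util

MODEL_ID = "qualifire/prompt-injection-sentinel"


def pick_attn_implementation(has_cuda: bool, has_flash_attn: bool) -> str:
    """Choose FlashAttention 2 only when a GPU and the flash-attn package exist."""
    return "flash_attention_2" if (has_cuda and has_flash_attn) else "sdpa"


def load_model():
    """Load tokenizer and model, falling back to SDPA attention when needed."""
    # Imported lazily so the helper above can be used without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    has_cuda = torch.cuda.is_available()
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    attn = pick_attn_implementation(has_cuda, has_flash_attn)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID,
        attn_implementation=attn,
        # FlashAttention 2 needs half precision; keep float32 on CPU.
        torch_dtype=torch.bfloat16 if has_cuda else torch.float32,
    )
    return tokenizer, model.to("cuda" if has_cuda else "cpu").eval()
```

The fallback keeps the same code path usable on CPU-only machines, at the cost of the speedup this section describes.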

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://huggingface.co/qualifire/prompt-injection-sentinel
https://huggingface.co/datasets/qualifire/Qualifire-prompt-injection-benchmark

Data URLs

https://huggingface.co/datasets/alespalla/chatbot-instruction-prompts
https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1
https://huggingface.co/datasets/OpenSafetyLab/Salad-Data
https://huggingface.co/datasets/deepset/prompt-injections
https://huggingface.co/datasets/reshabhs/SPML-Chatbot-Prompt-Injection
https://huggingface.co/datasets/jackhhao/jailbreak-classification
https://huggingface.co/datasets/lmsys/toxic-chat
https://huggingface.co/datasets/VMware/open-instruct

Risks & Boundaries

Limitations

May miss highly novel attack vectors not represented in training data.

Private dataset components prevent exact reproduction of training data and results.

When Not To Use

When you need provable or formally verified defenses against adversarial inputs.

When your threat model expects entirely new attack families not in the training mix.

Failure Modes

False positives: unusual formatting, assertive or security-like phrasing in benign prompts.

False negatives: subtle adversarial phrasing that differs from known patterns.

Core Entities

Models

answerdotai/ModernBERT-large
qualifire/prompt-injection-sentinel
protectai/deberta-v3-base-prompt-injection-v2

Metrics

Accuracy, F1-score, Recall, Precision

Datasets

OpenSafetyLab/Salad-Data
alespalla/chatbot-instruction-prompts
microsoft/orca-agentinstruct-1M-v1
verazuo/jailbreak-llms
lmsys/toxic-chat
VMware/open-instruct
reshabhs/SPML-Chatbot-Prompt-Injection
qualifire-synthetics (private)
qualifire/Qualifire-prompt-injection-benchmark

Benchmarks

Internal held-out test set (10%)
allenai/wildjailbreak
jackhhao/jailbreak-classification
deepset/prompt-injections
qualifire/Qualifire-prompt-injection-benchmark
