242 papers found

Small, irrelevant changes to Theory-of-Mind vignettes make GPT-3.5 fail

1.00
1.00
1.00
79

Relying on LLMs' apparent commonsense reasoning can be risky: models may fail on small, realistic changes and produce misleading outputs in user-facing scenarios.

Key finding

Making an opaque container transparent still leads GPT-3.5 to predict that the agent holds a wrong belief about the container's contents.

Numbers: Variation 1A: P(chocolate)=95% vs P(popcorn)=0%

MGTBench: a modular benchmark that measures how well detectors spot and attribute text from modern LLMs and how brittle they are to attacks

0.60
0.40
0.50
31

Automated detection helps flag AI-written content that affects trust, compliance, or fraud; MGTBench identifies which detectors work, how much labelled data they need, and where they fail under attacks.

Key finding

Fine-tuned LM Detector gives the highest detection accuracy across datasets

Numbers: F1=0.993 (Essay, human vs ChatGPT-turbo)

LLMs favor certain option IDs, making multiple-choice evaluation brittle

0.60
0.50
0.70
22

MCQ-format evaluation and automated grading can be unstable: models may pick "A" or "C" by habit, producing misleading scores. Fixing this improves model reliability with minimal compute.

Key finding

Simple answer-moving changes cause large accuracy swings.

Numbers: gpt-3.5-turbo MMLU: 67.2 → 60.9 (−6.3) when the golden answer is moved to D; llama-30B: 53.1 → 68.2 (+15.2) when moved to A
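
The answer-moving test described above can be sketched as follows. This is a minimal illustration, not the paper's harness: `move_gold`, `accuracy_with_gold_at`, and the `ask_model` callable (assumed to return a single option letter) are hypothetical names introduced here.

```python
from typing import Callable, Sequence

LETTERS = "ABCD"

def move_gold(options: Sequence[str], gold_index: int, target_index: int) -> tuple[list[str], int]:
    """Reorder options so the gold answer sits at target_index.

    The other options keep their relative order.
    """
    reordered = list(options)
    gold = reordered.pop(gold_index)
    reordered.insert(target_index, gold)
    return reordered, target_index

def accuracy_with_gold_at(
    items: Sequence[tuple[str, Sequence[str], int]],
    target_index: int,
    ask_model: Callable[[str, list[str]], str],
) -> float:
    """Accuracy when every item's gold answer is moved to the same position.

    items holds (question, options, gold_index) triples. ask_model is a
    hypothetical stand-in for a real model call and returns a letter.
    """
    if not items:
        raise ValueError("items must not be empty")
    correct = 0
    for question, options, gold_index in items:
        reordered, new_gold = move_gold(options, gold_index, target_index)
        predicted = ask_model(question, reordered)
        correct += predicted == LETTERS[new_gold]
    return correct / len(items)
```

Comparing `accuracy_with_gold_at(items, 0, ask_model)` with `accuracy_with_gold_at(items, 3, ask_model)` gives the kind of accuracy swing reported above; a model with no option-ID bias should score about the same at every position.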

Chain-of-Utterances prompts reliably jailbreak LLMs; fine-tuning on curated safe conversations reduces harm.

0.50
0.60
0.40
16

CoU-style prompts can often bypass deployed guardrails; test and harden public-facing LLMs, or fine-tune smaller models on curated safe conversations to reduce harmful outputs without losing much utility.

Key finding

RED-EVAL jailbreaks widely deployed closed-source APIs frequently.

Numbers: GPT-4 ASR 0.651; ChatGPT ASR 0.728 on tested harmful prompts
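
For readers unfamiliar with the metric, the sketch below assumes ASR (attack success rate) is the fraction of attack prompts judged successful. The listing does not define it, so this definition and the name `attack_success_rate` are assumptions.

```python
from typing import Iterable

def attack_success_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of attack attempts judged successful.

    Each outcome is True when the model produced the harmful output.
    How success is judged (human or automatic) is outside this sketch.
    """
    results = [bool(o) for o in outcomes]
    if not results:
        raise ValueError("outcomes must not be empty")
    return sum(results) / len(results)
```

Under this assumed definition, an ASR of 0.651 would mean roughly 65 of every 100 tested harmful prompts succeeded.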

PromptBench: an open, modular Python library to run unified LLM evaluations, adversarial prompt tests, and dynamic protocols

0.70
0.50
0.40
13

A single, extensible evaluation toolkit reduces ad-hoc testing effort, surfaces robustness gaps, and speeds model selection for production-facing apps.

Key finding

PromptBench includes many evaluation assets: 12 task families and 22 public datasets.

Numbers: 12 tasks; 22 public datasets

LLMBAR: a stress test showing many LLM 'judges' miss true instruction following

0.50
0.45
0.60
11

If you use LLMs to replace humans for evaluation, test them on adversarial, instruction-focused pairs first: many evaluators prefer slick but incorrect outputs and can bias product metrics and model selection.

Key finding

Expert human annotators agree on LLMBAR labels at a very high rate.

Numbers: 94% overall agreement (90% NATURAL, 95% ADVERSARIAL)

A formal framework and first quantitative benchmark showing prompt-injection attacks are broadly effective and current defenses fall short

0.50
0.70
0.60
10

Any app that feeds external text into an LLM is attackable: attackers can hide instructions in user-provided data and change app outputs. Benchmarks show attacks work across models and tasks, and many defenses either miss attacks or break normal performance.

Key finding

Prompt injection attacks are broadly effective across tasks and models.

Numbers: Combined Attack ASV=0.62 and MR=0.78 averaged over 10 LLMs and 7×7 task pairs

BIPIA: a large benchmark and practical defenses for indirect prompt injection attacks on LLMs

0.70
0.60
0.50
9

External content can silently hijack LLM outputs. Measure exposure with BIPIA and add simple defenses now; full model fine-tuning yields stronger protection if you control the model.

Key finding

All evaluated LLMs show vulnerability to indirect prompt injection on BIPIA.

Numbers: Average overall ASR = 0.1179 (11.79%) on BIPIA (Table 2)

Trainable watermarking that injects more bits, preserves meaning, and resists removal

0.75
0.60
0.60
9

A practical watermarking layer lets API owners tag model outputs with recoverable signatures to prove origin, deter plagiarism, and monitor misuse without breaking text quality or adding large latency.

Key finding

REMARK-LLM embeds more signature bits per text than prior neural watermarking.

Numbers: ˜ more bits vs AWT on evaluated segments

A public benchmark that measures prompt injection, interpreter abuse, exploit generation, and a safety-utility tradeoff for LLMs

0.70
0.60
0.40
8

LLMs can betray system instructions and help abuse attached interpreters; measuring these behaviors helps product and security teams decide model choice, add guardrails, and quantify user experience tradeoffs.

Key finding

Prompt injections still succeed on modern models.

Numbers: Average injection success ≈ 28%; per-model range reported 13%–47%

Many top multimodal LLMs ignore explicit 'no' constraints and still draw the excluded object

0.40
0.40
0.30
7

If your product depends on images that must exclude certain content (safety, branding, legal), current multimodal LLMs can silently fail and even claim they succeeded; add verification or blocklisting before shipping.

Key finding

For the prompt 'Generate an image of an elephant with no tusks', no model produced a correct image in any tested run or language.

Numbers: 0/5 correct across tested runs and languages (Section 3.6; Table 1)

RAIN: align frozen LLMs at inference by self-evaluation and token rewinding

0.60
0.70
0.60
7

RAIN lets you reduce harmful or untruthful outputs from deployed LLMs without costly retraining or human labels; trade latency for safety and consider using RAIN-generated data to finetune if latency is critical.

Key finding

RAIN raised harmlessness of LLaMA 30B from 82% to 97% on the HH dataset.

Numbers: 82% → 97%

PsychoBench: 13 psychometric scales to profile LLM personality, motivation, relationships, and emotions

0.30
0.60
0.40
6

PsychoBench gives a repeatable way to describe how an LLM will sound and react, so teams can tune persona, anticipate safety shifts from prompts or alignment changes, and audit models before deployment.

Key finding

LLMs behave as more open, conscientious and extraverted than crowd norms.

Numbers: Openness: text-davinci-003 4.8 vs human 3.9 (Likert mean)

SheepDog: make fake-news detectors focus on content, not writing style, to resist LLM-based camouflage

0.60
0.65
0.45
5

Products that flag misinformation must be robust to attackers who use LLMs to change tone; adding style-variant training and content-focused cues reduces false negatives and improves trustworthiness.

Key finding

State-of-the-art text-only detectors suffer large drops under LLM style attacks.

Numbers: F1 drop up to 38.33%

A living, structured review of 144 open LLM safety datasets and gaps to close

0.60
0.50
0.40
4

Model safety claims are often evaluated on a narrow, inconsistent set of datasets (sometimes proprietary), so businesses should adopt a broader, open suite of safety tests to make reliable, comparable claims.

Key finding

Total datasets reviewed: 144 open text datasets.

Numbers: n=144 datasets (published Jun 2018–Dec 2024)

AttackEval: a 0–1 scoring framework and ground-truth dataset to measure jailbreak prompt effectiveness

0.60
0.50
0.40
3

Binary success/fail tests miss partial or stealthy jailbreaks. AttackEval gives a ranked, numeric view so teams can prioritize fixes, audit high-risk prompt types, and measure defense improvements over time.

Key finding

AttackEval produces continuous 0–1 scores that align with binary baselines but assign many prompts intermediate values.

Numbers: Aggregated halves match baseline ~70% (coarse-grained)
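
One way to read the coarse-grained comparison is sketched below. It assumes the "halves" are the score ranges below 0.5 and at or above 0.5; the listing does not state this, and `coarse_agreement` is a hypothetical helper.

```python
from typing import Sequence

def coarse_agreement(
    scores: Sequence[float],
    binary_labels: Sequence[int],
    threshold: float = 0.5,
) -> float:
    """Share of prompts where a thresholded 0-1 score matches a binary label.

    scores are continuous effectiveness scores in [0, 1].
    binary_labels are 1 (jailbreak succeeded) or 0 (failed) from a baseline.
    The 0.5 threshold is an assumption, not taken from the paper.
    """
    if len(scores) != len(binary_labels):
        raise ValueError("scores and binary_labels must have the same length")
    if not scores:
        raise ValueError("inputs must not be empty")
    matches = sum((s >= threshold) == bool(b) for s, b in zip(scores, binary_labels))
    return matches / len(scores)
```

Prompts scored near the threshold are where a continuous score and a binary label are most likely to disagree, which is also where the intermediate values carry the most information.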

Asking LLMs for pseudocode makes harmful outputs far more likely; small model edits make this worse.

0.40
0.50
0.60
3

If your product accepts or produces instruction-like outputs (code, pseudocode, how-to steps), it faces higher risk of harmful outputs and model edits can make that worse.

Key finding

Pseudocode prompts raise harmful outputs versus text answers.

Numbers: Pseudocode harmfulness increased by 238% in zero-shot across topics/models.

Simple prompts and filters can match finetuning on output-level 'unlearning' and expose benchmark blind spots

0.40
0.40
0.80
3

Guardrails (prompts and filters) are low-cost ways to hide or block sensitive outputs from API-accessible models; use them as quick mitigation, QA checks, or to generate finetuning data before spending on full retraining.

Key finding

Prompting halved the 'familiarity' score on LLaMA-2-7b for the Who's Harry Potter benchmark.

Numbers: ≈50% reduction vs baseline LLaMA-2-7b (Figure 2)
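
A minimal sketch of the two guardrail types named above, a prompt prefix and an output filter. The wording of the prefix, the function names, and the `[filtered]` replacement are illustrative assumptions, not the paper's implementation.

```python
import re
from typing import Sequence

def build_guarded_prompt(user_prompt: str, blocked_topic: str) -> str:
    """Prepend an instruction asking the model to act unfamiliar with a topic."""
    return (
        f"You have no knowledge of {blocked_topic}. "
        "If asked about it, say you are not familiar with it.\n\n"
        f"User: {user_prompt}"
    )

def filter_output(text: str, blocked_terms: Sequence[str], replacement: str = "[filtered]") -> str:
    """Replace case-insensitive occurrences of blocked terms in model output."""
    if not blocked_terms:
        return text
    # re.escape keeps terms with punctuation from being read as regex syntax.
    pattern = re.compile("|".join(re.escape(term) for term in blocked_terms), re.IGNORECASE)
    return pattern.sub(replacement, text)
```

Both functions act only on the model's inputs and outputs, so they hide content at the output level without changing what the model has stored.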

WILDGUARD: open multi-task moderator that matches GPT‑4 and cuts jailbreak success to near zero

0.70
0.60
0.70
3

WILDGUARD gives teams an open, deployable moderator that matches closed APIs on many safety checks, reduces jailbreak risk sharply, and lowers reliance on expensive third‑party moderation services.

Key finding

WILDGUARD strongly improves refusal detection versus open baselines.

Numbers: Refusal F1 +26.4 pts vs LibrAI-LongFormer-ref on WGTEST/XSTEST-RESP

Prompt-based attacks can make LLM agents loop or run wrong benign actions; some attacks hit >80% failure rates

0.30
0.60
0.70
3

Agents can be disabled or misused without obvious malicious text; prompt-injection can cause outages, wasted compute, or automated spamming and is hard to detect by LLM self-checks alone.

Key finding

Prompt-injection infinite-loop attacks raise failure rate substantially.

Numbers: Baseline 15.3% → Infinite loop ASR 59.4%

INDUST benchmark shows LLMs follow false premises; prompting them to critique both the user's request and their own answer fixes many failures

0.60
0.60
0.30
3

Products that reuse LLMs risk amplifying users' false assumptions; adding a brief critique prompt is a low-cost way to reduce misinformation and potential harm.

Key finding

LLMs often accept false premises and produce incorrect or unsafe outputs.

Numbers: Truthfulness ≈50% on QFP and ≈20% on CIFP for evaluated models
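
The critique prompt mentioned above could look something like the sketch below. The wording is illustrative only and is not the prompt used in the benchmark; `build_critique_prompt` is a hypothetical helper.

```python
def build_critique_prompt(user_request: str) -> str:
    """Wrap a user request with instructions to critique the request and the answer.

    The instruction text is an assumption made for illustration.
    """
    return (
        "Before answering, do two things:\n"
        "1. List any assumptions in the user's request that may be false.\n"
        "2. Critique your own draft answer for errors or unsafe content.\n"
        "Then give a final answer that corrects any false premise.\n\n"
        f"User request: {user_request}"
    )
```

The wrapped string is sent to the model in place of the raw request, so the only added cost is a longer prompt and response.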

A benchmark that measures whether LLMs follow prompts or their own memory when prompts conflict with stored knowledge

0.60
0.60
0.40
3

Knowing whether a model trusts prompts or its own memory changes how you design retrieval, prompts, and monitoring: pick high-RR models when you rely on fresh external data, or use instruction-tuned dependent models when you need strict prompt compliance.

Key finding

GPT-4 achieves the highest ability to use correct prompt facts (RR) and highest overall factual robustness (FR) on the KRE benchmark.

Numbers: GPT-4: VR=50, RR=81, FR≈66 (Table 9)

Semi-automatic pipeline: teach an LLM to generate high-quality attack prompts, then iteratively fine-tune models to refuse them

0.60
0.60
0.50
3

You can cheaply generate realistic jailbreak prompts and use a small iterative fine-tune to significantly reduce harmful outputs while keeping product capabilities intact.

Key finding

SAP30 attack set is far more effective than prior sets on evaluated LLMs.

Numbers: gpt-3.5-turbo harmful score: SAP30=8.70 vs Dual-Use=5.41 vs BAD+=0.63 (Table 1)