Stress Testing Papers — Parsed & Scored for Practitioners

Reprogram frozen LLMs to forecast time series using text prototypes and Prompt-as-Prefix

0.70

0.60

0.70

127

You can add time series forecasting to an existing LLM deployment with little extra training and often better accuracy in low-data and cross-domain cases.

Key finding

TIME-LLM achieves lower average long-term forecasting MSE than a fine-tuned LLM baseline (GPT4TS).

Numbers: ≈12% average MSE reduction vs GPT4TS on evaluated long-term benchmarks

Small, irrelevant changes to Theory-of-Mind vignettes make GPT-3.5 fail

1.00

79

Relying on LLMs' apparent commonsense reasoning can be risky: models may fail on small, realistic changes and produce misleading outputs in user-facing scenarios.

Key finding

When the opaque container is made transparent, GPT-3.5 still attributes the wrong belief about its contents to the agent.

Numbers: Variation 1A: P(chocolate)=95% vs P(popcorn)=0%

Jamba: hybrid Transformer + Mamba + MoE that fits long contexts in one 80GB GPU

0.30

0.80

0.70

40

Jamba lets teams process much longer documents on standard GPUs while cutting memory needs and speeding up inference; that reduces infrastructure cost for long-document products and enables new features like huge-context summarization.

Key finding

Hybrid Jamba reduces KV cache for 256K tokens to 4GB.

Numbers: KV cache (256K, 16bit): Jamba 4GB vs Mixtral 32GB vs Llama‑2 128GB
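The figures above are consistent with the usual KV-cache size estimate (keys plus values, per attention layer, per KV head). The sketch below is illustrative only: the layer counts, KV-head counts, and head dimension are assumptions based on the publicly released model configurations (a 7B-class Llama-2 with full multi-head attention, Mixtral 8x7B with grouped-query attention, and Jamba with attention in 4 of its 32 layers), not values stated in this summary.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV-cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit precision.
    """
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


SEQ_LEN = 256 * 1024  # 256K tokens
GIB = 1024 ** 3

# Assumed architecture parameters (not taken from the summary above).
configs = {
    "Llama-2 7B": dict(attn_layers=32, kv_heads=32, head_dim=128),
    "Mixtral 8x7B": dict(attn_layers=32, kv_heads=8, head_dim=128),
    "Jamba": dict(attn_layers=4, kv_heads=8, head_dim=128),
}

kv_cache_gib = {
    name: kv_cache_bytes(seq_len=SEQ_LEN, **cfg) / GIB
    for name, cfg in configs.items()
}

for name, size in kv_cache_gib.items():
    print(f"{name}: {size:.0f} GiB")
# Llama-2 7B: 128 GiB
# Mixtral 8x7B: 32 GiB
# Jamba: 4 GiB
```

Under these assumptions the saving comes from two factors: fewer KV heads (grouped-query attention) and far fewer attention layers, since the Mamba layers keep no KV cache.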

Small prompt formatting changes can swing LLM accuracy by tens of points

0.60

40

Small, innocuous prompt formatting choices can produce large and unpredictable swings in LLM performance, which can mislead model selection, harm user experience, or produce fragile products unless you test multiple formats.

Key finding

Prompt formatting alone can change accuracy by tens of points, with a spread of up to 76 points on a single model.

Numbers: Max spread 76 accuracy points (LLaMA-2-13B)
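One way to act on this is to score the same task under several semantically equivalent formats and report the spread rather than a single number. The sketch below is a minimal illustration, not the paper's method: `build_formats`, `accuracy_spread`, the chosen separators and casings, and the example accuracies are all hypothetical.

```python
from itertools import product


def build_formats(separators=(": ", " - ", ":\n"), casings=(str.title, str.upper)):
    """Return prompt templates that differ only in separator and field casing."""
    templates = []
    for sep, case in product(separators, casings):
        templates.append(f"{case('question')}{sep}{{q}}\n{case('answer')}{sep}")
    return templates


def accuracy_spread(accuracy_by_format):
    """Spread = best accuracy minus worst accuracy across formats."""
    values = list(accuracy_by_format.values())
    return max(values) - min(values)


formats = build_formats()
example_prompt = formats[0].format(q="What is 2 + 2?")

# Hypothetical accuracies, keyed by template index, for illustration only.
measured = {0: 0.61, 1: 0.48, 2: 0.55}
spread = accuracy_spread(measured)
```

Reporting the spread alongside the mean makes it harder for a lucky format choice to drive model selection.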

LLMs show some social reasoning but fail adversarial and robust tests

0.25

0.45

0.20

36

Don't assume LLMs understand people just because they give human-like answers; test models with adversarial and diverse benchmarks before using them for social judgments.

Key finding

Some models excel on narrow ToM-style tasks but not across the board.

Numbers: TriangleCOPA: flan-t5-xxl 96% vs MFC 52%

LLMs favor certain option IDs, making multiple-choice evaluation brittle

0.60

0.50

0.70

22

MCQ-format evaluation and automated grading can be unstable: models may pick "A" or "C" by habit, producing misleading scores. Fixing this improves model reliability with minimal compute.

Key finding

Simply moving the correct answer to a different option position causes large accuracy swings.

Numbers: gpt-3.5-turbo MMLU: 67.2 → 60.9 (−6.3) when the gold answer is moved to D; llama-30B: 53.1 → 68.2 (+15.2) when moved to A
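A simple check for this kind of position bias is to re-run the same questions with the correct option placed at each position in turn. The following is a sketch under stated assumptions: `move_gold` and `accuracy_by_gold_position` are hypothetical helpers, and `predict` stands in for whatever function queries your model and returns the index of the chosen option.

```python
def move_gold(options, gold_index, target_index):
    """Return a copy of options with the gold option moved to target_index."""
    reordered = list(options)
    gold = reordered.pop(gold_index)
    reordered.insert(target_index, gold)
    return reordered


def accuracy_by_gold_position(items, predict):
    """Accuracy when the gold answer is forced into each option slot.

    items: list of (question, options, gold_index) tuples, all with the
           same number of options.
    predict: callable (question, options) -> index of the chosen option.
    """
    n_options = len(items[0][1])
    results = {}
    for pos in range(n_options):
        correct = 0
        for question, options, gold_index in items:
            reordered = move_gold(options, gold_index, pos)
            if predict(question, reordered) == pos:
                correct += 1
        results[chr(ord("A") + pos)] = correct / len(items)
    return results


# Toy example: a "model" that always answers A scores 100% only when
# the gold answer sits at A, which exposes the bias.
toy_items = [
    ("Capital of France?", ["Rome", "Paris", "Oslo", "Bern"], 1),
    ("2 + 2?", ["3", "5", "4", "6"], 2),
]
always_a = lambda question, options: 0
bias_report = accuracy_by_gold_position(toy_items, always_a)
```

An unbiased model should show roughly the same accuracy for every gold position; large differences between positions indicate the evaluation is measuring option preference as well as knowledge.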

ChatGPT can track multi-turn dialogue states zero-shot, but struggles with slot-filling and long conversations

0.40

0.35

0.30

21

ChatGPT can be used zero-shot to prototype multi-turn dialogue state tracking with JGA close to fine-tuned systems, but it is unreliable for precise slot extraction without careful prompt design and output checks.

Key finding

ChatGPT achieves competitive multi-turn DST but lags fine-tuned SOTA.

Numbers: MultiWOZ2.1 JGA 60.28% vs fine-tuned SOTA 61.02% (Table 3)
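For readers unfamiliar with the metric, joint goal accuracy (JGA) counts a dialogue turn as correct only when the entire predicted state matches the reference state. A minimal sketch, assuming each state is a dict of slot names to values; the function name and the example slots are illustrative and not taken from the paper.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted state equals the gold state.

    A single wrong, missing, or extra slot makes the whole turn incorrect.
    """
    if len(predicted_states) != len(gold_states):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold_states:
        return 0.0
    matches = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return matches / len(gold_states)


gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
]
pred = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},  # one slot wrong
]
jga = joint_goal_accuracy(pred, gold)  # 0.5
```

Because one slot error fails the whole turn, weak slot extraction can coexist with a JGA that still looks competitive on average, which is why output checks on individual slots remain useful.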

A tiny common-sense math prompt exposes dramatic, inconsistent reasoning in many SOTA LLMs

0.20

0.40

0.20

20

High benchmark scores can hide brittle model behavior. Simple checks with structure-preserving variations catch failures that matter for reliability, safety, and customer trust.

Key finding

Most SOTA models fail or perform inconsistently on a simple common-sense problem.

Numbers: Majority of models p_correct < 0.2; GPT‑4o p=0.649, Claude 3 Opus p=0.431, many models p≈0

Long-context LLMs fail to learn reliably from very long in‑context demonstrations

0.40

0.50

0.40

17

If you depend on LLM few‑shot prompts for fine‑grained classification in long documents, current long‑context LLMs are unreliable; plan to fine‑tune or add retrieval/structured classifiers instead.

Key finding

On the hardest task (Discovery, 174 labels), almost all evaluated LLMs score ~0% accuracy; Gemini‑1.5‑Pro achieves 14% while a fine‑tuned BERT reaches 87%.

Numbers: Discovery: most models 0%; Gemini 14%; BERT fine-tuned 87%

Med-HALT: a public benchmark that tests LLM hallucinations on medical multiple-choice and PubMed retrieval tasks

0.20

0.55

0.30

17

If you plan to use LLMs for medical content or literature retrieval, expect frequent confident errors unless you add external retrieval, verification, or human oversight; Med‑HALT lets you measure that risk quantitatively.

Key finding

No model achieved clinical-grade accuracy on reasoning hallucination tests.

Numbers: Llama‑2 70B Reasoning FCT accuracy 42.21% (Table 2)

TravelPlanner: a realistic travel-planning benchmark — GPT-4 reaches only 0.6% full success on test tasks

1.00

0.70

0.60

13

Current LLM agents are not yet reliable enough to fully automate complex multi-constraint planning, but they can draft plans quickly and cut human effort if paired with verification and robust data collection.

Key finding

State-of-the-art LLMs largely fail to produce fully feasible travel plans.

Numbers: GPT-4 final pass rate = 0.6% on test set (two-stage)

AgentSims: a visual, multi-agent sandbox to build task-based LLM benchmarks quickly

0.50

0.60

13

AgentSims helps teams test language models in realistic, multi-step roles (e.g., mayor, employee). That reveals operational gaps not visible with static benchmarks and speeds prototyping for productized agents.

Key finding

Task-based evaluation reduces hackability, broadens tested abilities, and yields an objective pass rate.

Systematic comparison and new benchmarks for editing facts in LLMs

0.50

0.60

12

Model editing lets teams fix or update a deployed LLM quickly without expensive full retraining, but different editors trade off reliability, generalization, side effects, and ops cost.

Key finding

Memory-based and locate-edit methods can reach near-perfect scores on standard benchmarks but still fail to transfer edits reliably to related facts.

Numbers: SERAC: reliability 99.89% on COUNTERFACT (T5-XL)

Membership inference mostly fails on pretrained LLMs; apparent successes often come from dataset shifts

0.40

0.60

0.40

10

Most standard membership inference tests will not show large privacy leakage for models pre-trained at scale, but careless benchmark choices (e.g., temporally shifted non-members) can falsely signal leakage.

Key finding

Existing MIAs mostly fail against pre-trained LLMs.

Numbers: Most AUC ROC < 0.6 across domains (Table 1).
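To interpret the AUC figure: it is the probability that a randomly chosen training member receives a higher attack score than a randomly chosen non-member, so 0.5 is chance level. The sketch below computes it directly from that definition; the function name and the toy scores are illustrative assumptions, and the quadratic loop is meant for clarity rather than large datasets.

```python
def auc_roc(member_scores, nonmember_scores):
    """AUC as the fraction of (member, non-member) pairs ranked correctly.

    Ties count as half a win. Higher scores are assumed to mean
    "more likely a training member".
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Toy attack scores, for illustration only.
perfect = auc_roc([0.9, 0.8], [0.1, 0.2])   # 1.0
chance = auc_roc([0.6, 0.4], [0.5, 0.5])    # 0.5
```

When members and non-members come from different time periods, an attack can score well above 0.5 by detecting the distribution shift rather than membership, which is the failure mode the summary warns about.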

Use short dialogues (not static tests) to map where LLMs fail at everyday spatial reasoning

1.00

0.60

0.40

9

If your product relies on spatial commonsense (navigation, robotics, instructions), off-the-shelf LLMs can appear confident but make non-obvious errors; you must validate behavior with multi-turn tests before deployment.

Key finding

LLMs often answer fluently but make basic spatial mistakes.

Use LLMs to auto-generate hardware test inputs and recover coverage that random testing misses

0.55

0.65

0.45

9

LLM-driven stimulus generation can cut manual effort in hardware verification and replace inefficient random testing for many components, but it needs prompt tuning and careful model selection.

Key finding

For several modules, LLMs reached full (100%) coverage on the evaluated coverage plans.

Numbers: 100% coverage on Asynchronous FIFO & AMPLE Weight Bank (Table III)

Teach an LLM to 'forget' bad behaviors using only negative examples and cheap finetuning

0.60

0.80

9

If your priority is to stop a model from producing specific harmful or copyrighted outputs quickly and cheaply, unlearning cuts those outputs dramatically with only finetune-level compute and no costly human-written positive examples.

Key finding

Unlearning can reduce harmful output rates to near zero on evaluated harmful prompts.

Numbers: harmful rate 47% -> 1% (OPT-1.3B, Table 3)

Clinical LLM trained on hospital notes shows large generalization gaps across hospitals, ages, and comorbidity levels

0.40

0.30

0.60

8

A clinical LLM that does not generalize across hospitals or patient groups risks wrong predictions, worse care, and financial penalties for readmissions; small local fine-tuning often yields the best improvement for underperforming sites.

Key finding

On the temporal test split, the globally fine-tuned baseline reaches an AUC of 73.60%.

Numbers: AUC = 73.60% (temporal test)

A public benchmark that measures prompt injection, interpreter abuse, exploit generation, and a safety-utility tradeoff for LLMs

0.70

0.60

0.40

8

LLMs can betray system instructions and help abuse attached interpreters; measuring these behaviors helps product and security teams decide model choice, add guardrails, and quantify user experience tradeoffs.

Key finding

Prompt injections still succeed on modern models.

Numbers: Average injection success ≈ 28%; per-model range reported 13%–47%

Dr.Spider: 17 targeted perturbations reveal brittle text-to-SQL systems

0.40

0.60

0.40

8

Text-to-SQL systems that appear accurate in lab tests can silently fail in real use when users phrase questions differently or when schemas store data in alternate formats. That leads to wrong query results and bad UX. Dr.Spider helps find these blind spots before deployment.

Key finding

State-of-the-art text-to-SQL models suffer meaningful accuracy drops on Dr.Spider.

Numbers: Overall execution accuracy drop for best model (PICARD): 76.6% → 65.9% (10.7 points absolute, 14.0% relative)
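The absolute and relative figures describe the same drop in two ways, and the arithmetic can be checked directly. A small sketch; `accuracy_drop` is a hypothetical helper name.

```python
def accuracy_drop(before, after):
    """Return (absolute drop in points, relative drop in percent), rounded to 0.1."""
    absolute = before - after
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)


absolute_pts, relative_pct = accuracy_drop(76.6, 65.9)
print(absolute_pts, relative_pct)  # 10.7 14.0
```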

A tiny synthetic benchmark shows Transformers make sporadic, hard-to-fix memory errors

0.35

0.60

8

Sporadic, rare reasoning failures in Transformers can surface as hard-to-detect errors in production; fixing them needs better data coverage or architecture changes, not only hyperparameter tuning.

Key finding

Transformers exhibit a long, irregular tail of sporadic read errors (attention glitches) on FFLM.

Numbers: Observed across 10,625 Transformer runs; many nonzero o.o.d. glitch rates

LongBench — 21 long-text tasks (Chinese+English) to measure LLMs' long-context understanding up to tens of thousands of tokens

0.70

0.55

0.50

8

If you process long documents (reports, legal files, code repos), LongBench measures real-world long-context ability and shows whether models truly use long inputs or just memorize shortcuts.

Key finding

LongBench covers 21 datasets, 6 task categories, and 4,750 test instances with long contexts.

Numbers: 21 datasets; 6 categories; 4,750 instances; avg len 6,711 words (EN), 13,386 chars (ZH).

Many top multimodal LLMs ignore explicit 'no' constraints and still draw the excluded object

0.40

0.30

7

If your product depends on images that must exclude certain content (safety, branding, legal), current multimodal LLMs can silently fail and even claim they succeeded; add verification or blocklisting before shipping.

Key finding

For the prompt 'Generate an image of an elephant with no tusks', no model produced a correct image in any tested run or language.

Numbers: 0/5 correct across tested runs and languages (Section 3.6; Table 1)

LLM evaluations miss important variability: greedy often beats sampling, but best-of-N can unlock smaller models

0.60

0.70

7

Single-output LLM benchmarks can hide real-world variability. Testing multiple samples, greedy vs sampling, and best-of-N selection reveals reliability and can let smaller cheaper models match higher-cost models.

Key finding

Greedy decoding usually outperforms average sampling across most evaluated benchmarks.

Numbers: Multiple models: typical sampling std 0.3–2.5 and ∆ up to 27.5 points (Table 2)
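A rough way to reproduce this kind of comparison on your own data is to record per-example correctness for one greedy run and several sampling runs, then summarize them side by side. The sketch below is an assumption-laden illustration, not the paper's protocol: `summarize_decoding` is a hypothetical helper, correctness is binary, and "oracle best-of-N" here means an example counts as solved if any of its N samples is correct, which is an upper bound on what a real selector could achieve.

```python
from statistics import mean, pstdev


def summarize_decoding(greedy_correct, sampled_correct):
    """Compare greedy decoding with repeated sampling.

    greedy_correct: list of 0/1, one entry per example.
    sampled_correct: list of lists of 0/1, sampled_correct[i][j] is the
                     correctness of sample j on example i.
    """
    n_examples = len(greedy_correct)
    # Accuracy of each sampling run (column-wise over examples).
    per_run_accuracy = [mean(run) for run in zip(*sampled_correct)]
    return {
        "greedy": sum(greedy_correct) / n_examples,
        "sample_mean": mean(per_run_accuracy),
        "sample_std": pstdev(per_run_accuracy),
        "oracle_best_of_n": sum(1 for row in sampled_correct if any(row)) / n_examples,
    }


# Toy data: 4 examples, 2 samples each.
greedy = [1, 0, 1, 1]
sampled = [[1, 1], [0, 1], [1, 0], [0, 0]]
summary = summarize_decoding(greedy, sampled)
```

In this toy data greedy (0.75) beats the average sampling run (0.5), while the oracle best-of-N matches greedy, which mirrors the pattern the summary describes without reproducing any reported numbers.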