Measure LLM behavior without labels by testing how outputs change under simple text edits

Overview

Decision Snapshot: Needs Validation

The approach is low-cost and practical for monitoring. Evidence shows consistent correlations with standard benchmarks, but the scores can be confounded by memorization and model entropy.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can monitor key model behaviors on your own live or private data without building labeled test sets, enabling faster, cheaper, and continuously updated audits of knowledge, toxicity, and robustness.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead, Data Scientist

Summary TLDR

The paper introduces a practical, label-free evaluation framework for large language models (LLMs). Instead of human labels, it measures how model outputs change when inputs are transformed (e.g., add a negation, append profanity, swap words, break tokenization, or replace prior sentences). These 'sensitivity' scores correlate with standard benchmarks (TriviaQA, LAMBADA, Perspective API) and let teams monitor models on their own data (e.g., live logs) without curated datasets. Code is available.

Problem Statement

Current LLM evaluations rely on small, human-labeled benchmarks that are costly to create and can leak into model training data. This makes it hard to test model behavior on realistic, changing data or in production. The paper asks: can we evaluate key behaviors (knowledge, toxicity, context use, word order, tokenization robustness) in a self-supervised way, using only transformations of raw text?

Main Contribution

A simple, general procedure to build self-supervised "sensitivity" metrics by comparing model outputs on original vs. transformed text.

Five case-study metrics: negation-based knowledge probing, profanity-triggered toxicity (F-Bomb), long-range context sensitivity (LRS), word-order sensitivity, and tokenization robustness.
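The general procedure can be sketched in a few lines: score each text, score its transformed copy, and average the change. This is a minimal illustration, not the paper's implementation. The function names, the absolute-difference comparison, and the toy score (a word count standing in for a real model statistic such as log-perplexity) are all assumptions made for the example.

```python
from statistics import mean
from typing import Callable, Sequence


def sensitivity(
    texts: Sequence[str],
    transform: Callable[[str], str],
    score: Callable[[str], float],
) -> float:
    """Mean absolute change in `score` between each text and its transformed copy."""
    if not texts:
        raise ValueError("need at least one text")
    return mean(abs(score(transform(t)) - score(t)) for t in texts)


# Hypothetical stand-ins so the sketch runs without a model.
def toy_score(text: str) -> float:
    """Pretend model statistic: the number of whitespace-separated words."""
    return float(len(text.split()))


def append_word(text: str) -> str:
    """Toy transformation: append a single extra word."""
    return text + " indeed"


texts = ["Paris is in France.", "Water boils at 100 C."]
result = sensitivity(texts, append_word, toy_score)
print(result)  # 1.0, since each text gains exactly one word
```

In practice `score` would wrap a call to the model under test, and `transform` would be one of the five case-study transformations.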

Key Findings

A negation-based "Sensitivity Score" closely tracks TriviaQA accuracy across many models.

Numbers: Sensitivity computed on 1000 examples, standard error < 0.002; plotted with a sqrt-like fit against TriviaQA accuracy

Practical Use: You can approximate knowledge-benchmark performance without labeled QA pairs by measuring how perplexity changes when facts are negated.

Evidence Ref: Section 4, Figure 3, Appendix A.1
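One plausible way to compute such a score is sketched below: negate each sentence with a simple rule, obtain per-token log-probabilities for both versions from the model, and average the difference in log-perplexity. The copula-based negation rule and the exact form of the score are assumptions for illustration; the paper's own rules and formula may differ. The log-probabilities are hard-coded here, whereas a real run would obtain them from the model being evaluated.

```python
import re
from statistics import mean

# Assumed negation rule: insert "not" after the first copula.
_COPULA = re.compile(r"\b(is|are|was|were)\b")


def negate(sentence: str) -> str:
    """Return the sentence with 'not' after its first copula, or unchanged if none."""
    return _COPULA.sub(lambda m: m.group(0) + " not", sentence, count=1)


def log_perplexity(token_logprobs):
    """Mean negative log-probability per token (natural log)."""
    return -mean(token_logprobs)


def negation_sensitivity(pairs):
    """Average log-perplexity increase from original to negated sentences.

    `pairs` holds (original_token_logprobs, negated_token_logprobs) tuples.
    """
    return mean(log_perplexity(neg) - log_perplexity(orig) for orig, neg in pairs)


negated = negate("Paris is the capital of France.")
print(negated)  # Paris is not the capital of France.

# Made-up log-probabilities standing in for real model output.
pairs = [([-1.0, -1.0], [-2.0, -2.0]), ([-0.5, -1.5], [-1.0, -2.0])]
score = negation_sensitivity(pairs)
print(score)  # 0.75
```

A larger value means the model finds negated facts noticeably less likely than the originals, which is the behavior the finding associates with higher TriviaQA accuracy.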

A simple F-bomb transformation measures a model's stoicism under profanity, and the resulting score correlates with Perspective API toxicity scores.

Numbers: Correlation observed between fraction of toxic generations and Perspective API (no single r reported)

Practical Use: To catch models that mimic user profanity, append a short profanity trigger and monitor generated text or next-token probabilities instead of using a changing external API.

Evidence Ref: Section 5, Figures 6-7, Appendix A.2
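The generated-text variant of this check can be sketched as follows: append a trigger word to each prompt, generate a continuation, and report the fraction of continuations that contain any word from a bad-words list. Everything named here is hypothetical: the mild placeholder words stand in for a real list such as the LDNOOBW file, and `generate` is any caller-supplied function that maps a prompt to model output.

```python
import re

# Placeholder terms standing in for a real bad-words list.
BAD_WORDS = {"darn", "heck"}
TRIGGER = "darn"  # hypothetical trigger word


def append_trigger(prompt, trigger=TRIGGER):
    """Append the trigger word to the end of the prompt."""
    return f"{prompt.rstrip()} {trigger}"


def contains_bad_word(text, bad_words=BAD_WORDS):
    """True if any lowercase word in `text` is on the bad-words list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(tok in bad_words for tok in tokens)


def toxic_fraction(prompts, generate, bad_words=BAD_WORDS):
    """Fraction of triggered prompts whose generation contains a listed word."""
    if not prompts:
        raise ValueError("need at least one prompt")
    flagged = sum(
        contains_bad_word(generate(append_trigger(p)), bad_words) for p in prompts
    )
    return flagged / len(prompts)


# Two toy "models": one mirrors the last word it saw, one ignores it.
def mirror(prompt):
    return prompt.split()[-1]


def stoic(prompt):
    return "I can help with that."


prompts = ["How do I reset my password?", "Where is my order?"]
print(toxic_fraction(prompts, mirror))  # 1.0
print(toxic_fraction(prompts, stoic))   # 0.0
```

Because the list and trigger are fixed locally, the score stays comparable over time, unlike a score from an external API whose behavior may change.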

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Sensitivity Score (negations) vs TriviaQA	Strong positive correlation; tracks sqrt-like relationship	TriviaQA accuracy	—	Wikipedia (1000 examples)	Figure 3 shows sensitivity tracks TriviaQA; sensitivity std error < 0.002	Section 4, Figure 3, Appendix A.1
F-Bomb Toxicity Metric vs Perspective API	Close correlation between fraction of toxic generations and Perspective API scores	Perspective API >= 0.5	—	Wikipedia prompts	Figures 6-7 show alignment between self-supervised toxic-generation counts and API scores	Section 5, Figures 6-7, Appendix A.2

What To Try In 7 Days

Run the negation sensitivity test on a sample of your support transcripts to flag models that ignore factual flips.

Append a profanity trigger to real user prompts and measure whether the model mirrors profanity; prioritize fixes if it does.

Swap prior sentences in product FAQs and measure context sensitivity to check if the model uses long-range context correctly.
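The context-swap check from the last item can be sketched as follows: split each passage into prior sentences and a final sentence, replace the prior sentences with ones from an unrelated document, and compare how well the final sentence is scored under each context. The regex sentence splitter, the function names, and the word-overlap score are assumptions for illustration; a real run would replace `toy_score` with the model's log-probability of the final sentence given the context.

```python
import re
from statistics import mean


def split_context(passage):
    """Split a passage into (list of prior sentences, final sentence)."""
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    return sentences[:-1], sentences[-1]


def context_sensitivity(passages, donors, score_final):
    """Mean drop in score_final(context, final) when the true context is swapped out."""
    diffs = []
    for passage, donor in zip(passages, donors):
        prior, final = split_context(passage)
        donor_prior, _ = split_context(donor)
        swapped = donor_prior[: len(prior)]  # same number of sentences, wrong topic
        original_score = score_final(" ".join(prior), final)
        swapped_score = score_final(" ".join(swapped), final)
        diffs.append(original_score - swapped_score)
    return mean(diffs)


def toy_score(context, final):
    """Hypothetical stand-in for a model: count words shared by context and final."""
    words = lambda s: set(re.findall(r"[a-z]+", s.lower()))
    return float(len(words(context) & words(final)))


faq = "The router has two ports. Port A is for WAN. Connect the modem to Port A."
other = "Refunds take five days. Contact billing for help. Keep your receipt."
lrs = context_sensitivity([faq], [other], toy_score)
print(lrs > 0)  # True: the final sentence fits its own context better
```

A score near zero would suggest the model's prediction of the final sentence barely depends on the preceding sentences.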

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/neelsjain/BYOD

Data URLs

https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en

Risks & Boundaries

Limitations

The negation metric assumes the corpus contains many true factual sentences; results are less meaningful on fiction or neutral text.

The toxicity test focuses on explicit profanity and uses a predefined bad-words list; it may miss nuanced toxicity.

When Not To Use

When you need human-graded ground-truth labels for nuanced judgments.

On corpora dominated by neutral or fictional sentences where negation flips are not meaningful.

Failure Modes

Model entropy (over- or under-confident outputs) can mask true sensitivity.

Memorization of training data can make a model appear invariant even when behavior is incorrect.

Core Entities

Models

Pythia, GPT-2, GPT-J, GPT-Neo, OPT, LLaMA, MPT, Dolly, Vicuna, WizardLM, OpenAI API (ada, davinci, text-ada-001), Cohere (command)

Metrics

Sensitivity Score (negations), F-Bomb Toxicity Metric, LRS Score (long-range context), Word Order Score (swap JSD), Tokenization Sensitivity (broken-token JSD)
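The two JSD-based metrics compare the model's output distribution before and after a transformation using the Jensen-Shannon divergence. The sketch below shows a base-2 JSD (bounded between 0 and 1) and one possible word-swap transformation. Which words are swapped, and over which distributions the divergence is taken, are assumptions here rather than details from the paper, and the probability vectors are made up.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two distributions, in [0, 1] for base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def swap_adjacent_words(text, i):
    """Swap the words at positions i and i + 1; leave the text alone if out of range."""
    words = text.split()
    if 0 <= i < len(words) - 1:
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


print(swap_adjacent_words("the cat sat", 0))  # cat the sat

# Made-up next-token distributions for the original and swapped inputs.
same = jsd([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
disjoint = jsd([1.0, 0.0], [0.0, 1.0])
print(same, disjoint)  # 0.0 1.0
```

A divergence near 0 means the transformation left the model's prediction unchanged, while values near 1 mean the prediction changed almost completely.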

Datasets

Wikipedia, TriviaQA, LAMBADA, LDNOOBW profanity list

Benchmarks

TriviaQA, LAMBADA, Perspective API (toxicity)

