Measure LLM behavior without labels by testing how outputs change under simple text edits

June 23, 2023 | 9 min

Overview

Decision Snapshot: Needs Validation

The approach is low-cost and practical for monitoring; evidence shows consistent correlations with standard benchmarks, but performance can be confounded by memorization and model entropy.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein

Why It Matters For Business

You can monitor key model behaviors on your own live or private data without building labeled test sets, enabling faster, cheaper, and continuously updated audits of knowledge, toxicity, and robustness.

Summary TLDR

The paper introduces a practical, label-free evaluation framework for large language models (LLMs). Instead of human labels, it measures how model outputs change when inputs are transformed (e.g., add a negation, append profanity, swap words, break tokenization, or replace prior sentences). These 'sensitivity' scores correlate with standard benchmarks (TriviaQA, LAMBADA, Perspective API) and let teams monitor models on their own data (e.g., live logs) without curated datasets. Code is available.

Problem Statement

Current LLM evaluations rely on small, human-labeled benchmarks that are costly to create and can leak into model training data. This makes it hard to test model behavior on realistic, changing data or in production. The paper asks: can key behaviors (knowledge, toxicity, context use, word order, tokenization robustness) be evaluated in a self-supervised way, using only transformations of raw text?

Main Contribution

A simple, general procedure to build self-supervised "sensitivity" metrics by comparing model outputs on original vs. transformed text.

Five case-study metrics: negation-based knowledge probing, profanity-triggered toxicity (F-Bomb), long-range context sensitivity (LRS), word-order sensitivity, and tokenization robustness.
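The general procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the names `sensitivity`, `swap_adjacent_words`, and `toy_score` are hypothetical, and `toy_score` is only a stand-in for a model-derived quantity such as log-perplexity.

```python
from statistics import mean
from typing import Callable, Iterable


def swap_adjacent_words(text: str) -> str:
    """Illustrative transformation: swap the first two words of the text."""
    words = text.split()
    if len(words) >= 2:
        words[0], words[1] = words[1], words[0]
    return " ".join(words)


def sensitivity(
    texts: Iterable[str],
    transform: Callable[[str], str],
    score: Callable[[str], float],
) -> float:
    """Mean change in a score between transformed and original text."""
    return mean(score(transform(t)) - score(t) for t in texts)


def toy_score(text: str) -> float:
    """Stand-in for a model score (assumption): index of the word 'the'."""
    words = text.lower().split()
    return float(words.index("the")) if "the" in words else 0.0


corpus = ["the cat sat", "a dog ran"]
result = sensitivity(corpus, swap_adjacent_words, toy_score)
```

Keeping `transform` and `score` as plain callables means the same loop could serve each of the five case studies by swapping in a different transformation and a different comparison.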

Key Findings

A negation-based "Sensitivity Score" closely tracks TriviaQA accuracy across many models.

Numbers: 1000-example sensitivity, std error < 0.002; plotted sqrt-like fit vs TriviaQA

Practical Use: You can approximate knowledge-benchmark performance without labeled QA pairs by measuring how perplexity changes when facts are negated.

Evidence Ref: Section 4, Figure 3, Appendix A.1
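A rough sketch of the negation test follows, assuming the score is the mean change in log-perplexity between negated and original sentences (the exact formula is not reproduced in this summary). The `negate` rule is deliberately naive, and `toy_logprobs` is a hypothetical stand-in for per-token log-probabilities from a real model.

```python
import math
import re
from typing import Callable, List, Sequence


def negate(sentence: str) -> str:
    """Naive negation (assumption): add 'not' after the first is/was/are/were."""
    return re.sub(r"\b(is|was|are|were)\b", r"\1 not", sentence, count=1)


def log_perplexity(token_logprobs: Sequence[float]) -> float:
    """Log-perplexity is the negative mean per-token log-probability."""
    return -sum(token_logprobs) / len(token_logprobs)


def negation_sensitivity(
    sentences: Sequence[str],
    logprob_fn: Callable[[str], List[float]],
) -> float:
    """Mean increase in log-perplexity when each sentence is negated."""
    diffs = [
        log_perplexity(logprob_fn(negate(s))) - log_perplexity(logprob_fn(s))
        for s in sentences
    ]
    return sum(diffs) / len(diffs)


def toy_logprobs(sentence: str) -> List[float]:
    """Hypothetical model: 'not' is an unlikely token, every other token is likely."""
    return [
        math.log(0.01) if word == "not" else math.log(0.5)
        for word in sentence.split()
    ]


negated = negate("Paris is the capital of France.")
score = negation_sensitivity(["Paris is the capital of France."], toy_logprobs)
```

A model that knows the fact should find the negated sentence more surprising, so a larger positive score suggests stronger factual knowledge on that corpus.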

A simple F-bomb transformation measures how stoic a model stays when its prompt contains profanity, and the resulting score correlates with Perspective API toxicity scores.

Numbers: Correlation observed between fraction of toxic generations and Perspective API (no single r reported)

Practical Use: To catch models that mimic user profanity, append a short profanity trigger and monitor generated text or next-token probabilities instead of using a changing external API.

Evidence Ref: Section 5, Figures 6-7, Appendix A.2
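The profanity check can be sketched as below. Everything model-related here is an assumption: `BAD_WORDS` and `TRIGGER` use mild placeholder words in place of a real bad-words list, and `mirroring_model` and `stoic_model` are hypothetical stand-ins for a text-generation call.

```python
import re
from typing import Callable, Iterable, Set

BAD_WORDS: Set[str] = {"darn", "heck"}  # placeholder list (assumption)
TRIGGER = " darn"  # placeholder profanity trigger (assumption)


def contains_bad_word(text: str, bad_words: Set[str]) -> bool:
    """True if any lowercase word in the text appears in the bad-words list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in bad_words for token in tokens)


def toxic_fraction(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    trigger: str = TRIGGER,
    bad_words: Set[str] = BAD_WORDS,
) -> float:
    """Fraction of generations containing a bad word after the trigger is appended."""
    prompt_list = list(prompts)
    hits = sum(
        contains_bad_word(generate(prompt + trigger), bad_words)
        for prompt in prompt_list
    )
    return hits / len(prompt_list)


def mirroring_model(prompt: str) -> str:
    """Hypothetical model that echoes the last word of the prompt."""
    return "well, " + prompt.split()[-1] + " indeed"


def stoic_model(prompt: str) -> str:
    """Hypothetical model that ignores the trigger."""
    return "Here is a neutral continuation."


prompts = ["The weather today", "My order arrived"]
mirror_rate = toxic_fraction(prompts, mirroring_model)
stoic_rate = toxic_fraction(prompts, stoic_model)
```

Only the generated text is scanned, not the prompt, so the trigger itself never counts as a hit.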

Results

Metric: Sensitivity Score (negations) vs TriviaQA
Value: Strong positive correlation; tracks sqrt-like relationship
Baseline: TriviaQA accuracy
Delta: not reported
Split / Dataset: Wikipedia (1000 examples)
Evidence: Figure 3 shows sensitivity tracks TriviaQA; sensitivity std error < 0.002
Evidence Ref: Section 4, Figure 3, Appendix A.1

Metric: F-Bomb Toxicity Metric vs Perspective API
Value: Close correlation between fraction of toxic generations and Perspective API scores
Baseline: Perspective API >= 0.5
Delta: not reported
Split / Dataset: Wikipedia prompts
Evidence: Figures 6-7 show alignment between self-supervised toxic-generation counts and API scores
Evidence Ref: Section 5, Figures 6-7, Appendix A.2

What To Try In 7 Days

Run the negation sensitivity test on a sample of your support transcripts to flag models that ignore factual flips.

Append a profanity trigger to real user prompts and measure whether the model mirrors profanity; prioritize fixes if it does.

Swap prior sentences in product FAQs and measure context sensitivity to check if the model uses long-range context correctly.
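The context check in the last item can be prototyped as follows. This is a sketch, not the paper's LRS definition: it assumes context sensitivity can be approximated by how much the log-likelihood of the final sentence drops when the earlier sentences are replaced. `toy_cond_logprob` is a hypothetical stand-in for a model's conditional log-probability.

```python
import re
from typing import Callable, List


def replace_prior(doc: List[str], donor: List[str]) -> List[str]:
    """Keep the last sentence; replace earlier ones with donor sentences."""
    n_prior = len(doc) - 1
    return list(donor[:n_prior]) + [doc[-1]]


def context_sensitivity(
    doc: List[str],
    donor: List[str],
    cond_logprob: Callable[[str, str], float],
) -> float:
    """Drop in log-likelihood of the last sentence after its context is replaced."""
    target = doc[-1]
    original = cond_logprob(" ".join(doc[:-1]), target)
    swapped_doc = replace_prior(doc, donor)
    swapped = cond_logprob(" ".join(swapped_doc[:-1]), target)
    return original - swapped


def toy_cond_logprob(context: str, target: str) -> float:
    """Hypothetical model: target words already seen in the context cost nothing."""
    seen = set(re.findall(r"\w+", context.lower()))
    return sum(0.0 if w in seen else -1.0 for w in re.findall(r"\w+", target.lower()))


faq = ["Our router supports mesh mode.", "Enable mesh mode in settings."]
unrelated = ["Refunds take five business days."]
drop = context_sensitivity(faq, unrelated, toy_cond_logprob)
```

A value near zero would suggest the model is ignoring the earlier sentences, while a clearly positive value suggests it relies on them.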

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Negation metric assumes corpus has many true factual sentences; results are less meaningful on fiction or neutral text.

Toxicity test focuses on explicit profanity and uses a predefined bad-words list; it may miss nuanced toxicity.

When Not To Use

When you need human-graded ground-truth labels for nuanced judgments.

On corpora dominated by neutral or fictional sentences where negation flips are not meaningful.

Failure Modes

Model entropy (over- or under-confident outputs) can mask true sensitivity.

Memorization of training data can make a model appear invariant even when behavior is incorrect.

Core Entities

Models

Pythia, GPT-2, GPT-J, GPT-Neo, OPT, LLaMA, MPT, Dolly, Vicuna, WizardLM, OpenAI API (ada, davinci, text-ada-001), Cohere (command)

Metrics

Sensitivity Score (negations), F-Bomb Toxicity Metric, LRS Score (long-range context), Word Order Score (swap JSD), Tokenization Sensitivity (broken-token JSD)
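Two of the metrics above are described as JSD-based. As a reference point, here is a small sketch of the standard Jensen-Shannon divergence between two discrete distributions, such as next-token distributions before and after a word swap. How the paper aggregates it across positions is not stated in this summary, so treat the function names and usage as assumptions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


same = jsd([0.7, 0.3], [0.7, 0.3])
disjoint = jsd([1.0, 0.0], [0.0, 1.0])
```

JSD is symmetric and bounded by log 2 in nats, which makes scores comparable across models with different vocabularies of probability mass.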

Datasets

Wikipedia, TriviaQA, LAMBADA, LDNOOBW profanity list

Benchmarks

TriviaQA, LAMBADA, Perspective API (toxicity)