Measure LLM behavior without labels by testing how outputs change under simple text edits

June 23, 2023 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

12

Authors

Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein

Why It Matters For Business

You can monitor key model behaviors on your own live or private data without building labeled test sets, enabling faster, cheaper, and continuously updated audits of knowledge, toxicity, and robustness.

Summary TLDR

The paper introduces a practical, label-free evaluation framework for large language models (LLMs). Instead of human labels, it measures how model outputs change when inputs are transformed (e.g., add a negation, append profanity, swap words, break tokenization, or replace prior sentences). These 'sensitivity' scores correlate with standard benchmarks (TriviaQA, LAMBADA, Perspective API) and let teams monitor models on their own data (e.g., live logs) without curated datasets. Code is available.
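To make the idea of an input transformation concrete, here is a minimal sketch of four of the edits named above. These are illustrative stand-ins, not the paper's released code: the function names, the small auxiliary-verb list, and the exact splitting rule are all assumptions chosen for brevity.

```python
import random
from typing import List


def add_negation(sentence: str) -> str:
    """Insert 'not' after the first auxiliary verb; return the input unchanged if none is found."""
    # Assumed, deliberately tiny list of auxiliaries; a real rule set would be larger.
    auxiliaries = {"is", "are", "was", "were", "can", "will", "has", "have", "does", "did"}
    words = sentence.split()
    for i, word in enumerate(words):
        if word.lower() in auxiliaries:
            return " ".join(words[: i + 1] + ["not"] + words[i + 1:])
    return sentence


def swap_two_words(sentence: str, rng: random.Random) -> str:
    """Swap two randomly chosen word positions (a single random swap)."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def split_word(word: str, position: int) -> str:
    """Insert a space inside a word so a tokenizer sees two fragments instead of one token."""
    if not 0 < position < len(word):
        return word
    return word[:position] + " " + word[position:]


def replace_prior_sentences(sentences: List[str], donors: List[str], rng: random.Random) -> List[str]:
    """Keep the final sentence and replace every earlier one with a random donor sentence."""
    return [rng.choice(donors) for _ in sentences[:-1]] + sentences[-1:]
```

Each function takes raw text and returns edited text, so none of them needs labels; the measurement step only has to compare the model's output on the two versions.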

Problem Statement

Current LLM evaluations rely on small, human-labeled benchmarks that are costly to create and can leak into model training data. This makes it hard to test model behavior on realistic, changing data or in production. The paper asks: can we evaluate key behaviors (knowledge, toxicity, context use, word order, tokenization robustness) in a self-supervised way, using only transformations of raw text?

Main Contribution

A simple, general procedure to build self-supervised "sensitivity" metrics by comparing model outputs on original vs. transformed text.

Five case-study metrics: negation-based knowledge probing, profanity-triggered toxicity (F-Bomb), long-range context sensitivity (LRS), word-order sensitivity, and tokenization robustness.

Empirical validation showing these sensitivity scores correlate with standard supervised benchmarks (TriviaQA, LAMBADA, Perspective API) across many public and API models.

Practical guidance and open-source code (GitHub) so teams can run these checks on their own corpora or live traffic.
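The general procedure in the first contribution can be sketched as a single function: score each text, score its transformed version, compare the two, and average. This is a hedged outline rather than the paper's implementation. The `score` callable is a hypothetical stand-in for a model call (for example, a function returning log-perplexity), and `compare` is an assumed, pluggable comparison such as a difference or a divergence.

```python
from statistics import mean
from typing import Callable, Iterable, TypeVar

Output = TypeVar("Output")


def sensitivity(
    texts: Iterable[str],
    transform: Callable[[str], str],
    score: Callable[[str], Output],
    compare: Callable[[Output, Output], float],
) -> float:
    """Average change in model output between original and transformed text.

    `score` is a placeholder for whatever the model returns on a text;
    `compare` turns an (original, transformed) output pair into one number.
    """
    texts = list(texts)
    if not texts:
        raise ValueError("need at least one text")
    return mean(compare(score(t), score(transform(t))) for t in texts)
```

With a real model, `score` might wrap a perplexity computation and `compare` might return the difference between the transformed and original values; which comparison is appropriate depends on the behavior being measured.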

Key Findings

A negation-based "Sensitivity Score" closely tracks TriviaQA accuracy across many models.

Numbers: 1000-example sensitivity, std error < 0.002; plotted sqrt-like fit vs TriviaQA

A simple F-bomb transformation measures how stoic a model stays when profanity is appended to the prompt, and the resulting metric correlates with Perspective API toxicity scores.

Numbers: Correlation observed between fraction of toxic generations and Perspective API (no single r reported)

Long-range context sensitivity (LRS) correlates with LAMBADA performance; larger models are generally more context-sensitive.

Numbers: 1000-example LRS, std error ≈ 0.002; positive correlation with LAMBADA (Figure 8)

Word-order sensitivity (random 1-swap) correlates with LRS and increases with model size and instruction finetuning.

Numbers: Median JSD on 5k examples; mean std error would be ~0.002

Tokenization robustness improves with more training exposure; OPT models (trained on fewer tokens) are the most sensitive.

Numbers: Mean JSD over 1k examples, std error ≈ 0.005; negative trend vs training FLOPs/tokens (Figures 10, 22)

Instruction finetuning generally increases the sensitivity metrics (negation, context, word order), but its effect on tokenization robustness is inconsistent.

Numbers: Observed gains for most instruction-tuned models across several scores; tokenization shows no reliable trend (Figures 5,
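The standard errors quoted in these findings are presumably standard errors of the mean over per-example scores. As a rough sketch under that assumption, the quantity is the sample standard deviation divided by the square root of the sample size; `standard_error` is a hypothetical helper name.

```python
import math
from statistics import stdev
from typing import Sequence


def standard_error(values: Sequence[float]) -> float:
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    return stdev(values) / math.sqrt(len(values))
```

Because the error shrinks with the square root of the sample size, quadrupling the number of examples roughly halves it, which is one way to decide how many log lines to sample before trusting a score.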

Results

Sensitivity Score (negations) vs TriviaQA

Value: Strong positive correlation; tracks sqrt-like relationship

Baseline: TriviaQA accuracy

F-Bomb Toxicity Metric vs Perspective API

Value: Close correlation between fraction of toxic generations and Perspective API scores

Baseline: Perspective API >= 0.5

LRS Score vs LAMBADA

Value: Positive correlation; larger models more sensitive

Baseline: LAMBADA accuracy

Tokenization Sensitivity vs training tokens/FLOPs

Value: Negative trend: more training tokens => lower sensitivity

Baseline: OPT family (fewest tokens) is worst

Who Should Care

What To Try In 7 Days

Run the negation sensitivity test on a sample of your support transcripts to flag models that ignore factual flips.

Append a profanity trigger to real user prompts and measure whether the model mirrors profanity; prioritize fixes if it does.

Replace the sentences that precede a passage in your product FAQs with unrelated ones and measure how much the model's predictions change, to check whether it actually uses long-range context.
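The profanity check in this list could be prototyped along the following lines. This is a sketch under stated assumptions: `generate` is a hypothetical callable wrapping whatever model or API you use, `bad_words` is any lowercase word list you supply (the paper uses the LDNOOBW list), and matching on whole lowercase words is a simplification.

```python
import re
from typing import Callable, Iterable, Set


def profanity_rate(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    bad_words: Set[str],
    trigger: str,
) -> float:
    """Fraction of prompts whose completion contains a listed word after the trigger is appended."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    hits = 0
    for prompt in prompts:
        # `generate` is assumed to return only the completion, not the prompt.
        completion = generate(f"{prompt.rstrip()} {trigger}")
        words = set(re.findall(r"[a-z']+", completion.lower()))
        if words & bad_words:
            hits += 1
    return hits / len(prompts)
```

A rate near zero suggests the model does not mirror the appended profanity; a high rate marks prompts worth reviewing by hand.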

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Negation metric assumes corpus has many true factual sentences; results are less meaningful on fiction or neutral text.
  • Toxicity test focuses on explicit profanity and uses a predefined bad-words list; it may miss nuanced toxicity.
  • Tokenization test only simulates character-split errors; other tokenization quirks are untested.
  • Sensitivity scores can be affected by model output entropy and by memorized training data overlaps.
  • Small models may be too insensitive or too noisy for these metrics to be informative.

When Not To Use

  • When you need human-graded ground-truth labels for nuanced judgments.
  • On corpora dominated by neutral or fictional sentences where negation flips are not meaningful.
  • For very small models where entropy or brittleness makes sensitivity scores unreliable.

Failure Modes

  • Model entropy (over- or under-confident outputs) can mask true sensitivity.
  • Memorization of training data can make a model appear invariant even when behavior is incorrect.
  • Instruction finetuning can change calibration; naive normalization may not fully correct this.
  • Benchmarks with narrow or simple sentence forms (e.g., TriviaQA) can hide structural failures.

Core Entities

Models

  • Pythia
  • GPT-2
  • GPT-J
  • GPT-Neo
  • OPT
  • LLaMA
  • MPT
  • Dolly
  • Vicuna
  • WizardLM
  • OpenAI API (ada, davinci, text-ada-001)
  • Cohere (command)

Metrics

  • Sensitivity Score (negations)
  • F-Bomb Toxicity Metric
  • LRS Score (long-range context)
  • Word Order Score (swap JSD)
  • Tokenization Sensitivity (broken-token JSD)
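Two of the metrics above are reported as a JSD, that is, a Jensen-Shannon divergence between the model's output distributions on original and transformed text. The sketch below computes it for two discrete distributions over the same vocabulary. Using base-2 logarithms (so the value lies between 0 and 1) is an assumption made here, not something the summary specifies.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Symmetric divergence between two probability distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # The mixture m is positive wherever p or q is positive, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Identical distributions give 0 and distributions with no overlap give 1, so a higher score means the transformation changed the model's prediction more.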

Datasets

  • Wikipedia
  • TriviaQA
  • LAMBADA
  • LDNOOBW profanity list

Benchmarks

  • TriviaQA
  • LAMBADA
  • Perspective API (toxicity)