Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
12
Why It Matters For Business
You can monitor key model behaviors on your own live or private data without building labeled test sets, enabling faster, cheaper, and continuously updated audits of knowledge, toxicity, and robustness.
Summary TLDR
The paper introduces a practical, label-free evaluation framework for large language models (LLMs). Instead of human labels, it measures how model outputs change when inputs are transformed (e.g., add a negation, append profanity, swap words, break tokenization, or replace prior sentences). These 'sensitivity' scores correlate with standard benchmarks (TriviaQA, LAMBADA, Perspective API) and let teams monitor models on their own data (e.g., live logs) without curated datasets. Code is available.
Problem Statement
Current LLM evaluations rely on small, human-labeled benchmarks that are costly to create and can leak into model training data. This makes it hard to test model behavior on realistic, changing data or in production. The paper asks: can key behaviors (knowledge, toxicity, context use, word order, tokenization robustness) be evaluated in a self-supervised way, using only transformations of raw text?
Main Contribution
A simple, general procedure to build self-supervised "sensitivity" metrics by comparing model outputs on original vs. transformed text.
Five case-study metrics: negation-based knowledge probing, profanity-triggered toxicity (F-Bomb), long-range context sensitivity (LRS), word-order sensitivity, and tokenization robustness.
Empirical validation showing these sensitivity scores correlate with standard supervised benchmarks (TriviaQA, LAMBADA, Perspective API) across many public and API models.
Practical guidance and open-source code (GitHub) so teams can run these checks on their own corpora or live traffic.
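The general procedure in the first item can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's released code: `toy_model`, `add_not`, and `total_variation` are hypothetical stand-ins, and the distance function used here is a placeholder choice that can be swapped for any other.

```python
from typing import Callable, Dict, Iterable

Dist = Dict[str, float]  # token -> probability (assumed output format)


def total_variation(p: Dist, q: Dist) -> float:
    """Half the L1 distance: 0 for identical distributions, 1 for disjoint ones."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def sensitivity(
    model: Callable[[str], Dist],
    texts: Iterable[str],
    transform: Callable[[str], str],
    distance: Callable[[Dist, Dist], float] = total_variation,
) -> float:
    """Average distance between model outputs on original and transformed text."""
    scores = [distance(model(t), model(transform(t))) for t in texts]
    if not scores:
        raise ValueError("texts must contain at least one item")
    return sum(scores) / len(scores)


# Hypothetical toy model: reacts to the word "not" and nothing else.
def toy_model(text: str) -> Dist:
    if " not " in f" {text} ":
        return {"yes": 0.2, "no": 0.8}
    return {"yes": 0.9, "no": 0.1}


# Hypothetical transformation: insert "not" after the first " is ".
def add_not(text: str) -> str:
    return text.replace(" is ", " is not ", 1)


score = sensitivity(toy_model, ["Paris is the capital of France."], add_not)
print(round(score, 3))
```

In practice `model` would wrap a real LLM call that returns next-token probabilities, and `texts` would be sentences sampled from your own corpus; no labels are needed at any step.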
Key Findings
A negation-based "Sensitivity Score" closely tracks TriviaQA accuracy across many models.
A simple F-Bomb transformation (appending profanity to the input) measures how stoic a model remains under profanity, and the resulting score correlates with Perspective API toxicity scores.
Long-range context sensitivity (LRS) correlates with LAMBADA performance; larger models are generally more context-sensitive.
Word-order sensitivity (random 1-swap) correlates with LRS and increases with model size and instruction finetuning.
Tokenization robustness improves with more training exposure; OPT models (trained on fewer tokens) are the most sensitive.
Instruction finetuning generally increases the sensitivity metrics (negation, context, word order); its effect on tokenization robustness is inconsistent and sometimes negative.
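To check a claim like "sensitivity tracks benchmark accuracy" on your own set of models, a rank correlation is one option. The sketch below uses Spearman correlation, which is an assumption here: this summary does not say which correlation statistic the paper reports. The function names are illustrative and the inputs are expected to be per-model scores you measured yourself.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(ranks(x), ranks(y))
```

Call `spearman(sensitivity_scores, benchmark_scores)` with one entry per model, in the same model order in both lists.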
Results
Sensitivity Score (negations) vs TriviaQA
F-Bomb Toxicity Metric vs Perspective API
LRS Score vs LAMBADA
Tokenization Sensitivity vs training tokens/FLOPs
Who Should Care
What To Try In 7 Days
Run the negation sensitivity test on a sample of your support transcripts to flag models that ignore factual flips.
Append a profanity trigger to real user prompts and measure whether the model mirrors profanity; prioritize fixes if it does.
Replace the prior sentences in product FAQs and measure how much the model's output changes, to check whether the model actually uses long-range context.
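The three experiments above each need a text transformation. The helpers below are a rough sketch under stated assumptions: the regex-based negation rule, the `<PROFANITY>` placeholder, and the `donor` argument are all hypothetical simplifications, and the paper's own transformations may differ in detail.

```python
import re
from typing import List


def negate(sentence: str) -> str:
    """Insert 'not' after the first is/are/was/were; unchanged if none is found.

    A naive rule chosen for illustration only.
    """
    return re.sub(r"\b(is|are|was|were)\b", r"\1 not", sentence, count=1)


def append_profanity(prompt: str, trigger: str = "<PROFANITY>") -> str:
    """Append a trigger word; pass a real entry from your bad-words list."""
    return f"{prompt} {trigger}"


def replace_context(sentences: List[str], donor: List[str]) -> List[str]:
    """Keep the final sentence and replace every earlier one with donor text.

    If donor has fewer sentences than needed, the new context is shorter.
    """
    if not sentences:
        return []
    needed = len(sentences) - 1
    return donor[:needed] + sentences[-1:]


print(negate("The warranty is valid for two years."))
print(append_profanity("Where is my order?"))
print(replace_context(["Intro.", "Details.", "Question?"], ["Other A.", "Other B."]))
```

Each helper returns the transformed input only; the score comes from comparing the model's output on the original against its output on the transformed version.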
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- The negation metric assumes the corpus contains many true factual sentences; results are less meaningful on fiction or neutral text.
- Toxicity test focuses on explicit profanity and uses a predefined bad-words list; it may miss nuanced toxicity.
- Tokenization test only simulates character-split errors; other tokenization quirks are untested.
- Sensitivity scores can be affected by model output entropy and by memorized training data overlaps.
- Small models may be too insensitive or too noisy for these metrics to be informative.
When Not To Use
- When you need human-graded ground-truth labels for nuanced judgments.
- On corpora dominated by neutral or fictional sentences where negation flips are not meaningful.
- For very small models where entropy or brittleness makes sensitivity scores unreliable.
Failure Modes
- Model entropy (over- or under-confident outputs) can mask true sensitivity.
- Memorization of training data can make a model appear invariant even when behavior is incorrect.
- Instruction finetuning can change calibration; naive normalization may not fully correct this.
- Benchmarks with narrow or simple sentence forms (e.g., TriviaQA) can hide structural failures.
Core Entities
Models
- Pythia
- GPT-2
- GPT-J
- GPT-Neo
- OPT
- LLaMA
- MPT
- Dolly
- Vicuna
- WizardLM
- OpenAI API (ada, davinci, text-ada-001)
- Cohere (command)
Metrics
- Sensitivity Score (negations)
- F-Bomb Toxicity Metric
- LRS Score (long-range context)
- Word Order Score (swap JSD)
- Tokenization Sensitivity (broken-token JSD)
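The last two metrics are labeled as JSD, that is, Jensen-Shannon divergence between output distributions. A minimal sketch of that divergence for two discrete distributions follows; representing a distribution as a token-to-probability dict and using log base 2 (so the value lies in [0, 1]) are assumptions of this example, not details confirmed by the source.

```python
import math
from typing import Dict

Dist = Dict[str, float]  # token -> probability


def kl_divergence(p: Dist, m: Dist) -> float:
    """KL(p || m) in bits; terms with p[k] == 0 contribute nothing."""
    return sum(pk * math.log2(pk / m[k]) for k, pk in p.items() if pk > 0)


def jsd(p: Dist, q: Dist) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    keys = set(p) | set(q)
    # Mixture distribution: the pointwise average of p and q.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


original = {"Paris": 0.9, "Lyon": 0.1}
after_swap = {"Paris": 0.5, "Lyon": 0.5}
print(jsd(original, after_swap))
```

Unlike plain KL divergence, JSD is symmetric and stays finite when one distribution assigns zero probability to a token the other one uses, which makes it convenient for comparing next-token distributions.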
Datasets
- Wikipedia
- TriviaQA
- LAMBADA
- LDNOOBW profanity list
Benchmarks
- TriviaQA
- LAMBADA
- Perspective API (toxicity)

