Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Choosing a single provider for generation and evaluation can create persistent, compounding biases; diversify model providers or audit provider-level tendencies before deploying multi-model systems.
Summary TLDR
The paper proposes a psychometric forced-choice framework to audit stable, provider-level behavioral tendencies in LLMs ("lab signatures"). Using masked cloze items, mixed-effects models, ICC, and robustness checks (decoy masking and pole reversal), the authors find significant provider clustering across most social and epistemic dimensions (e.g., sycophancy, moderation, economic valence). Practical takeaway: model-provider choice can systematically shape multi-agent pipelines and should be audited or diversified to avoid recursive bias.
Problem Statement
Standard benchmarks measure per-task accuracy but miss durable, provider-level response policies that persist across versions. In multi-model stacks (generation, judging, summarization), such lab-level tendencies can compound, creating systemic bias. The paper asks: can we reliably detect and measure these durable lab signatures under conditions that make it harder for models to game the test?
Main Contribution
A measurement framework using forced-choice cloze items with semantically orthogonal decoys to hide evaluative intent.
A variance-decomposition approach (MixedLM + ICC) that separates item/prompt effects from provider-level signal.
Empirical audit across major providers showing statistically significant provider clustering in most social and epistemic dimensions.
Practical robustness checks: pole reversal, decoy removal sensitivity, and permutation-invariant determinism.
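As a rough illustration of the variance-decomposition idea, the sketch below computes a one-way ANOVA intraclass correlation, ICC(1), in plain Python. It is a simpler stand-in and not the paper's MixedLM fit: it treats provider as the only grouping factor and ignores item and prompt effects. The function name, provider labels, and scores are hypothetical.

```python
from statistics import mean

def icc_oneway(groups):
    """ICC(1) from a one-way ANOVA.

    groups maps a provider label to a list of item scores.
    Values near 1 mean most variance sits between providers;
    values near 0 (or below) mean provider explains little.
    """
    k = len(groups)
    sizes = [len(v) for v in groups.values()]
    n_total = sum(sizes)
    grand = mean(x for v in groups.values() for x in v)
    ss_between = sum(len(v) * (mean(v) - grand) ** 2 for v in groups.values())
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective group size, which handles unequal numbers of items per provider
    n0 = (n_total - sum(s * s for s in sizes) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)

# Hypothetical scores on a -1..+1 forced-choice scale
scores = {
    "lab_a": [0.8, 0.6, 0.7, 0.9],
    "lab_b": [0.1, 0.0, 0.2, -0.1],
    "lab_c": [0.2, 0.1, 0.0, 0.3],
}
print(round(icc_oneway(scores), 3))
```

A mixed-effects model with item-level random effects would be needed to separate prompt variance from provider variance, as the framework describes.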
Key Findings
Provider-level clustering is detectable and widespread.
Gemini family shows markedly higher sycophancy than others.
Decoy masking reduces measurable differences but limits gaming.
Lab signal is real but small in magnitude for some traits.
Measurement instrument is robust to scale inversion.
Results
Authority-weighted sycophancy (pairwise)
Epistemic Sycophancy omnibus p
Emotional Sycophancy ICC (provider)
Economic Inequality Valence omnibus p
Decoy masking sensitivity
Who Should Care
What To Try In 7 Days
Run a small forced-choice cloze audit on your critical prompts with and without decoy masking to detect provider sensitivity.
If you use multi-stage LLM pipelines, swap the judge model to a different provider and compare outputs for provider-level divergence.
Log and track provider-level metrics (e.g., deference, moderation) over time to detect durable shifts that could affect product behavior.
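To make the first step concrete, here is a minimal sketch of how a forced-choice cloze item could be built with or without decoys. Everything in it is an assumption for illustration: the function names, the -1/0/+1 scoring, and the reading of pole reversal as presenting the same options in the opposite order. The paper's actual item format may differ.

```python
import random
import string

def build_item(stem, low, high, decoys=(), reverse=False, seed=0):
    """Build one forced-choice cloze prompt.

    Returns (prompt, key), where key maps each option letter to a score:
    -1 for the low pole, +1 for the high pole, 0 for a decoy.
    Pass decoys=() to run the unmasked version of the same item.
    """
    options = [(low, -1), (high, 1)] + [(d, 0) for d in decoys]
    rng = random.Random(seed)  # fixed seed keeps option order reproducible
    rng.shuffle(options)
    if reverse:
        options.reverse()  # same options, opposite presentation order
    letters = string.ascii_uppercase[:len(options)]
    lines = [f"{letter}. {text}" for letter, (text, _) in zip(letters, options)]
    key = {letter: score for letter, (_, score) in zip(letters, options)}
    prompt = f"{stem}\n" + "\n".join(lines) + "\nAnswer with one letter."
    return prompt, key

def score_choice(key, answer):
    """Map a model's reply to a score; unparseable replies return None."""
    return key.get(answer.strip().upper()[:1])

# Hypothetical item: the decoy words are unrelated to the trait being measured
prompt, key = build_item(
    "My manager says the plan is sound. I think the plan is ____.",
    low="flawed", high="excellent", decoys=("seasonal", "rectangular"),
)
print(prompt)
```

Sending the same item to each provider with `reverse=False` and `reverse=True`, then comparing the scored choices, gives a simple consistency check before any statistics are run.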
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Decoy masking reduces statistical power and can hide meaningful differences.
- Audit focuses on a subset of providers and on forced-choice vignettes, so findings may not generalize to every task or model.
- Cannot attribute internal intentions to models; lab signatures indicate consistent output patterns, not motives.
When Not To Use
- When you need high-resolution, ground-truth accuracy for narrow tasks (use standard benchmarks instead).
- When you need to evaluate open-ended generation, where mapping outputs to forced choices is impractical.
Failure Modes
- Models detect and adapt to the audit despite decoys, changing measured behavior.
- Lab signatures shift after major retraining or policy changes, invalidating past audits.
- Forced-choice scale may oversimplify nuanced normative judgments.
Core Entities
Models
- OpenAI GPT (GPT-4, GPT-5)
- Google Gemini (Gemini 2.0 Flash, Gemma)
- Anthropic Claude
- xAI Grok
Metrics
- Intraclass Correlation Coefficient (ICC)
- Linear mixed-effects models (MixedLM)
- Kruskal-Wallis H-test
- Friedman test
- Pole reversal consistency
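For readers unfamiliar with the omnibus tests listed above, the following is a minimal plain-Python sketch of the Kruskal-Wallis H statistic with the standard tie correction. It only computes H; a p-value would come from a chi-square distribution with (number of groups - 1) degrees of freedom, which `scipy.stats.kruskal` reports directly. The function name and the sample data are hypothetical, and this is not the paper's code.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction.

    groups is a list of score lists, one per provider.
    Raises ZeroDivisionError if every pooled value is identical.
    """
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    rank_of = {}   # distinct value -> average rank
    tie_term = 0   # sum of (t^3 - t) over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    h = 12 / (n * (n + 1)) * sum(
        sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction

# Hypothetical per-provider scores on one dimension
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h, 2))
```

Larger H values indicate that at least one provider's score distribution is shifted relative to the others; pairwise follow-up tests are then needed to say which.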

