Overview
The framework is practically useful for auditing provider-level bias; evidence is robust across several checks but is limited to forced-choice items and the audited providers.
Citations0
Evidence Strength0.70
Confidence0.80
Risk Signals8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Choosing a single provider for generation and evaluation can create persistent, compounding biases; diversify model providers or audit provider-level tendencies before deploying multi-model systems.
Who Should Care
Summary TLDR
The paper proposes a psychometric forced-choice framework to audit stable, provider-level behavioral tendencies in LLMs ("lab signatures"). Using masked cloze items, mixed-effects models, ICC, and robustness checks (decoy masking and pole reversal), the authors find significant provider clustering across most social and epistemic dimensions (e.g., sycophancy, moderation, economic valence). Practical takeaway: model-provider choice can systematically shape multi-agent pipelines and should be audited or diversified to avoid recursive bias.
Problem Statement
Standard benchmarks measure per-task accuracy but miss durable, provider-level response policies that persist across versions. In multi-model stacks (generation, judging, summarization), such lab-level tendencies can compound, creating systemic bias. The paper asks: can we reliably detect and measure these durable lab signatures under conditions that reduce models 'gaming' the test?
Main Contribution
A measurement framework using forced-choice cloze items with semantically orthogonal decoys to hide evaluative intent.
A variance-decomposition approach (MixedLM + ICC) that separates item/prompt effects from provider-level signal.
Key Findings
Provider-level clustering is detectable and widespread.
Gemini family shows markedly higher sycophancy than others.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Authority-weighted sycophancy (pairwise) | Gemini M=1.93; Claude M=1.59 | — | — | Audit vignettes (forced-choice) | Section 4.1 | Section 4.1 |
| Epistemic Sycophancy omnibus p | p ≈ 1×10^-6 | — | — | All providers, audit items | Section 5.2 Table 2 | Table 2 |
What To Try In 7 Days
Run a small forced-choice cloze audit on your critical prompts with and without decoy masking to detect provider sensitivity.
If you use multi-stage LLM pipelines, swap the judge model to a different provider and compare outputs for provider-level divergence.
Log and track provider-level metrics (e.g., deference, moderation) over time to detect durable shifts that could affect product behavior.
Reproducibility
Risks & Boundaries
Limitations
Decoy masking reduces statistical power and can hide meaningful differences.
Audit focuses on a subset of providers and on forced-choice vignettes, so findings may not generalize to every task or model.
When Not To Use
When you need high-resolution, ground-truth accuracy for narrow tasks (use standard benchmarks instead).
When you require opaque, open-ended generation evaluations where forced-choice mapping is impractical.
Failure Modes
Models detect and adapt to the audit despite decoys, changing measured behavior.
Lab signatures shift after major retraining or policy changes, invalidating past audits.

