Overview
The benchmark is practical and immediately usable. Results rest on a substantial CArena vote set (9.9k votes) and show roughly 80% agreement between GPT-4 and human judgments, but gaps in model access and caveats about the GPT-4 judge remain.
Citations: 22
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/7
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
Closed multiple-choice scores do not guarantee user satisfaction. Combine closed tests with multi-turn open evaluations, and use an LLM judge such as GPT-4 to estimate real-user preference faster and more cheaply than human rating alone.
Who Should Care
Summary TLDR
SuperCLUE is a Chinese LLM benchmark built to predict real user preference. It bundles a user battle dataset (CArena, 9.9k votes), an open-ended test (OPEN, 600 single-turn and multi-turn items) and a matched closed-ended test (CLOSE). Key findings: GPT-4 leads all tested models on both open and closed tasks; GPT-4 judgments agree with human judgments about 80% of the time; closed-format accuracy alone is a poor predictor of user preference; combining CLOSE with OPEN multi-turn gives the best correlation with real user ratings.
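The judge/human agreement figure above can be read as a simple match rate over pairwise verdicts. The following is a minimal sketch under that assumption; the function name, the label vocabulary ("A", "B", "tie") and the sample data are hypothetical and are not taken from the paper.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the automatic judge and the human give the same verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labelled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical verdicts: "A" = model A preferred, "B" = model B preferred, "tie" = no preference.
judge = ["A", "B", "tie", "A", "B"]
human = ["A", "B", "A", "A", "B"]

rate = agreement_rate(judge, human)
print(f"judge/human agreement: {rate:.0%}")
```

On a real pilot, the two lists would hold the GPT-4 verdict and the human verdict for the same battles, in the same order.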
Problem Statement
Standard benchmarks mostly use closed multiple-choice questions and accuracy, but real users interact in open, multi-turn ways. This paper asks: do closed tests predict what users prefer, and can a mixed benchmark plus automatic judging better forecast real user ratings?
Main Contribution
Built SuperCLUE from three complementary parts: CArena (real user votes), OPEN (open-ended single-turn and multi-turn questions), and CLOSE (closed questions derived from OPEN).
Collected and annotated CArena: 9.9k user votes and a 10-category capability taxonomy for Chinese user queries.
Key Findings
CArena contains 9.9k real user votes used as the gold standard for user preference.
GPT-4 strongly outperforms other tested models on both open and closed tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GPT-4 accuracy on CLOSE | 70.67% | — | — | CLOSE | Table 2 reports GPT-4 CLOSE accuracy 70.67 | Table 2 |
| GPT-4 win&tie rate on OPEN_ALL | 94.64% | — | — | OPEN_ALL | Table 2 reports GPT-4 OPEN_ALL 94.64 | Table 2 |
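A win&tie rate like the one in the table can be computed from per-item outcomes against a reference model. This is a rough sketch assuming the metric is the percentage of items the judge scores as a win or a tie; the outcome labels and the sample list are hypothetical, and the paper's exact definition may differ.

```python
VALID_OUTCOMES = {"win", "tie", "loss"}


def win_and_tie_rate(outcomes):
    """Percentage of items judged as a win or a tie for the evaluated model."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    unknown = set(outcomes) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    counted = sum(1 for o in outcomes if o in ("win", "tie"))
    return 100.0 * counted / len(outcomes)


# Hypothetical judge verdicts for one model over four OPEN items.
sample = ["win", "tie", "loss", "win"]
score = win_and_tie_rate(sample)
print(f"win&tie rate: {score:.2f}%")
```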
What To Try In 7 Days
Run a small CArena-style A/B with target users to collect real preference votes (even hundreds of votes help).
Evaluate models on a short OPEN multi-turn set and a matched CLOSE set, then compute a weighted combination (≈0.5 CLOSE + 0.5 OPEN-MULTI) to predict user preference.
Pilot GPT-4 as an automatic rater on a human-reviewed sample to confirm judge alignment before scaling automated evaluation.
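The weighted-combination step in the list above could be checked along these lines: blend the two scores per model, then compare the resulting ranking with the ranking from real user ratings. This is a sketch, not the paper's procedure. The use of Spearman rank correlation, the model names and every number below are assumptions made for illustration.

```python
from statistics import mean


def combined_score(close_acc, open_multi, w_close=0.5):
    """Blend CLOSE accuracy and OPEN multi-turn score with weight w_close on CLOSE."""
    return w_close * close_acc + (1.0 - w_close) * open_multi


def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model data: (CLOSE accuracy, OPEN multi-turn score, user rating).
models = {
    "model_a": (70.0, 90.0, 1200),
    "model_b": (60.0, 40.0, 1000),
    "model_c": (65.0, 30.0, 950),
    "model_d": (50.0, 55.0, 1020),
}

close_only = [c for c, _, _ in models.values()]
blended = [combined_score(c, o) for c, o, _ in models.values()]
user = [u for _, _, u in models.values()]

rho_close = spearman(close_only, user)
rho_blended = spearman(blended, user)
print(f"CLOSE only vs users: {rho_close:.2f}")
print(f"blended vs users:    {rho_blended:.2f}")
```

With only a handful of models the correlation is noisy, so treat the comparison as a sanity check on the weighting rather than a result.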
Risks & Boundaries
Limitations
Model coverage incomplete: several commercial Chinese models lacked weight access and were partially evaluated via API.
OPEN has only 600 items, which is useful overall but leaves thin coverage of niche capabilities.
When Not To Use
Do not rely only on CLOSE accuracy when selecting models for conversational or multi-turn products.
Avoid using GPT-4 as sole judge for math-heavy or strictly factual grading without human checks.
Failure Modes
GPT-4 as judge may carry verbosity and position biases despite mitigation, and its judgments are unpredictable on specialized math.
CLOSE questions can compress rich open answers into a single 'correct' option, missing quality differences.

