SuperCLUE: open + closed Chinese tests plus a user arena to predict what real users prefer

July 27, 2023 (7 min)

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and immediately usable; results rely on a substantial CArena vote set and show strong GPT-4/human agreement, but some model access gaps and judge caveats remain.

Citations: 22

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/7

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Liang Xu, Anqi Li, Lei Zhu, Hang Xue, Changtai Zhu, Kangkang Zhao, Haonan He, Xuanwei Zhang, Qiyue Kang, Zhenzhong Lan


Why It Matters For Business

Closed multiple-choice scores do not guarantee user satisfaction; combine closed tests with multi-turn open evaluations and use an LLM judge like GPT-4 to estimate real-user preference faster and cheaper.

Who Should Care

Summary TLDR

SuperCLUE is a Chinese LLM benchmark built to predict real user preference. It bundles a user battle dataset (CArena, 9.9k votes), an open-ended test (OPEN, 600 single/multi-turn items) and a matched closed-ended test (CLOSE). Key findings: GPT-4 dominates; GPT-4 judgments align ~80% with humans; closed-format accuracy alone poorly predicts user preference; combining CLOSE and OPEN-multi-turn gives the best correlation with real user ratings.

Problem Statement

Standard benchmarks mostly use closed multiple-choice questions and accuracy, but real users interact in open, multi-turn ways. This paper asks: do closed tests predict what users prefer, and can a mixed benchmark plus automatic judging better forecast real user ratings?

Main Contribution

Built SuperCLUE from three complementary parts: CArena (real user votes), OPEN (open-ended single- and multi-turn questions), and CLOSE (closed questions derived from OPEN).

Collected and annotated CArena: 9.9k user votes and a 10-category capability taxonomy for Chinese user queries.

Key Findings

CArena contains 9.9k real user votes used as the gold standard for user preference.

Numbers: 9.9k votes (Section 3, CArena)

Practical Use: If you want to measure real user preference, collect direct user votes like CArena rather than relying only on synthetic benchmarks.

Evidence Ref: CArena description, Section 3
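Elo is listed among the metrics, and pairwise votes like those in CArena are commonly turned into ratings with it. The sketch below uses the standard Elo update; the K-factor of 32, the base rating of 1000, the vote format, and the model names are assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_from_votes(votes, k=32.0, base=1000.0):
    """Replay votes in order and return a {model: rating} dict.

    votes: iterable of (model_a, model_b, outcome) with outcome in
    {"a", "b", "tie"}. This tuple layout is hypothetical.
    """
    ratings = defaultdict(lambda: base)
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, outcome in votes:
        s_a = score_for_a[outcome]
        e_a = expected_score(ratings[model_a], ratings[model_b])
        # Zero-sum update: what A gains, B loses.
        ratings[model_a] += k * (s_a - e_a)
        ratings[model_b] -= k * (s_a - e_a)
    return dict(ratings)


# Placeholder votes, not CArena data.
example_votes = [
    ("model_x", "model_y", "a"),
    ("model_y", "model_z", "tie"),
    ("model_x", "model_z", "a"),
]
example_ratings = elo_from_votes(example_votes)
```

Sequential Elo depends on vote order, so with only a few hundred votes it may be worth shuffling the votes several times and averaging the resulting ratings.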

GPT-4 strongly outperforms other tested models on both open and closed tasks.

Numbers: CLOSE 70.67% accuracy; OPEN_ALL win&tie 94.64% (Table 2)

Practical Use: Use GPT-4 as a high-quality reference or judge when benchmarking Chinese LLMs; expect a large performance gap vs Chinese-oriented models.

Evidence Ref: Table 2
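For readers unfamiliar with the win&tie metric, here is a minimal sketch of how such a rate could be computed. It assumes each OPEN comparison is labeled "win", "tie", or "loss" from the evaluated model's side; the label names and the function are illustrative, and this summary does not specify the paper's exact scoring procedure.

```python
def win_and_tie_rate(outcomes):
    """Percentage of comparisons the evaluated model won or tied.

    outcomes: list of "win" / "tie" / "loss" labels (hypothetical format).
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    favorable = sum(1 for o in outcomes if o in ("win", "tie"))
    return 100.0 * favorable / len(outcomes)


# Placeholder labels, not benchmark data.
rate = win_and_tie_rate(["win", "tie", "loss", "win"])
```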

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 70.67% | | | CLOSE | Table 2 reports GPT-4 CLOSE accuracy 70.67 | Table 2
GPT-4 win&tie rate on OPEN_ALL | 94.64% | | | OPEN_ALL | Table 2 reports GPT-4 OPEN_ALL 94.64 | Table 2

What To Try In 7 Days

Run a small CArena-style A/B with target users to collect real preference votes (even hundreds of votes help).

Evaluate models on a short OPEN multi-turn set and a matched CLOSE set; compute a weighted combination (≈0.5 CLOSE + 0.5 OPEN-MULTI) to predict user preference.

Pilot GPT-4 as an automatic rater on a human-reviewed sample to confirm judge alignment before scaling automated evaluation.
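The weighted-combination step above can be sketched as follows. The code blends per-model CLOSE and OPEN multi-turn scores with equal weights and checks how well the blend rank-correlates with user ratings. It assumes both scores are on the same 0 to 100 scale. The helper names and all numbers are placeholders, not results from the paper.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def combined_score(close, open_multi, w_close=0.5):
    """Weighted blend of CLOSE and OPEN multi-turn scores per model."""
    return [w_close * c + (1.0 - w_close) * o
            for c, o in zip(close, open_multi)]


# Made-up scores for three hypothetical models.
close_scores = [70.0, 50.0, 40.0]
open_multi_scores = [90.0, 30.0, 60.0]
user_ratings = [1200.0, 950.0, 1000.0]

blend = combined_score(close_scores, open_multi_scores)
rho_close = spearman(close_scores, user_ratings)
rho_blend = spearman(blend, user_ratings)
```

With only a handful of models the correlation is very noisy, so treat it as a sanity check rather than a selection criterion.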

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Model coverage is incomplete: weights for several commercial Chinese models were not accessible, and those models were only partially evaluated via API.

OPEN has 600 items; useful but limited for niche capabilities.

When Not To Use

Do not rely only on CLOSE accuracy when selecting models for conversational or multi-turn products.

Avoid using GPT-4 as sole judge for math-heavy or strictly factual grading without human checks.

Failure Modes

The GPT-4 judge may carry verbosity and position biases despite mitigation, and its verdicts are unpredictable on specialized math.

CLOSE questions can compress rich open answers into a single 'correct' option, missing quality differences.
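Two simple checks can help quantify the judge risks listed above before automated grading is scaled up: the agreement rate between judge and human labels on a reviewed sample, and a position-swap consistency test. The sketch below assumes verdicts are recorded as plain strings. Swapping the answer order is a common mitigation for position bias, used here as an assumption; this summary does not describe the paper's own mitigation.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge and the human picked the same label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and equal in length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def position_consistent(verdict_ab, verdict_ba):
    """True if the judge prefers the same answer after the order is swapped.

    verdict_ab: verdict with answer A shown first, in
    {"first", "second", "tie"}; verdict_ba: same pair with B shown first.
    """
    flipped = {"first": "second", "second": "first", "tie": "tie"}
    return verdict_ab == flipped[verdict_ba]


# Placeholder labels, not data from the paper.
judge = ["a", "b", "tie", "a", "b"]
human = ["a", "b", "a", "a", "b"]
sample_agreement = agreement_rate(judge, human)
```

Pairs that fail the position check are natural candidates for human review or for being scored as ties.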

Core Entities

Models

GPT-4, Claude-instant-v1, RWKV-world-7B, ChatGLM-130B, ChatGLM2-6B, Wenxin Yiyan, MOSS, Ziya-13B, 360 Brain, SparkDesk, MiniMax

Metrics

Accuracy, win_and_tie_rate, Elo, Pearson_correlation, Spearman_correlation

Datasets

SuperCLUE, CArena, OPEN, CLOSE

Benchmarks

CLUE, MT-bench, MMLU, Big-Bench, C-Eval