SuperCLUE: open + closed Chinese tests plus a user arena to predict what real users prefer

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and immediately usable; results rely on a substantial CArena vote set and show strong GPT-4/human agreement, but some model access gaps and judge caveats remain.

Citations: 22

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/7

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Liang Xu, Anqi Li, Lei Zhu, Hang Xue, Changtai Zhu, Kangkang Zhao, Haonan He, Xuanwei Zhang, Qiyue Kang, Zhenzhong Lan

Links

Abstract / PDF / Data

Why It Matters For Business

Closed multiple-choice scores do not guarantee user satisfaction; combine closed tests with multi-turn open evaluations and use an LLM judge like GPT-4 to estimate real-user preference faster and cheaper.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist, Founder

Summary TLDR

SuperCLUE is a Chinese LLM benchmark built to predict real user preference. It bundles a user battle dataset (CArena, 9.9k votes), an open-ended test (OPEN, 600 single- and multi-turn items), and a matched closed-ended test (CLOSE). Key findings: GPT-4 leads the tested models on both formats (70.67% CLOSE accuracy, 94.64% win&tie rate on OPEN_ALL); GPT-4 judgments agree with human judgments about 80% of the time; closed-format accuracy alone predicts user preference poorly; combining CLOSE and OPEN multi-turn scores gives the best correlation with real user ratings.

Problem Statement

Standard benchmarks mostly use closed multiple-choice questions and accuracy, but real users interact in open, multi-turn ways. This paper asks: do closed tests predict what users prefer, and can a mixed benchmark plus automatic judging better forecast real user ratings?

Main Contribution

Built SuperCLUE from three complementary parts: CArena (real user votes), OPEN (open-ended single- and multi-turn questions), and CLOSE (closed questions derived from OPEN).

Collected and annotated CArena: 9.9k user votes and a 10-category capability taxonomy for Chinese user queries.

Key Findings

CArena contains 9.9k real user votes used as the gold standard for user preference.

Numbers: 9.9k votes (Section 3, CArena)

Practical Use: If you want to measure real user preference, collect direct user votes like CArena rather than relying only on synthetic benchmarks.

Evidence Ref: CArena description, Section 3
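Elo is listed among the metrics for this benchmark, and pairwise user votes are the usual input for it. The sketch below shows one way to turn CArena-style votes into ratings. The vote format, the K-factor of 32, the starting rating of 1000, and the model names are all assumptions for illustration, not details taken from the paper.

```python
from collections import defaultdict


def elo_from_votes(votes, k=32.0, base=1000.0):
    """Compute Elo ratings from pairwise votes.

    votes: iterable of (model_a, model_b, outcome), where outcome is
    "a" (model_a preferred), "b" (model_b preferred), or "tie".
    k and base are assumed defaults, not values from the paper.
    """
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    ratings = defaultdict(lambda: base)
    for model_a, model_b, outcome in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the standard Elo logistic curve.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        actual_a = score_for_a[outcome]
        # Zero-sum update: whatever model_a gains, model_b loses.
        delta = k * (actual_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


# Hypothetical votes; the model names are placeholders.
votes = [
    ("model_x", "model_y", "a"),
    ("model_y", "model_z", "tie"),
    ("model_x", "model_z", "a"),
]
ratings = elo_from_votes(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on vote order, so with only a few hundred votes it may be worth shuffling the votes several times and averaging the resulting ratings.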

GPT-4 strongly outperforms other tested models on both open and closed tasks.

Numbers: CLOSE 70.67% accuracy; OPEN_ALL win&tie 94.64% (Table 2)

Practical Use: Use GPT-4 as a high-quality reference or judge when benchmarking Chinese LLMs; expect a large performance gap vs Chinese-oriented models.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	70.67%	—	—	CLOSE	Table 2 reports GPT-4 CLOSE accuracy 70.67	Table 2
GPT-4 win&tie rate on OPEN_ALL	94.64%	—	—	OPEN_ALL	Table 2 reports GPT-4 OPEN_ALL 94.64	Table 2
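For readers reproducing the second metric on their own data, a win&tie rate can be computed as the share of comparisons that end in a win or a tie. This is a minimal sketch under that assumed definition; the summary does not spell out the paper's exact protocol (for example, which opponent each answer is compared against), and the outcome labels below are hypothetical.

```python
def win_and_tie_rate(outcomes):
    """Return the percentage of comparisons that are a win or a tie.

    outcomes: list of "win", "tie", or "loss" labels, one per compared item.
    The label names are assumptions for this sketch.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    favorable = sum(1 for o in outcomes if o in ("win", "tie"))
    return 100.0 * favorable / len(outcomes)


# Placeholder labels, not data from the paper.
example = ["win", "tie", "loss", "win"]
print(win_and_tie_rate(example))  # 75.0
```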

What To Try In 7 Days

Run a small CArena-style A/B with target users to collect real preference votes (even hundreds of votes help).

Evaluate models on a short OPEN multi-turn set and a matched CLOSE set; compute a weighted combination (≈0.5 CLOSE + 0.5 OPEN-MULTI) and check how well it predicts user preference.

Pilot GPT-4 as an automatic rater on a human-reviewed sample to confirm judge alignment before scaling automated evaluation.
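The second step above can be sketched as a weighted score per model followed by a rank correlation against user ratings. The 0.5/0.5 weighting comes from the text; the per-model numbers below are made-up placeholders, and the helper names are hypothetical. Spearman correlation is implemented by hand here so the example needs no third-party packages.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            result[order[pos]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def combined_score(close_acc, open_multi, w_close=0.5):
    """Weighted mix of CLOSE accuracy and OPEN multi-turn score."""
    return w_close * close_acc + (1.0 - w_close) * open_multi


# Placeholder scores for four hypothetical models (not results from the paper).
close_acc = [70.0, 55.0, 40.0, 45.0]
open_multi = [90.0, 40.0, 50.0, 20.0]
user_rating = [1200.0, 1000.0, 1010.0, 900.0]

combined = [combined_score(c, o) for c, o in zip(close_acc, open_multi)]
print("CLOSE only :", spearman(close_acc, user_rating))
print("combined   :", spearman(combined, user_rating))
```

Comparing the two printed correlations on your own models is a quick way to see whether the mixed score tracks your users better than closed accuracy alone. With only a handful of models the correlation is noisy, so treat it as a sanity check rather than proof.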

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://www.CLUEbenchmarks.com (paper states planned release)

Risks & Boundaries

Limitations

Model coverage incomplete: several commercial Chinese models lacked weight access and were partially evaluated via API.

OPEN has 600 items; useful but limited for niche capabilities.

When Not To Use

Do not rely only on CLOSE accuracy when selecting models for conversational or multi-turn products.

Avoid using GPT-4 as sole judge for math-heavy or strictly factual grading without human checks.

Failure Modes

The GPT-4 judge may still carry verbosity and position biases despite mitigation, and its grading is unpredictable on specialized math.

CLOSE questions can compress rich open answers into a single 'correct' option, missing quality differences.
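One common way to reduce position bias in an LLM judge is to ask for a verdict in both presentation orders and keep it only when the two agree. The sketch below illustrates that idea together with a simple judge-versus-human agreement check. It is not claimed to be the paper's mitigation; `judge` is a hypothetical callable standing in for whatever model call you use, and the stub judges exist only to make the example runnable.

```python
def judge_both_orders(judge, question, answer_a, answer_b):
    """Query the judge in both orders; return "a", "b", or "tie".

    judge(question, first, second) must return "first", "second", or "tie"
    relative to the order shown. This interface is an assumption.
    """
    forward = {"first": "a", "second": "b", "tie": "tie"}
    swapped = {"first": "b", "second": "a", "tie": "tie"}
    verdict_1 = forward[judge(question, answer_a, answer_b)]
    verdict_2 = swapped[judge(question, answer_b, answer_a)]
    # Inconsistent verdicts suggest position bias, so fall back to a tie.
    return verdict_1 if verdict_1 == verdict_2 else "tie"


def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge label equals the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Stub judges for demonstration only.
def always_first(question, first, second):
    return "first"  # a fully position-biased judge


def prefers_longer(question, first, second):
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"  # verbosity-biased


print(judge_both_orders(always_first, "q", "x", "y"))  # tie
print(judge_both_orders(prefers_longer, "q", "long answer", "short"))  # a
print(agreement_rate(["a", "b", "tie", "a"], ["a", "b", "a", "a"]))  # 0.75
```

The swap catches position bias but not verbosity bias: the `prefers_longer` stub stays consistent across both orders, which is why a human-reviewed sample is still needed.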

Core Entities

Models

GPT-4, Claude-instant-v1, RWKV-world-7B, ChatGLM-130B, ChatGLM2-6B, Wenxin Yiyan, MOSS, Ziya-13B, 360 Brain, SparkDesk, MiniMax

Metrics

Accuracy, win_and_tie_rate, Elo, Pearson_correlation, Spearman_correlation

Datasets

SuperCLUE, CArena, OPEN, CLOSE

Benchmarks

CLUE, MT-bench, MMLU, Big-Bench, C-Eval

