SuperCLUE: open + closed Chinese tests plus a user arena to predict what real users prefer

July 27, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

22

Authors

Liang Xu, Anqi Li, Lei Zhu, Hang Xue, Changtai Zhu, Kangkang Zhao, Haonan He, Xuanwei Zhang, Qiyue Kang, Zhenzhong Lan

Links

Abstract / PDF

Why It Matters For Business

Closed multiple-choice scores do not guarantee user satisfaction; combine closed tests with multi-turn open evaluations and use an LLM judge like GPT-4 to estimate real-user preference faster and cheaper.

Summary TLDR

SuperCLUE is a Chinese LLM benchmark built to predict real user preference. It bundles a user battle dataset (CArena, 9.9k votes), an open-ended test (OPEN, 600 single/multi-turn items) and a matched closed-ended test (CLOSE). Key findings: GPT-4 dominates; GPT-4 judgments align ~80% with humans; closed-format accuracy alone poorly predicts user preference; combining CLOSE and OPEN-multi-turn gives the best correlation with real user ratings.

Problem Statement

Standard benchmarks mostly use closed multiple-choice questions and accuracy, but real users interact in open, multi-turn ways. This paper asks: do closed tests predict what users prefer, and can a mixed benchmark plus automatic judging better forecast real user ratings?

Main Contribution

Built SuperCLUE from three complementary parts: CArena (real user votes), OPEN (open-ended single- and multi-turn questions), and CLOSE (closed questions derived from OPEN).

Collected and annotated CArena: 9.9k user votes and a 10-category capability taxonomy for Chinese user queries.

Showed GPT-4 can be used as an automatic judge in Chinese: GPT-4 vs human agreement ≈ 80% on OPEN.

Demonstrated closed-format accuracy alone is a weak predictor of user preference and that combining CLOSE and OPEN-multi improves correlation with user ratings.

Key Findings

CArena contains 9.9k real user votes used as the gold standard for user preference.

Numbers: 9.9k votes (Section 3, CArena)

GPT-4 strongly outperforms other tested models on both open and closed tasks.

Numbers: CLOSE 70.67% accuracy; OPEN_ALL win&tie 94.64% (Table 2)
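A win&tie rate can be read as the share of pairwise comparisons that a model did not lose. The sketch below assumes each judged comparison is stored as "win", "tie", or "loss" from the evaluated model's point of view; those label names and the sample data are illustrative, not taken from the paper.

```python
from collections import Counter


def win_and_tie_rate(outcomes):
    """Return the fraction of comparisons that ended in a win or a tie."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    counts = Counter(outcomes)
    return (counts["win"] + counts["tie"]) / len(outcomes)


# Hypothetical outcomes for one model over five judged questions.
sample = ["win", "win", "tie", "loss", "win"]
print(f"{win_and_tie_rate(sample):.2%}")  # prints 80.00%
```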

GPT-4 judgments correlate highly with human raters on OPEN.

Numbers: Pearson agreement ≈ 0.80 (80%) between GPT-4 and humans (Section 5)
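One simple way to check judge alignment on your own data is raw agreement: the share of items where the automatic judge and a human rater chose the same outcome. The sketch below shows that calculation only; it is not necessarily the statistic reported in Section 5, and the labels and sample data are hypothetical.

```python
def agreement_rate(judge_labels, human_labels):
    """Return the fraction of items on which judge and human labels match."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("no labels to compare")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical verdicts on five OPEN-style items.
judge = ["win", "tie", "loss", "win", "win"]
human = ["win", "tie", "loss", "loss", "win"]
print(agreement_rate(judge, human))  # prints 0.8
```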

Closed-ended accuracy does not reliably predict open-ended or user-preferred performance.

Numbers: Spearman ρ = 0.5150 (p = 0.1915), Pearson r = 0.5547 (p = 0.1536) between CLOSE and OPEN_SINGLE
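Both reported p-values are well above 0.05, so the moderate correlation is not statistically significant. For readers who want to run the same kind of check on their own per-model scores, here is a minimal pure-Python sketch of the two coefficients. The score lists are made up for illustration, and p-values are left out (a statistics library would be the usual route for those).

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-model scores (not the paper's numbers).
close_acc = [70.0, 55.0, 48.0, 52.0, 40.0]
open_single = [92.0, 60.0, 65.0, 45.0, 30.0]
print(round(pearson(close_acc, open_single), 4))
print(round(spearman(close_acc, open_single), 4))
```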

Combining CLOSE and OPEN-MULTI gives the best match to user votes.

Numbers: Best linear combo: CLOSE coef 0.49 + OPEN_MULTIPLE coef 0.51 → correlation 0.9397 with CArena (Table 4)
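A weighting like this can be found by sweeping a single weight w over [0, 1] and keeping the value where w·CLOSE + (1 − w)·OPEN_MULTI correlates best with user ratings. The grid search below is one possible way to do that and is an assumption of this sketch, not necessarily how the authors fitted their coefficients. The function name `best_weight` and all data are hypothetical, and both score lists are assumed to be on the same scale.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def best_weight(close, open_multi, target, steps=100):
    """Grid-search w in [0, 1] maximizing corr(w*close + (1-w)*open_multi, target)."""
    best_w, best_r = 0.0, -2.0  # -2.0 is below any possible correlation
    for i in range(steps + 1):
        w = i / steps
        combo = [w * c + (1 - w) * o for c, o in zip(close, open_multi)]
        r = pearson(combo, target)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r


# Hypothetical per-model scores; user_rating is built as an exact 50/50 blend
# so the search should recover w = 0.5.
close = [70.0, 55.0, 40.0, 60.0]
open_multi = [90.0, 40.0, 50.0, 30.0]
user_rating = [80.0, 47.5, 45.0, 45.0]
w, r = best_weight(close, open_multi, user_rating)
print(w, round(r, 4))
```

With only a handful of models, a weight fitted this way can overfit, so it is worth re-checking on fresh votes.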

Results

GPT-4 accuracy on CLOSE

Value: 70.67%

GPT-4 win&tie rate on OPEN_ALL

Value: 94.64%

MiniMax OPEN_ALL win&tie rate

Value: 57.94%

CArena user votes collected

Value: 9,900 votes

Agreement GPT-4 vs humans on OPEN (Pearson)

Value: 0.80

Correlation CLOSE vs OPEN_SINGLE

Value: Spearman 0.5150; Pearson 0.5547

Best linear combo predicting CArena

Value: CLOSE 0.49 + OPEN_MULTIPLE 0.51 → corr 0.9397

Who Should Care

What To Try In 7 Days

Run a small CArena-style A/B with target users to collect real preference votes (even hundreds of votes help).

Evaluate models on a short OPEN multi-turn set and a matched CLOSE set, then compute a weighted combination (≈0.5 CLOSE + 0.5 OPEN-MULTI) to predict user preference.

Pilot GPT-4 as an automatic rater on a human-reviewed sample to confirm judge alignment before scaling automated evaluation.
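For the A/B step, pairwise votes can be turned into a ranking with Elo ratings, which this page lists among the metrics. The sketch below uses the standard Elo update. The K-factor of 32, the starting rating of 1000, the vote tuple layout, and the model names are all assumptions made for illustration, not values from the paper.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_from_votes(votes, k=32.0, initial=1000.0):
    """Apply Elo updates in order.

    Each vote is (model_a, model_b, winner) with winner in {"a", "b", "tie"}.
    """
    outcome_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    ratings = {}
    for model_a, model_b, winner in votes:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        score_a = outcome_for_a[winner]
        exp_a = expected_score(ra, rb)
        # The two updates are equal and opposite, so total rating is conserved.
        ratings[model_a] = ra + k * (score_a - exp_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - exp_a))
    return ratings


# Hypothetical votes between two placeholder models.
votes = [
    ("model_x", "model_y", "a"),
    ("model_x", "model_y", "tie"),
    ("model_y", "model_x", "b"),
]
for name, rating in sorted(elo_from_votes(votes).items(), key=lambda kv: -kv[1]):
    print(name, round(rating, 1))
```

Sequential Elo depends on vote order, so with only a few hundred votes it can help to shuffle the votes and average over several passes.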

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Model coverage incomplete: several commercial Chinese models lacked weight access and were partially evaluated via API.
  • OPEN has 600 items; useful but limited for niche capabilities.
  • CLOSE questions are generated by GPT-3.5 and then human-corrected, which may introduce artifacts.
  • CArena users are self-selected and may not represent broad demographics.

When Not To Use

  • Do not rely only on CLOSE accuracy when selecting models for conversational or multi-turn products.
  • Avoid using GPT-4 as sole judge for math-heavy or strictly factual grading without human checks.
  • CArena-based conclusions may not generalize to user groups not represented on the LangYa platform.

Failure Modes

  • GPT-4 as judge may carry verbosity and position biases despite mitigation, and its verdicts are unpredictable on specialized math.
  • CLOSE questions can compress rich open answers into a single 'correct' option, missing quality differences.
  • Combining metrics with fixed weights may overfit to the CArena population if user base shifts.
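A common guard against position bias is to ask the judge twice with the answer order swapped and to accept a verdict only when both runs agree. The sketch below illustrates that idea; it is not claimed to be the mitigation used in the paper. The `judge` callable and its return values ("first", "second", "tie") are hypothetical stand-ins for a real LLM call.

```python
def debiased_verdict(judge, question, answer_a, answer_b):
    """Return "a", "b", or "tie" after judging in both answer orders.

    `judge(question, first, second)` must return "first", "second", or "tie".
    """
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    # Translate the swapped-order verdict back into the original order.
    backward_unswapped = {"first": "second", "second": "first", "tie": "tie"}[backward]
    if forward == backward_unswapped:
        return {"first": "a", "second": "b", "tie": "tie"}[forward]
    return "tie"  # inconsistent verdicts are treated as a tie


def always_first(question, first, second):
    """Stub judge with pure position bias: always prefers the first answer."""
    return "first"


def prefers_longer(question, first, second):
    """Stub judge that is order-independent but prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(debiased_verdict(always_first, "q", "short", "a longer answer"))    # prints tie
print(debiased_verdict(prefers_longer, "q", "short", "a longer answer"))  # prints b
```

The second stub shows what this check does not catch: a verbosity-biased judge is consistent across orders, so length bias needs a separate control such as human spot checks.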

Core Entities

Models

  • GPT-4
  • Claude-instant-v1
  • RWKV-world-7B
  • ChatGLM-130B
  • ChatGLM2-6B
  • Wenxin Yiyan
  • MOSS
  • Ziya-13B
  • 360 Brain
  • SparkDesk
  • MiniMax

Metrics

  • Accuracy
  • win_and_tie_rate
  • Elo
  • Pearson_correlation
  • Spearman_correlation

Datasets

  • SuperCLUE
  • CArena
  • OPEN
  • CLOSE

Benchmarks

  • CLUE
  • MT-bench
  • MMLU
  • Big-Bench
  • C-Eval