Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
22
Why It Matters For Business
High scores on closed multiple-choice tests do not guarantee user satisfaction. Combine closed tests with multi-turn open-ended evaluations, and use an LLM judge such as GPT-4 to estimate real-user preference faster and more cheaply than human rating alone.
Summary TLDR
SuperCLUE is a Chinese LLM benchmark built to predict real user preference. It bundles a user battle dataset (CArena, 9.9k votes), an open-ended test (OPEN, 600 single- and multi-turn items), and a matched closed-ended test (CLOSE). Key findings: GPT-4 outperforms the other tested models; GPT-4 judgments agree with human raters at roughly 80%; closed-format accuracy alone is a poor predictor of user preference; combining CLOSE with the multi-turn part of OPEN gives the best correlation with real user ratings.
Problem Statement
Standard benchmarks mostly use closed multiple-choice questions and accuracy, but real users interact in open, multi-turn ways. This paper asks: do closed tests predict what users prefer, and can a mixed benchmark plus automatic judging better forecast real user ratings?
Main Contribution
Built SuperCLUE from three complementary parts: CArena (real user votes), OPEN (open-ended single- and multi-turn questions), and CLOSE (closed questions derived from OPEN).
Collected and annotated CArena: 9.9k user votes and a 10-category capability taxonomy for Chinese user queries.
Showed GPT-4 can be used as an automatic judge in Chinese: GPT-4 vs human agreement ≈ 80% on OPEN.
Demonstrated that closed-format accuracy alone is a weak predictor of user preference and that combining CLOSE and OPEN-MULTI improves correlation with user ratings.
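To make the judge-versus-human comparison concrete, here is a minimal sketch of one way to measure agreement: the share of items on which the automatic judge and a human rater give the same verdict. The verdict labels, the example data, and the function name `agreement_rate` are assumptions for illustration; the source reports roughly 80% agreement but does not spell out this exact computation.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge and the human give the same verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labelled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical verdicts for five items (not data from the paper).
judge = ["win", "tie", "loss", "win", "win"]
human = ["win", "tie", "win", "win", "loss"]

rate = agreement_rate(judge, human)
print(f"judge/human agreement: {rate:.0%}")
```

A matching rate like this is easy to read, but it treats every disagreement as equally bad; a correlation over numeric ratings is an alternative when raters give scores instead of verdicts.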
Key Findings
CArena contains 9.9k real user votes used as the gold standard for user preference.
GPT-4 strongly outperforms other tested models on both open and closed tasks.
GPT-4 judgments agree with human raters on OPEN at roughly 80%.
Closed-ended accuracy does not reliably predict open-ended or user-preferred performance.
Combining CLOSE and OPEN-MULTI gives the best match to user votes.
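The findings above rest on correlating per-model benchmark scores with per-model user preference. The sketch below shows how Pearson and Spearman correlations can be computed from two score lists using only the standard library. The helper names and the example numbers are hypothetical and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-model scores (not values from the paper).
close_accuracy = [0.9, 0.6, 0.7, 0.5]
user_preference = [0.8, 0.7, 0.5, 0.4]

rho = spearman(close_accuracy, user_preference)
print(f"Spearman correlation: {rho:.2f}")
```

Spearman only looks at the ordering of models, so it is the more natural choice when the question is whether a benchmark ranks models the way users do.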
Results
Accuracy
GPT-4 win&tie rate on OPEN_ALL
MiniMax OPEN_ALL win&tie rate
CArena user votes collected
Agreement GPT-4 vs humans on OPEN (Pearson)
Correlation CLOSE vs OPEN_SINGLE
Best linear combo predicting CArena
Who Should Care
What To Try In 7 Days
Run a small CArena-style A/B with target users to collect real preference votes (even hundreds of votes help).
Evaluate models on a short OPEN multi-turn set and a matched CLOSE set, then compute a weighted combination (≈0.5 CLOSE + 0.5 OPEN-MULTI) to predict user preference.
Pilot GPT-4 as an automatic rater on a human-reviewed sample to confirm judge alignment before scaling automated evaluation.
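The second step above can be sketched as follows: turn battle outcomes into a win-and-tie rate, then blend it with CLOSE accuracy using the roughly equal weights mentioned in the list. The model names, outcome labels, scores, and function names are hypothetical placeholders, and the 0.5 default is only the approximate weighting quoted above, not a tuned value.

```python
def win_and_tie_rate(outcomes):
    """Share of battles the model did not lose ('win', 'tie' or 'loss')."""
    if not outcomes:
        raise ValueError("need at least one battle outcome")
    return sum(o in ("win", "tie") for o in outcomes) / len(outcomes)


def combined_score(close_accuracy, open_multi_rate, w_close=0.5):
    """Linear blend of CLOSE accuracy and OPEN multi-turn win-and-tie rate."""
    if not 0.0 <= w_close <= 1.0:
        raise ValueError("w_close must be between 0 and 1")
    return w_close * close_accuracy + (1.0 - w_close) * open_multi_rate


# Hypothetical evaluation results for two candidate models.
models = {
    "model_a": {"close": 0.70, "open_multi": ["win", "tie", "loss", "win"]},
    "model_b": {"close": 0.80, "open_multi": ["loss", "loss", "tie", "win"]},
}

scores = {
    name: combined_score(m["close"], win_and_tie_rate(m["open_multi"]))
    for name, m in models.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

In this made-up example the model with the lower CLOSE accuracy ranks first once multi-turn outcomes are included, which is the kind of reordering the combined score is meant to surface. If you have your own preference votes, refit the weight against them instead of keeping 0.5.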
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Model coverage incomplete: several commercial Chinese models lacked weight access and were partially evaluated via API.
- OPEN has 600 items; useful but limited for niche capabilities.
- CLOSE questions are generated by GPT-3.5 and then corrected by humans, a process that may introduce artifacts.
- CArena users are self-selected and may not represent broad demographics.
When Not To Use
- Do not rely only on CLOSE accuracy when selecting models for conversational or multi-turn products.
- Avoid using GPT-4 as sole judge for math-heavy or strictly factual grading without human checks.
- CArena-based conclusions may not generalize to user groups not represented on the LangYa platform.
Failure Modes
- The GPT-4 judge may carry verbosity and position biases despite mitigation, and its grading is unreliable on specialized math.
- CLOSE questions can compress rich open answers into a single 'correct' option, missing quality differences.
- Fixed combination weights are fit to the CArena population and may stop predicting preference well if the user base shifts.
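One common way to detect position bias in a pairwise judge is to ask it twice with the answer order swapped and keep only consistent verdicts. The sketch below illustrates that check with stub judges. It is an assumption about how such a check could look, not a description of the mitigation used in SuperCLUE, and the `judge` callable with its `'first'`/`'second'`/`'tie'` return values is a hypothetical interface.

```python
def judge_with_swap(judge, question, answer_a, answer_b):
    """Query the judge in both answer orders; return 'a', 'b', 'tie',
    or 'inconsistent' when the two verdicts disagree."""
    first_pass = judge(question, answer_a, answer_b)
    second_pass = judge(question, answer_b, answer_a)
    # Map the swapped-order verdict back to the original order.
    flipped = {"first": "second", "second": "first", "tie": "tie"}[second_pass]
    if first_pass != flipped:
        return "inconsistent"
    return {"first": "a", "second": "b", "tie": "tie"}[first_pass]


# Stub judges standing in for an LLM call (hypothetical behaviour).
def position_biased_judge(question, first, second):
    return "first"  # always prefers whichever answer is shown first


def verbosity_biased_judge(question, first, second):
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


short, long_ = "Yes.", "Yes, and here is a much longer explanation."
biased_verdict = judge_with_swap(position_biased_judge, "q", short, long_)
verbose_verdict = judge_with_swap(verbosity_biased_judge, "q", short, long_)
print(biased_verdict, verbose_verdict)
```

The swap exposes the position-biased stub as inconsistent, but the verbosity-biased stub passes the check and still prefers the longer answer. Order swapping alone therefore does not address verbosity bias, which is one reason to keep human spot checks.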
Core Entities
Models
- GPT-4
- Claude-instant-v1
- RWKV-world-7B
- ChatGLM-130B
- ChatGLM2-6B
- Wenxin Yiyan
- MOSS
- Ziya-13B
- 360 Brain
- SparkDesk
- MiniMax
Metrics
- Accuracy
- win_and_tie_rate
- Elo
- Pearson_correlation
- Spearman_correlation
Datasets
- SuperCLUE
- CArena
- OPEN
- CLOSE
Benchmarks
- CLUE
- MT-bench
- MMLU
- Big-Bench
- C-Eval

