Overview
The dataset is large and multi-dimensional, making it useful for testing cybersecurity capabilities, but it skews heavily toward Chinese and relies on LLM-based labeling and grading, so results should be manually validated before high-stakes deployment.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
SecBench provides a large, focused testbed to vet LLMs on cybersecurity tasks; use it to compare candidate models on knowledge recall and reasoning before deployment.
Summary TLDR
SecBench is a large cybersecurity benchmark built from open sources and a public question-design contest. It contains 44,823 multiple-choice questions (MCQs) and 3,087 short-answer questions (SAQs). Questions are labeled by capability level (Knowledge Retention vs. Logical Reasoning), domain (nine cybersecurity subdomains), and language (Chinese or English). GPT-4 was used to label items, and GPT-4o-mini was used to grade SAQs automatically. The authors benchmark 16 modern LLMs and report that Tencent Hunyuan-Turbo tops MCQ accuracy (94.28%), while o1-preview and o1-mini lead SAQ scores (89.24% and 87.50%). The dataset is Chinese-heavy. The authors provide an artifact link and evaluation prompts.
Problem Statement
Existing LLM benchmarks focus on general knowledge or are small in scale for cybersecurity. Prior cybersecurity datasets are limited in quantity and mainly use multiple-choice questions. There is a need for a larger, multi-form benchmark that includes short-answer questions to test reasoning and generation in cybersecurity.
Main Contribution
Released SecBench: 44,823 MCQs and 3,087 SAQs labeled by level, domain, and language.
Designed a multi-dimensional schema: two levels (Knowledge Retention, Logical Reasoning), two languages (Chinese, English), two forms (MCQ, SAQ), and nine security domains.
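The labeling schema described above can be pictured as a small record type. This is a minimal sketch, not the released file format: the class name `SecBenchItem`, its field names, and the free-form `domain` string are assumptions made for illustration, and the nine domain names are left open because they are not listed here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Level(Enum):
    KNOWLEDGE_RETENTION = "Knowledge Retention"
    LOGICAL_REASONING = "Logical Reasoning"


class Language(Enum):
    CHINESE = "Chinese"
    ENGLISH = "English"


class Form(Enum):
    MCQ = "MCQ"
    SAQ = "SAQ"


@dataclass(frozen=True)
class SecBenchItem:
    """Hypothetical in-memory record; field names are illustrative only."""
    question: str
    answer: str
    level: Level
    language: Language
    form: Form
    domain: str  # one of the nine security domains (names not given here)
    choices: Optional[Tuple[str, ...]] = None  # answer options, MCQs only

    def __post_init__(self) -> None:
        # Enforce the one structural difference between the two forms.
        if self.form is Form.MCQ and not self.choices:
            raise ValueError("MCQ items need answer choices")
        if self.form is Form.SAQ and self.choices:
            raise ValueError("SAQ items should not carry choices")
```

Typing each dimension as an enum makes it harder to mix up, say, a level with a form when slicing results later.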
Key Findings
SecBench scale and composition
Dataset language bias toward Chinese
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MCQ average correctness (top model) | 94.28% | GPT-4o 90.99% | +3.29 pp | All 44,823 MCQs (SecBench) | Table 1 reports Hunyuan-Turbo average correctness 94.28% | Table 1 |
| SAQ average score (top models) | o1-preview 89.24%; o1-mini 87.50% | GPT-4o-mini 82.49% | o1-preview +6.75 pp vs GPT-4o-mini | All 3,087 SAQs (SecBench) | Table 2 lists average SAQ scores graded by GPT-4o-mini | Table 2 |
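The Delta column is a plain difference in percentage points (pp), not a relative change. A short check of the two rows, using only the values from the table (the helper name `delta_pp` is made up for this sketch):

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(value_pct - baseline_pct, 2)


mcq_delta = delta_pp(94.28, 90.99)  # Hunyuan-Turbo vs GPT-4o on MCQs
saq_delta = delta_pp(89.24, 82.49)  # o1-preview vs GPT-4o-mini on SAQs

print(f"MCQ: +{mcq_delta:.2f} pp, SAQ: +{saq_delta:.2f} pp")
# prints "MCQ: +3.29 pp, SAQ: +6.75 pp"
```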
What To Try In 7 Days
Download the SecBench artifact and run a small subset (one domain) against your candidate models.
Use SAQs to probe reasoning and free-text generation failure modes.
Adopt an automated grading agent (e.g., GPT-4o-mini) and spot-check results manually for calibration.
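The first and third steps above could be wired up roughly as follows. This is a sketch under stated assumptions: items are plain dictionaries with hypothetical keys `domain` and `answer`, and `predict` stands in for whatever function calls your candidate model. None of these names come from the SecBench artifact.

```python
import random
from typing import Callable, Dict, List, Sequence

Item = Dict[str, object]  # hypothetical record layout, not the official schema


def filter_domain(items: Sequence[Item], domain: str) -> List[Item]:
    """Keep only the items tagged with one security domain."""
    return [it for it in items if it["domain"] == domain]


def mcq_accuracy(items: Sequence[Item], predict: Callable[[Item], str]) -> float:
    """Fraction of MCQs where the predicted option letter matches the key."""
    if not items:
        raise ValueError("no items to evaluate")
    hits = sum(
        predict(it).strip().upper() == str(it["answer"]).upper() for it in items
    )
    return hits / len(items)


def sample_for_spot_check(
    graded: Sequence[Item], k: int, seed: int = 0
) -> List[Item]:
    """Draw a reproducible random sample of auto-graded SAQs for human review."""
    rng = random.Random(seed)
    return rng.sample(list(graded), min(k, len(graded)))
```

Fixing the seed keeps the spot-check sample stable across reruns, so human reviewers and the grading agent are always compared on the same items.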
Risks & Boundaries
Limitations
Strong language bias: the majority of MCQs and almost all SAQs are in Chinese.
Most MCQs test knowledge retention (90.8%); fewer MCQs challenge reasoning.
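Because one level and one language dominate, a single pooled accuracy mostly reflects the majority slice. One way to keep the minority slices visible is to report accuracy per group alongside an unweighted (macro) average. The sketch below assumes per-item result dictionaries with a boolean `correct` field and a grouping key such as `level` or `language`; these field names are illustrative and not taken from the artifact.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by_group(records: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Accuracy computed separately for each value of records[i][key]."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for rec in records:
        totals[rec[key]] += 1
        hits[rec[key]] += int(bool(rec["correct"]))
    return {group: hits[group] / totals[group] for group in totals}


def macro_average(per_group: Mapping[str, float]) -> float:
    """Unweighted mean over groups, so small groups count as much as large ones."""
    if not per_group:
        raise ValueError("no groups to average")
    return sum(per_group.values()) / len(per_group)
```

Comparing the macro average with the pooled accuracy gives a quick sense of how much the headline number depends on the dominant group.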
When Not To Use
When you need a fully human-validated gold standard for evaluation.
When your deployment is English-only and you cannot translate the Chinese items.
Failure Modes
Grading agent may mis-score nuanced or partially correct free-text answers.
LLM-based labeling may misassign domain or difficulty, especially for ambiguous items.
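Both failure modes can be quantified on a manually reviewed sample. The sketch below is one possible check, not the authors' procedure: it measures the mean absolute gap between grader and human scores, and uses Cohen's kappa (a standard chance-corrected agreement statistic, swapped in here as an illustration) to compare LLM-assigned labels with human labels. All function and variable names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def mean_abs_gap(auto_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Average absolute difference between grader and human scores."""
    if len(auto_scores) != len(human_scores) or not auto_scores:
        raise ValueError("score lists must be non-empty and the same length")
    gaps = [abs(a - h) for a, h in zip(auto_scores, human_scores)]
    return sum(gaps) / len(gaps)


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A low kappa on domain or level labels, or a large score gap on SAQs, would suggest widening the manual review before relying on the automated numbers.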

