SecBench: 44.8k MCQs + 3.1k SAQs for multi-level, multi-language cybersecurity evaluation with automated LLM labeling and grading

December 30, 2024 (6 min read)

Overview

Decision Snapshot: Needs Validation

The dataset is large and multi-dimensional, making it useful for testing cybersecurity capabilities, but it is Chinese-heavy and relies on LLM-based labeling/grading, which requires manual validation before high-stakes deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, Xiapu Luo

Why It Matters For Business

SecBench provides a large, focused testbed to vet LLMs on cybersecurity tasks; use it to compare candidate models on recall and reasoning before deployment.

Summary TLDR

SecBench is a large cybersecurity benchmark built from open sources and a public question-design contest. It contains 44,823 multiple-choice questions (MCQs) and 3,087 short-answer questions (SAQs). Each question is labeled by capability level (Knowledge Retention or Logical Reasoning), domain (one of nine cybersecurity subdomains), and language (Chinese or English). GPT-4 was used to label the items, and GPT-4o-mini grades the SAQs automatically. The authors benchmark 16 LLMs and report that Tencent Hunyuan-Turbo has the highest MCQ accuracy (94.28%), while o1-preview and o1-mini lead the SAQ scores (89.24% and 87.50%). The dataset is Chinese-heavy. The authors provide an artifact link and the evaluation prompts.

Problem Statement

Existing LLM benchmarks focus on general knowledge or are small in scale for cybersecurity. Prior cybersecurity datasets are limited in quantity and mainly use multiple-choice questions. There is a need for a larger, multi-form benchmark that includes short-answer questions to test reasoning and generation in cybersecurity.

Main Contribution

Released SecBench: 44,823 MCQs and 3,087 SAQs labeled by level, domain, and language.

Designed a multi-dimensional schema: two levels (Knowledge Retention, Logical Reasoning), two languages (Chinese, English), two forms (MCQ, SAQ), and nine security domains.
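One way to picture the labeling schema is as a small record type with one field per dimension. The sketch below is illustrative only: the class and field names are hypothetical, the placeholder questions are not from the dataset, and the nine domain names are left as free strings because this summary does not list them.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    KNOWLEDGE_RETENTION = "Knowledge Retention"
    LOGICAL_REASONING = "Logical Reasoning"


class Language(Enum):
    CHINESE = "Chinese"
    ENGLISH = "English"


class Form(Enum):
    MCQ = "MCQ"
    SAQ = "SAQ"


@dataclass(frozen=True)
class SecBenchItem:
    """One question plus its labels (field names are hypothetical)."""
    question: str
    answer: str
    level: Level
    language: Language
    form: Form
    domain: str  # one of the nine security domains; names not listed in this summary


def select(items, *, level=None, language=None, form=None, domain=None):
    """Return the items that match every label that was specified."""
    wanted = {"level": level, "language": language, "form": form, "domain": domain}
    return [
        item for item in items
        if all(v is None or getattr(item, k) == v for k, v in wanted.items())
    ]


# Placeholder records, not real SecBench questions.
items = [
    SecBenchItem("placeholder question 1", "A", Level.KNOWLEDGE_RETENTION,
                 Language.CHINESE, Form.MCQ, "domain-1"),
    SecBenchItem("placeholder question 2", "free-text answer", Level.LOGICAL_REASONING,
                 Language.ENGLISH, Form.SAQ, "domain-2"),
]
english_saqs = select(items, language=Language.ENGLISH, form=Form.SAQ)
print(len(english_saqs))
```

Filtering on any combination of labels is what makes per-level, per-language, or per-domain breakdowns possible.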

Key Findings

SecBench scale and composition

Numbers: 44,823 MCQs; 3,087 SAQs

Practical Use: You can run large-scale cybersecurity tests covering both selection and free-form answers.

Evidence Ref: Abstract, §4

Dataset language bias toward Chinese

Numbers: MCQs: 80.4% Chinese; SAQs: 97.4% Chinese

Practical Use: Expect Chinese-heavy coverage; translate questions before English-only evaluations.

Evidence Ref: §4.3 (Fig.3, Fig.4)
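To get a feel for how few English items these shares imply, the percentages can be converted into approximate counts. This is a back-of-the-envelope sketch: the shares are rounded to one decimal place in the source, so the results are estimates rather than exact dataset counts, and the function name is made up for illustration.

```python
# Totals and Chinese shares as reported in this summary.
TOTALS = {"MCQ": 44_823, "SAQ": 3_087}
CHINESE_SHARE = {"MCQ": 0.804, "SAQ": 0.974}


def approx_language_counts(total, chinese_share):
    """Estimate per-language counts from a rounded percentage share."""
    chinese = round(total * chinese_share)
    return {"Chinese": chinese, "English": total - chinese}


counts = {
    form: approx_language_counts(TOTALS[form], CHINESE_SHARE[form])
    for form in TOTALS
}
for form, by_language in counts.items():
    print(form, by_language)
```

Under these assumptions the English portion comes to roughly 8,800 MCQs but only about 80 SAQs, which is too small for a reliable English-only SAQ evaluation without translation.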

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| MCQ average correctness (top model) | 94.28% | GPT-4o 90.99% | +3.29 pp | All 44,823 MCQs (SecBench) | Table 1 reports Hunyuan-Turbo average correctness 94.28% | Table 1 |
| SAQ average score (top models) | o1-preview 89.24%; o1-mini 87.50% | GPT-4o-mini 82.49% | o1-preview +6.75 pp vs GPT-4o-mini | All 3,087 SAQs (SecBench) | Table 2 lists average SAQ scores graded by GPT-4o-mini | Table 2 |
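The deltas in the results are differences in percentage points (pp), not relative improvements. A minimal check of the arithmetic, using only the figures reported above (the helper name is illustrative):

```python
def delta_pp(value, baseline):
    """Difference between two percentages, in percentage points."""
    return round(value - baseline, 2)


mcq_delta = delta_pp(94.28, 90.99)  # Hunyuan-Turbo vs GPT-4o on MCQs
saq_delta = delta_pp(89.24, 82.49)  # o1-preview vs GPT-4o-mini on SAQs
print(f"MCQ: {mcq_delta:+.2f} pp, SAQ: {saq_delta:+.2f} pp")
```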

What To Try In 7 Days

Download SecBench artifact and run a small subset (one domain) against your candidate models.

Use SAQs to probe reasoning and free-text generation failure modes.

Adopt an automated grading agent (e.g., GPT-4o-mini) and spot-check results manually for calibration.
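The first and last steps above can be sketched as a domain filter followed by a reproducible random draw of items for manual review. The record layout (`id`, `domain`) and the domain names are placeholders, since the artifact's actual file format is not described here.

```python
import random


def domain_subset(items, domain):
    """Keep only the items labeled with the given domain."""
    return [item for item in items if item["domain"] == domain]


def spot_check_sample(graded_items, k, seed=0):
    """Draw up to k items for manual review; a fixed seed keeps the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(graded_items, min(k, len(graded_items)))


# Placeholder records standing in for loaded benchmark items.
items = [{"id": i, "domain": f"domain-{i % 3}"} for i in range(30)]

subset = domain_subset(items, "domain-1")
to_review = spot_check_sample(subset, k=5)
print(len(subset), len(to_review))
```

Fixing the seed means a second reviewer can draw the same items and compare their judgments with the first reviewer's.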

Agent Features

Tool Use
GPT-4 used to label question level and domain

GPT-4o-mini used as a grading agent for SAQs

OpenCompass used for MCQ evaluation
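For readers who want to see what an MCQ correctness percentage involves, here is a minimal standalone sketch. It does not use OpenCompass and is not the authors' pipeline. It assumes options are lettered A to D, that model replies contain only option letters and separators, and that an item counts as correct only when the chosen set of letters matches the gold set exactly.

```python
import re


def parse_choices(reply):
    """Extract standalone option letters (assumed A-D) from a short reply."""
    return frozenset(re.findall(r"\b[A-D]\b", reply.upper()))


def correctness_percentage(replies, gold):
    """Percentage of items whose parsed choices exactly match the gold choices."""
    if len(replies) != len(gold) or not gold:
        raise ValueError("replies and gold must be non-empty and of equal length")
    hits = sum(parse_choices(r) == parse_choices(g) for r, g in zip(replies, gold))
    return 100.0 * hits / len(gold)


# Placeholder replies and gold answers, not real benchmark data.
replies = ["B", "a, c", "D", "C"]
gold = ["B", "A,C", "A", "C"]
score = correctness_percentage(replies, gold)
print(f"{score:.2f}%")
```

Exact set matching is a strict choice: a reply that picks one of two correct options scores zero. A partial-credit rule would give different numbers.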

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Strong language bias: 80.4% of MCQs and 97.4% of SAQs are in Chinese.

Most MCQs test knowledge retention (90.8%); fewer MCQs challenge reasoning.

When Not To Use

When you need a fully human-validated gold standard for evaluation.

When your deployment is English-only and you cannot translate the Chinese items first.

Failure Modes

Grading agent may mis-score nuanced or partially correct free-text answers.

LLM-based labeling may misassign domain or difficulty, especially for ambiguous items.
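One way to quantify the grading risk is to score a spot-check sample by hand and compare it with the automated scores. The sketch below assumes both sets of scores use the same numeric scale (a hypothetical 0 to 10 here) and that a one-point tolerance counts as agreement. Both choices are illustrative, not taken from the paper.

```python
from statistics import mean


def calibration_report(auto_scores, human_scores, tolerance=1.0):
    """Summarize how far automated scores deviate from human scores."""
    if len(auto_scores) != len(human_scores) or not auto_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [a - h for a, h in zip(auto_scores, human_scores)]
    return {
        # Average size of the disagreement, ignoring direction.
        "mean_abs_diff": mean(abs(d) for d in diffs),
        # Positive means the automated grader is more generous than humans.
        "mean_signed_diff": mean(diffs),
        # Share of items where the two scores differ by at most `tolerance`.
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / len(diffs),
    }


# Placeholder scores on a hypothetical 0-10 scale.
auto = [8, 6, 10, 3, 7]
human = [8, 4, 9, 3, 7]
report = calibration_report(auto, human)
print(report)
```

A consistently positive signed difference would suggest the grading agent over-credits partially correct answers, which is the failure mode described above.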

Core Entities

Models

GPT-4, GPT-4o, GPT-4o-mini, GPT-3.5-Turbo, o1-preview, o1-mini, Hunyuan-Turbo, Qwen2-72B-Instruct, Qwen2-7B-Instruct, DeepSeek-V3, DeepSeek-V2-Lite, Llama-3-70B-Instruct, Llama-3-8B-Instruct, Mixtral-8x7B, Yi-1.5-34B-Chat, Yi-1.5-9B-Chat

Metrics

correctness percentage, average SAQ score (percentage)

Datasets

SecBench

Benchmarks

MMLU, C-Eval, HumanEval