SecBench: 44.8k MCQs + 3.1k SAQs for multi-level, multi-language cybersecurity evaluation with automated LLM labeling and grading

Overview

Decision Snapshot: Needs Validation

The dataset is large and multi-dimensional, making it useful for testing cybersecurity capabilities, but it is Chinese-heavy and relies on LLM-based labeling/grading, which requires manual validation before high-stakes deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, Xiapu Luo

Why It Matters For Business

SecBench provides a large, focused testbed to vet LLMs on cybersecurity tasks; use it to compare candidate models on recall and reasoning before deployment.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

SecBench is a large cybersecurity benchmark built from open sources and a public question-design contest. It contains 44,823 multiple-choice questions (MCQs) and 3,087 short-answer questions (SAQs). Questions are labeled by capability level (Knowledge Retention vs Logical Reasoning), domain (9 cybersecurity subdomains), and language (Chinese and English). GPT-4 was used to label items, and GPT-4o-mini was used to grade SAQs automatically. The authors benchmark 16 modern LLMs and report that Tencent Hunyuan-Turbo tops MCQ accuracy (94.28%), while o1-preview and o1-mini lead SAQ scores (89.24% and 87.50%). The dataset is Chinese-heavy. The authors provide an artifact link and the evaluation prompts.

Problem Statement

Existing LLM benchmarks focus on general knowledge or are small in scale for cybersecurity. Prior cybersecurity datasets are limited in quantity and mainly use multiple-choice questions. There is a need for a larger, multi-form benchmark that includes short-answer questions to test reasoning and generation in cybersecurity.

Main Contribution

Released SecBench: 44,823 MCQs and 3,087 SAQs labeled by level, domain, and language.

Designed a multi-dimensional schema: two levels (Knowledge Retention, Logical Reasoning), two languages (Chinese, English), two forms (MCQ, SAQ), and nine security domains.

Key Findings

SecBench scale and composition

Numbers: 44,823 MCQs; 3,087 SAQs

Practical Use: You can run large-scale cybersecurity tests covering both selection and free-form answers.

Evidence Ref: Abstract, §4

Dataset language bias toward Chinese

Numbers: MCQs: 80.4% Chinese; SAQs: 97.4% Chinese

Practical Use: Expect Chinese-heavy coverage; translate questions before English-only evaluations.

Evidence Ref: §4.3 (Fig.3, Fig.4)
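To make the language skew concrete, the sketch below converts the reported percentages into approximate item counts. The totals and shares come from the figures above; the resulting counts are estimates derived from rounded percentages, not numbers reported by the authors, and the function name is illustrative.

```python
# Totals and Chinese shares as reported in the summary above.
TOTALS = {"MCQ": 44_823, "SAQ": 3_087}
CHINESE_SHARE = {"MCQ": 0.804, "SAQ": 0.974}


def estimate_language_counts(total: int, chinese_share: float) -> dict[str, int]:
    """Approximate per-language item counts from a rounded percentage."""
    chinese = round(total * chinese_share)
    return {"chinese": chinese, "english": total - chinese}


# Estimated counts per question form (approximate, see note above).
estimates = {
    form: estimate_language_counts(TOTALS[form], CHINESE_SHARE[form])
    for form in TOTALS
}

for form, counts in estimates.items():
    print(form, counts)
```

Under these assumptions the English portion comes to roughly 8,800 MCQs and about 80 SAQs, so an English-only SAQ evaluation without translation would rest on a very small sample.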

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MCQ average correctness (top model)	94.28%	GPT-4o 90.99%	+3.29 pp	All 44,823 MCQs (SecBench)	Table 1 reports Hunyuan-Turbo average correctness 94.28%	Table 1
SAQ average score (top models)	o1-preview 89.24%; o1-mini 87.50%	GPT-4o-mini 82.49%	o1-preview +6.75 pp vs GPT-4o-mini	All 3,087 SAQs (SecBench)	Table 2 lists average SAQ scores graded by GPT-4o-mini	Table 2
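The deltas in the table are plain differences in percentage points (pp). As a quick check, the sketch below recomputes them from the reported values. The o1-mini delta is not listed in the table and is derived here from the same figures; the helper name is illustrative.

```python
def delta_pp(value: float, baseline: float) -> float:
    """Difference between two percentages, expressed in percentage points."""
    return round(value - baseline, 2)


# MCQ: Hunyuan-Turbo vs GPT-4o (Table 1 values as quoted above).
mcq_delta = delta_pp(94.28, 90.99)

# SAQ: o1-preview and o1-mini vs GPT-4o-mini (Table 2 values as quoted above).
saq_delta_o1_preview = delta_pp(89.24, 82.49)
saq_delta_o1_mini = delta_pp(87.50, 82.49)  # derived, not listed in the table

print(mcq_delta, saq_delta_o1_preview, saq_delta_o1_mini)
```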

What To Try In 7 Days

Download SecBench artifact and run a small subset (one domain) against your candidate models.

Use SAQs to probe reasoning and free-text generation failure modes.

Adopt an automated grading agent (e.g., GPT-4o-mini) and spot-check results manually for calibration.
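The steps above can be sketched as a small harness: filter one domain, draw a reproducible spot-check sample, and compare automated scores against manual ones. This is a minimal sketch under stated assumptions. The record fields (`id`, `domain`), the domain names, and the 0 to 10 score scale are hypothetical and do not reflect the actual SecBench schema or grading prompt.

```python
import random
from statistics import mean


def select_domain(items: list[dict], domain: str) -> list[dict]:
    """Keep only the items labeled with the given domain (hypothetical field)."""
    return [item for item in items if item["domain"] == domain]


def spot_check_sample(items: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample of at most k items for manual review."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def mean_abs_gap(auto_scores: list[float], human_scores: list[float]) -> float:
    """Mean absolute difference between automated and human scores."""
    if len(auto_scores) != len(human_scores) or not auto_scores:
        raise ValueError("score lists must be non-empty and the same length")
    return mean(abs(a - h) for a, h in zip(auto_scores, human_scores))


# Toy records standing in for benchmark items (placeholder domain names).
items = [
    {"id": 1, "domain": "domain_a"},
    {"id": 2, "domain": "domain_b"},
    {"id": 3, "domain": "domain_a"},
    {"id": 4, "domain": "domain_a"},
]

subset = select_domain(items, "domain_a")
review = spot_check_sample(subset, k=2)

# Hypothetical scores on a 0-10 scale: grader output vs a human reviewer.
gap = mean_abs_gap([8, 6, 10], [7, 6, 8])
print(len(subset), [item["id"] for item in review], gap)
```

A large gap on the spot-check sample is a signal to revise the grading prompt or widen manual review before trusting the automated scores.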

Agent Features

Tool Use

GPT-4 used to label question level and domain

GPT-4o-mini used as a grading agent for SAQs

OpenCompass used for MCQ evaluation

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://zenodo.org/records/14575303

https://secbench.org/

Risks & Boundaries

Limitations

Strong language bias: 80.4% of MCQs and 97.4% of SAQs are in Chinese.

Most MCQs test knowledge retention (90.8%); fewer MCQs challenge reasoning.

When Not To Use

When you need a fully human-validated gold standard for evaluation.

When your deployment is English-only and you cannot translate the Chinese items.

Failure Modes

Grading agent may mis-score nuanced or partially correct free-text answers.

LLM-based labeling may misassign domain or difficulty, especially for ambiguous items.
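One way to quantify the labeling risk is to have a human relabel a small sample and measure chance-corrected agreement with the LLM labels. The sketch below computes Cohen's kappa by hand. It is an illustration only and not part of the authors' pipeline; the `KR` and `LR` codes are shorthand introduced here for Knowledge Retention and Logical Reasoning.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two labelers, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both labelers match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both labeled independently at their own rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical category throughout
    return (observed - expected) / (1 - expected)


# Toy sample: LLM-assigned levels vs a human reviewer's levels.
llm_labels = ["KR", "KR", "LR", "KR"]
human_labels = ["KR", "LR", "LR", "KR"]

kappa = cohens_kappa(llm_labels, human_labels)
print(kappa)
```

Kappa near 1 indicates the LLM labels track the human ones closely; values near 0 mean agreement is no better than chance, which would argue for relabeling before slicing results by level or domain.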

Core Entities

Models

GPT-4, GPT-4o, GPT-4o-mini, GPT-3.5-Turbo, o1-preview, o1-mini, Hunyuan-Turbo, Qwen2-72B-Instruct, Qwen2-7B-Instruct, DeepSeek-V3, DeepSeek-V2-Lite, Llama-3-70B-Instruct, Llama-3-8B-Instruct, Mixtral-8x7B, Yi-1.5-34B-Chat, Yi-1.5-9B-Chat

Metrics

Correctness percentage, average SAQ score (percentage)

Datasets

SecBench

Benchmarks

MMLU, C-Eval, HumanEval
