SecQA: a compact multiple-choice benchmark to test LLM knowledge of computer security

Overview

Decision Snapshot: Needs Validation

The dataset is useful for quick domain checks but is small, generated by GPT-4, and therefore not decisive for production readiness without further, independent validation.

Citations: 6

Evidence Strength: 0.30

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 40%

Novelty: 40%

Authors

Zefang Liu

Links

Abstract / PDF / Data

Why It Matters For Business

SecQA gives a quick, domain-specific check of LLM security knowledge. Use it to benchmark models before deploying them on security tasks and to spot when open models need domain tuning or retrieval augmentation.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Engineering Lead

Summary TLDR

SecQA is a two-version, multiple-choice dataset built from a modern computer-security textbook and generated with GPT-4. It provides small dev/val splits and larger test splits (v1 test=110, v2 test=100) to measure LLM accuracy on security knowledge. Evaluations show GPT-3.5/GPT-4 score near 99% on v1 and ~98% on v2, while open-source models vary widely. Because questions were generated with GPT-4, benchmark leakage and limited challenge for top models are important caveats.

Problem Statement

There is no compact, security-focused multiple-choice benchmark to quickly measure how well LLMs understand computer security. Existing general benchmarks miss domain nuances. The paper aims to create a concise, textbook-based QA set to diagnose LLMs' security knowledge and compare models under 0-shot and 5-shot settings.

Main Contribution

Created SecQA, a focused multiple-choice dataset for computer security with two difficulty tiers (v1: foundational, v2: advanced).

Generated questions with GPT-4 via two custom GPT agents (Cyber Quizmaster and Cyber Quizmaster Pro) and hand-refined them.

Key Findings

GPT-3.5-Turbo and GPT-4 achieve near-perfect accuracy on SecQA v1 and very high on v2.

Numbers: SecQA v1: GPT-3.5 99.1% (0-shot and 5-shot); GPT-4 99.1% (0-shot) / 100% (5-shot). SecQA v2: GPT-3.5 98.0%; GPT-4 98.0%.

Practical Use: Don't expect SecQA v1/v2 to expose weaknesses in top proprietary LLMs; use it to verify basic domain coverage, not as a stress test for GPT-4-class models.

Evidence Ref: Table 3

Open-source LLMs show large, inconsistent gaps versus closed models on security QA.

Numbers: Examples (SecQA v1, 0-shot): Llama-2-7B 72.7%, Llama-2-13B 49.1%, Mistral 90.9%.

Practical Use: Expect wide quality differences when using open models; plan domain tuning or retrieval for production security tasks.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-3.5-Turbo 99.1% (SecQA v1, 0/5-shot)	—	—	SecQA v1	Table 3 reports 99.1% for GPT-3.5-Turbo on SecQAv1	Table 3
Accuracy	GPT-4 99.1% (0-shot) / 100.0% (5-shot) on SecQA v1	—	0.9 pp increase 0→5-shot	SecQA v1	Table 3 shows 99.1%→100.0% for GPT-4	Table 3
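On a test split this small, a percentage is easier to judge as a question count. The sketch below converts the reported accuracies back into counts, assuming the v1 test size of 110 stated in the summary and that reported percentages are rounded to one decimal place. The helper name `correct_count` is hypothetical.

```python
# Sketch: convert reported accuracies back to question counts.
# Assumptions: the v1 test split has 110 questions (per the summary above)
# and reported percentages are rounded to one decimal place.

def correct_count(accuracy_pct: float, n_questions: int) -> int:
    """Return the number of correct answers closest to the reported accuracy."""
    return round(accuracy_pct / 100 * n_questions)


V1_TEST_SIZE = 110

gpt4_zero_shot = correct_count(99.1, V1_TEST_SIZE)
gpt4_five_shot = correct_count(100.0, V1_TEST_SIZE)

# Prints: 109 110 1
print(gpt4_zero_shot, gpt4_five_shot, gpt4_five_shot - gpt4_zero_shot)
```

Under these assumptions, the 0.9 pp gain from 0-shot to 5-shot corresponds to a single additional question answered correctly.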

What To Try In 7 Days

Run SecQA v1 and v2 against candidate models to compare baseline security knowledge.

If open models score poorly, run small-scale fine-tuning or add a retrieval layer and re-evaluate.

Treat GPT-4 results cautiously; add held-out, human-written questions to test leakage.
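The first step above can be sketched as a small scoring loop. This is a minimal sketch, not the paper's evaluation code: the item keys (`question`, `A` to `D`, `answer`), the four-option format, the prompt wording, and the `ask` callable are all assumptions, and the placeholder items are made up purely to show the data shape. Map the real dataset's column names onto these keys before use.

```python
import re
from typing import Callable, Dict, List, Optional, Sequence

CHOICES = ("A", "B", "C", "D")  # assumption: four options per question

Item = Dict[str, str]


def format_question(item: Item, include_answer: bool = False) -> str:
    """Render one item as text; few-shot examples include the answer letter."""
    lines = [f"Question: {item['question']}"]
    lines += [f"{letter}. {item[letter]}" for letter in CHOICES]
    lines.append(f"Answer: {item['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(item: Item, shots: Sequence[Item] = ()) -> str:
    """0-shot when `shots` is empty; k-shot when it holds k solved examples."""
    blocks = [format_question(s, include_answer=True) for s in shots]
    blocks.append(format_question(item))
    return "\n\n".join(blocks)


def extract_choice(reply: str) -> Optional[str]:
    """Take the first standalone capital letter A-D; None if there is none."""
    match = re.search(r"\b([ABCD])\b", reply)
    return match.group(1) if match else None


def accuracy(items: List[Item], ask: Callable[[str], str],
             shots: Sequence[Item] = ()) -> float:
    """Fraction of items where the parsed reply equals the gold letter."""
    correct = sum(
        extract_choice(ask(build_prompt(item, shots))) == item["answer"]
        for item in items
    )
    return correct / len(items)


# Placeholder items (not from SecQA) and a stub model that always says "B".
placeholder_items = [
    {"question": "Placeholder question 1?", "A": "option one",
     "B": "option two", "C": "option three", "D": "option four",
     "answer": "B"},
    {"question": "Placeholder question 2?", "A": "option one",
     "B": "option two", "C": "option three", "D": "option four",
     "answer": "C"},
]
stub_score = accuracy(placeholder_items, ask=lambda prompt: "B")
print(stub_score)
```

Replacing the stub with a call to each candidate model, and passing the five dev examples as `shots`, would give the 0-shot and 5-shot settings described in the summary. The letter parser is deliberately naive: a reply that starts with the article "A" would be read as choice A, so constrain the model to answer with a single letter.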

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/datasets/zefang-liu/secqa

Risks & Boundaries

Limitations

Questions were generated by GPT-4 then refined; this can bias results for GPT-4-family models.

Dataset is small: dev sets are tiny (5 examples each) and test sets are modest (110 and 100).
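To see how much uncertainty a test split of about 100 questions carries, one option is a Wilson score interval on the accuracy. The sketch below is an illustration added here, not an analysis from the paper; the function name and the 95% level (`z = 1.96`) are assumptions, and the counts are derived from the reported percentages and the 110-question v1 test size.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / total + z ** 2 / (4 * total ** 2)
    ) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# 109/110 matches the reported 99.1%; 54/110 matches the reported 49.1%.
for label, correct in [("109/110", 109), ("54/110", 54)]:
    low, high = wilson_interval(correct, 110)
    print(f"{label}: {low:.3f} to {high:.3f}")
```

The interval for a mid-range score is wide at this sample size, so small differences between open models on SecQA should not be over-read.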

When Not To Use

When you need a large, robust benchmark for stress-testing model safety or adversarial resistance.

When you need open-ended or hands-on security evaluation (e.g., exploit generation, detection pipelines).

Failure Modes

High scores from GPT-4 may reflect question familiarity, not true understanding.

Few-shot prompts can both help and hurt model accuracy depending on model and examples.
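To check whether few-shot prompting helps or hurts a given model, it is more informative to compare per-question outcomes than aggregate accuracy alone. The sketch below assumes you have recorded, for the same questions in the same order, whether each answer was correct under 0-shot and under 5-shot. The function name and the toy flags are hypothetical and do not come from the paper.

```python
from typing import Dict, Sequence


def few_shot_effect(zero_shot: Sequence[bool],
                    five_shot: Sequence[bool]) -> Dict[str, int]:
    """Count questions that flipped between the 0-shot and 5-shot runs."""
    if len(zero_shot) != len(five_shot):
        raise ValueError("both runs must cover the same questions")
    pairs = list(zip(zero_shot, five_shot))
    helped = sum(1 for z, f in pairs if not z and f)  # wrong -> right
    hurt = sum(1 for z, f in pairs if z and not f)    # right -> wrong
    return {"helped": helped, "hurt": hurt, "net": helped - hurt}


# Toy correctness flags for five questions (illustrative only).
zero_flags = [True, False, True, True, False]
five_flags = [True, True, False, True, False]

# Prints: {'helped': 1, 'hurt': 1, 'net': 0}
print(few_shot_effect(zero_flags, five_flags))
```

A net change of zero can hide offsetting flips, which is why the helped and hurt counts are reported separately.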

Core Entities

Models

GPT-3.5-Turbo, GPT-4, Llama-2-7B-Chat, Llama-2-13B-Chat, Vicuna-7B-v1.5, Vicuna-13B-v1.5, Mistral-7B-Instruct-v0.2, Zephyr-7B-Beta

Metrics

Accuracy

Datasets

SecQA v1, SecQA v2

Benchmarks

SecQA

Context Entities

Datasets

MMLU, HELM, BIG-bench
