SecQA: a compact multiple-choice benchmark to test LLM knowledge of computer security

December 26, 2023 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.4

Cost Impact Score

0.2

Citation Count

6

Authors

Zefang Liu

Links

Abstract / PDF

Why It Matters For Business

SecQA gives a quick, domain-specific check of LLM security knowledge. Use it to benchmark models before deploying them on security tasks and to spot when open models need domain tuning or retrieval augmentation.

Summary TLDR

SecQA is a two-version, multiple-choice dataset built from a modern computer-security textbook and generated with GPT-4. It provides small dev/val splits and larger test splits (v1 test=110, v2 test=100) to measure LLM accuracy on security knowledge. Evaluations show GPT-3.5/GPT-4 score near 99% on v1 and ~98% on v2, while open-source models vary widely. Because questions were generated with GPT-4, benchmark leakage and limited challenge for top models are important caveats.

Problem Statement

There is no compact, security-focused multiple-choice benchmark to quickly measure how well LLMs understand computer security. Existing general benchmarks miss domain nuances. The paper aims to create a concise, textbook-based QA set to diagnose LLMs' security knowledge and compare models under 0-shot and 5-shot settings.

Main Contribution

Created SecQA, a focused multiple-choice dataset for computer security with two difficulty tiers (v1: foundational, v2: advanced).

Generated questions with GPT-4 via two custom GPT agents (Cyber Quizmaster and Cyber Quizmaster Pro) and hand-refined them.

Evaluated many popular LLMs (GPT-3.5, GPT-4, Llama-2, Vicuna, Mistral, Zephyr) on SecQA in 0-shot and 5-shot settings and published accuracy tables.
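The difference between the 0-shot and 5-shot settings comes down to how the prompt is assembled. The sketch below is a minimal illustration, not the paper's actual prompt template: the `MCQuestion` type, the "Question:/Answer:" layout, and the four-option A-D format are all assumptions made for the example, and the sample questions are invented rather than taken from SecQA.

```python
from dataclasses import dataclass
from typing import Sequence

LETTERS = "ABCD"  # assumption: four options per question


@dataclass
class MCQuestion:
    """Hypothetical container for one multiple-choice item."""
    question: str
    choices: Sequence[str]  # options in A-D order
    answer: str             # gold letter, e.g. "B"


def format_question(q: MCQuestion, with_answer: bool) -> str:
    """Render one item; solved examples show the answer, the target does not."""
    if len(q.choices) != len(LETTERS):
        raise ValueError("expected exactly four choices")
    lines = [f"Question: {q.question}"]
    for letter, choice in zip(LETTERS, q.choices):
        lines.append(f"{letter}. {choice}")
    lines.append(f"Answer: {q.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: MCQuestion, shots: Sequence[MCQuestion] = ()) -> str:
    """0-shot when `shots` is empty; k-shot when it holds k solved examples."""
    blocks = [format_question(s, with_answer=True) for s in shots]
    blocks.append(format_question(target, with_answer=False))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Invented items for illustration only (not from the dataset).
    shot = MCQuestion("Which protocol encrypts web traffic?",
                      ["FTP", "TLS", "ARP", "ICMP"], "B")
    target = MCQuestion("Which port does HTTPS use by default?",
                        ["21", "22", "80", "443"], "D")
    print(build_prompt(target))          # 0-shot
    print("-" * 20)
    print(build_prompt(target, [shot]))  # 1-shot
```

The five-item dev splits would be a natural source of solved examples for a 5-shot prompt, though this summary does not state that the paper used them that way.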

Key Findings

GPT-3.5-Turbo and GPT-4 achieve near-perfect accuracy on SecQA v1 and very high on v2.

Numbers: SecQA v1: GPT-3.5 99.1% (0-shot and 5-shot); GPT-4 99.1% (0-shot) / 100% (5-shot). SecQA v2: GPT-3.5 98.0%; GPT-4 98.0%

Open-source LLMs show large, inconsistent gaps versus closed models on security QA.

Numbers: Examples (SecQA v1, 0-shot): Llama-2-7B 72.7%, Llama-2-13B 49.1%, Mistral 90.9%

5-shot prompting can swing accuracy dramatically in either direction for some models.

Numbers: Llama-2-13B v1: 49.1% → 89.1% (0-shot → 5-shot); Vicuna-7B v1: 65.5% → 30.9% (a drop)

SecQA is small and textbook-derived, with test sizes: v1 test=110, v2 test=100.

Numbers: Dev/Val/Test counts: v1: dev 5, val 12, test 110; v2: dev 5, val 10, test 100
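With test sets this small, a single accuracy figure carries noticeable sampling uncertainty. As a rough illustration (not an analysis from the paper), the sketch below computes a Wilson score interval for a proportion. The function name and the default `z = 1.96` (an approximate 95% level) are choices made for this example. On the 110-question v1 test set, 49.1% corresponds to 54 correct answers and 99.1% to 109.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Counts derived from the reported percentages on the 110-item v1 test set.
    for label, correct in [("49.1%", 54), ("99.1%", 109)]:
        low, high = wilson_interval(correct, 110)
        print(f"{label}: {low:.3f} to {high:.3f}")
```

For the 54/110 case the interval spans roughly 40% to 58%, so differences of a few points between models on these test sets should be read with caution.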

Dataset generation method risks benchmark leakage because GPT-4 produced the questions.

Numbers: The authors note that GPT-4 generated the questions and admit this may limit the challenge for GPT-4

Results

Accuracy: GPT-3.5-Turbo 99.1% (SecQA v1, 0-shot and 5-shot)

Accuracy: GPT-4 99.1% (0-shot) / 100.0% (5-shot) on SecQA v1

Accuracy: GPT-3.5-Turbo 98.0% (SecQA v2, 0-shot and 5-shot)

Accuracy: Llama-2-13B-Chat 49.1% → 89.1% (SecQA v1, 0-shot → 5-shot)

Accuracy: Mistral-7B-Instruct-v0.2 90.9% (SecQA v1, 0-shot and 5-shot), 89.0% (SecQA v2, 0-shot)

Who Should Care

What To Try In 7 Days

Run SecQA v1 and v2 against candidate models to compare baseline security knowledge.

If open models score poorly, run small-scale fine-tuning or add a retrieval layer and re-evaluate.

Treat GPT-4 results cautiously; add held-out, human-written questions to test for leakage.
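A scoring loop for the steps above can be sketched as follows. Everything here is an assumption made for illustration: `ask_model` stands in for whatever client calls the candidate model (it is not a real library function), items are plain `(prompt, gold_letter)` pairs, and answers are assumed to be one of A-D. The same loop can be rerun after fine-tuning or after adding retrieval to compare scores.

```python
import re
from typing import Callable, Iterable, Optional, Tuple


def extract_choice(reply: str) -> Optional[str]:
    """Return the letter A-D if the reply starts with one, else None.

    Deliberately strict: replies that bury the letter mid-sentence are
    treated as unanswered. A reply such as "A firewall ..." would still be
    read as choice A, so instruct the model to answer with the letter only.
    """
    match = re.match(r"\s*\(?([A-D])\b", reply)
    return match.group(1) if match else None


def accuracy(items: Iterable[Tuple[str, str]],
             ask_model: Callable[[str], str]) -> float:
    """Fraction of items where the extracted letter equals the gold letter."""
    total = 0
    correct = 0
    for prompt, gold in items:
        total += 1
        if extract_choice(ask_model(prompt)) == gold:
            correct += 1
    if total == 0:
        raise ValueError("no items to score")
    return correct / total


if __name__ == "__main__":
    # Stub model that always answers "B"; replace with a real client call.
    def always_b(prompt: str) -> str:
        return "B"

    demo_items = [("prompt one", "B"), ("prompt two", "A")]
    print(accuracy(demo_items, always_b))
```

Keeping the model behind a plain callable makes it easy to score a baseline model, a tuned model, and a retrieval-augmented variant with identical prompts and parsing.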

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Questions were generated by GPT-4 then refined; this can bias results for GPT-4-family models.
  • Dataset is small: dev sets are tiny (5 examples each) and test sets are modest (110 and 100).
  • Multiple-choice format favors recognition over open-ended reasoning or code-level skills.
  • All items are textbook-derived, so real-world incident handling and adversarial scenarios are underrepresented.

When Not To Use

  • When you need a large, robust benchmark for stress-testing model safety or adversarial resistance.
  • When you need open-ended or hands-on security evaluation (e.g., exploit generation, detection pipelines).
  • When you need a leakage-free evaluation of models that may have been trained on GPT-4–produced content.

Failure Modes

  • High scores from GPT-4 may reflect question familiarity, not true understanding.
  • Few-shot prompts can both help and hurt model accuracy depending on model and examples.
  • Open-source models show inconsistent scaling (Llama-2-13B trails Llama-2-7B at 0-shot on v1); results may not generalize across security subdomains.

Core Entities

Models

  • GPT-3.5-Turbo
  • GPT-4
  • Llama-2-7B-Chat
  • Llama-2-13B-Chat
  • Vicuna-7B-v1.5
  • Vicuna-13B-v1.5
  • Mistral-7B-Instruct-v0.2
  • Zephyr-7B-Beta

Metrics

  • Accuracy

Datasets

  • SecQA v1
  • SecQA v2

Benchmarks

  • SecQA

Context Entities

Datasets

  • MMLU
  • HELM
  • BIG-bench