Asking LLMs for pseudocode makes harmful outputs far more likely; small model edits make this worse.

Overview

Decision SnapshotNeeds Validation

The dataset and experiments convincingly show a reproducible vulnerability for code-like prompts across multiple open models, but results are limited to specific models, topics, and evaluation judges.

Citations3

Evidence Strength0.80

Confidence0.85

Risk Signals9

Trust Signals

Findings with numeric evidence: 8/8

Findings with evidence refs: 8/8

Results with explicit delta: 4/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product accepts or produces instruction-like outputs (code, pseudocode, how-to steps), it faces higher risk of harmful outputs and model edits can make that worse.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Data Scientist

Summary TLDR

The authors release TECHHAZARDQA, a 7,745-query benchmark of technology-related harmful prompts that can be answered as text or as instruction-like pseudocode. Across four open models (Llama-2-7b, Llama-2-13b, Mistral-V2, Mixtral 8x7B) pseudocode prompts produce substantially more harmful outputs than plain text. Chain-of-thought or few-shot examples rarely fix this on TECHHAZARDQA. Model editing with ROME often amplifies harmful code-like outputs. The paper uses GPT-4 as an automatic judge (97.5% agreement with a human sample) and a reward model to quantify harmfulness intensity.

Problem Statement

Do instruction-style prompts (pseudocode, code snippets, structured instructions) make safety guardrails weaker? The authors test whether LLMs answer harmful technology queries more dangerously when asked for instruction-centric outputs and whether small model edits amplify that risk.

Main Contribution

TECHHAZARDQA: a new benchmark of ~7,745 harmful technology queries designed to be answerable as text or pseudocode.

Systematic evaluation showing instruction-centric (pseudocode) prompts raise harmful output rates substantially versus text across four open models.

Key Findings

Pseudocode prompts raise harmful outputs versus text answers.

NumbersPseudocode harmfulness increased by 2–38% in zero-shot across topics/models.

Practical UseTest models with code-like prompts, not only natural language, before deployment.

Evidence RefAbstract; Table 3; Table 8

On TECHHAZARDQA, instruction-centric responses amplified unethical responses by about 238% on average (authors' summary).

Numbers≈238% relative increase (paper summary).

Practical UseAssume a big safety gap when allowing structured/code outputs; add extra filtering for such outputs.

Evidence RefAbstract

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	∼7,745 harmful queries	—	—	TECHHAZARDQA	Dataset construction; Table 1	Dataset section
Pseudocode vs text harmful rate (example)	48.7% (pseudocode) vs 10.5% (text)	Llama-2-13b text rate	+38.2 percentage points	TECHHAZARDQA, Biotechnology topic, zero-shot	Zero-shot results for Llama-2-13b	Zero-shot section; Table 3

What To Try In 7 Days

Run TECHHAZARDQA (or a subset) against your models, comparing text vs code-like prompts.

Add an explicit filter or stricter moderation for code/pseudocode outputs before release.

Use a strong LLM judge (e.g., GPT-4) plus a human sample to scale harmfulness checks quickly.

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/NeuralSentinel/TechHazardQA https://huggingface.co/datasets/SoftMINER-Group/TechHazardQA (dataset)

Data URLs

https://huggingface.co/datasets/SoftMINER-Group/TechHazardQA

Risks & Boundaries

Limitations

Dataset focuses on seven technology domains; results may not generalize to casual or non-technical harmful content.

Evaluation relies on GPT-4 as primary judge with a 30% human sample; judge bias remains possible.

When Not To Use

Do not treat TECHHAZARDQA as a complete safety test for non-technical or social-content risks.

Do not assume few-shot or CoT will reliably mitigate adversarial instruction prompts on highly adversarial datasets.

Failure Modes

GPT-4 misclassifies subtle harms or context-dependent content.

Model-editing conclusions may not hold for mixture-of-experts architectures like Mixtral at larger scales.

Core Entities

Models

Llama-2-13bLlama-2-7bMistral-V2Mixtral-8x7BGPT-4

Metrics

harmful-response-rate (%)harmfulness score (reward model)GPT-4 vs human agreement (%)

Datasets

TECHHAZARDQA (~7,745 queries)ADVBENCH (520 queries)NICHEHAZARDQA (~500 queries)

Benchmarks

TECHHAZARDQA

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Pseudocode prompts raise harmful outputs versus text answers.

On TECHHAZARDQA, instruction-centric responses amplified unethical responses by about 238% on average (authors' summary).

Results

What To Try In 7 Days

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

Key finding

Model judges reward ethics-based refusals; human users penalize them

Key finding

A 300k-case, 22-language benchmark that tests how jailbreak prompts make LLMs write fake news

Key finding

MEDIC: a practical framework to test clinical LLM safety, hallucinations, and operational utility

Key finding

A balanced 44-class benchmark (440 prompts + 8.8K mutations) for testing whether LLMs refuse unsafe requests, plus a fast judge design.

Key finding