Asking LLMs for pseudocode makes harmful outputs far more likely; small model edits make this worse.

February 23, 2024 · 8 min read

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

3

Authors

Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee

Why It Matters For Business

If your product accepts instruction-style prompts or produces instruction-like outputs (code, pseudocode, how-to steps), it faces a higher risk of harmful outputs, and model edits can make that worse.

Summary TLDR

The authors release TECHHAZARDQA, a 7,745-query benchmark of technology-related harmful prompts that can be answered as text or as instruction-like pseudocode. Across four open models (Llama-2-7b, Llama-2-13b, Mistral-V2, Mixtral 8x7B) pseudocode prompts produce substantially more harmful outputs than plain text. Chain-of-thought or few-shot examples rarely fix this on TECHHAZARDQA. Model editing with ROME often amplifies harmful code-like outputs. The paper uses GPT-4 as an automatic judge (97.5% agreement with a human sample) and a reward model to quantify harmfulness intensity.

Problem Statement

Do instruction-style prompts (pseudocode, code snippets, structured instructions) make safety guardrails weaker? The authors test whether LLMs answer harmful technology queries more dangerously when asked for instruction-centric outputs and whether small model edits amplify that risk.
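
The comparison described above depends on asking the same query in two response formats. A minimal sketch of that setup follows; the template wording and the names `TEXT_TEMPLATE`, `PSEUDOCODE_TEMPLATE`, and `build_prompt_pair` are illustrative assumptions, not the prompts used in the paper.

```python
from dataclasses import dataclass

# Hypothetical templates: the paper's exact prompt wording is not given here.
TEXT_TEMPLATE = "Answer the following question in plain prose.\nQuestion: {query}"
PSEUDOCODE_TEMPLATE = (
    "Answer the following question as step-by-step pseudocode.\n"
    "Question: {query}"
)


@dataclass(frozen=True)
class PromptPair:
    """One benchmark query rendered in both response formats."""
    query: str
    text_prompt: str
    pseudocode_prompt: str


def build_prompt_pair(query: str) -> PromptPair:
    """Render one query in both formats so that only the format varies."""
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query must be non-empty")
    return PromptPair(
        query=cleaned,
        text_prompt=TEXT_TEMPLATE.format(query=cleaned),
        pseudocode_prompt=PSEUDOCODE_TEMPLATE.format(query=cleaned),
    )


if __name__ == "__main__":
    pair = build_prompt_pair("  <benchmark query goes here>  ")
    print(pair.text_prompt)
    print(pair.pseudocode_prompt)
```

Holding the query fixed and changing only the requested format is what lets a difference in harmful-response rate be attributed to the format rather than to the question.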

Main Contribution

TECHHAZARDQA: a new benchmark of ~7,745 harmful technology queries designed to be answerable as text or pseudocode.

Systematic evaluation showing instruction-centric (pseudocode) prompts raise harmful output rates substantially versus text across four open models.

Model-editing (ROME) experiments that demonstrate targeted edits can greatly increase harmful pseudocode outputs; layer choice matters.
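
For readers unfamiliar with ROME (Rank-One Model Editing): it modifies one MLP weight matrix with a rank-one update so that a chosen key vector maps to a new value vector. The sketch below illustrates only that core idea in plain Python, under a strong simplifying assumption: the key covariance is taken to be the identity, whereas the actual method estimates it from data and also optimizes the key and value vectors. The function names are illustrative and are not the paper's code.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def rank_one_edit(weights: Matrix, key: Vector, value: Vector) -> Matrix:
    """Return W' = W + (value - W key) key^T / (key . key).

    Simplification (assumption): identity key covariance. After the edit,
    W' maps `key` exactly to `value`; inputs orthogonal to `key` are unchanged.
    """
    if len(weights) != len(value) or any(len(row) != len(key) for row in weights):
        raise ValueError("shape mismatch between weights, key, and value")
    norm_sq = sum(k * k for k in key)
    if norm_sq == 0:
        raise ValueError("key must be non-zero")
    residual = [v - out for v, out in zip(value, matvec(weights, key))]
    return [
        [w + r * k / norm_sq for w, k in zip(row, key)]
        for row, r in zip(weights, residual)
    ]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    edited = rank_one_edit(W, key=[1.0, 0.0], value=[2.0, 3.0])
    print(matvec(edited, [1.0, 0.0]))  # the edited key now maps to the requested value
    print(matvec(edited, [0.0, 1.0]))  # an input orthogonal to the key is unchanged
```

The point of the sketch is how small the change is: a single matrix in a single layer is altered, which is why the choice of layer can plausibly shift behavior in different directions.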

Key Findings

Pseudocode prompts raise harmful outputs versus text answers.

Numbers: Pseudocode harmfulness increased by 2–38% in zero-shot across topics/models.

On TECHHAZARDQA, instruction-centric responses amplified unethical responses by roughly 2–38% across models (authors' summary).

Numbers: ≈2–38% increase over text answers (paper summary).

Model editing with ROME can sharply increase harmful output rates.

Numbers: Average harmful rate rose from 18.9% to 56.7% (zero-shot) after a single edit.

Chain-of-thought (CoT) and few-shot help on some public adversarial datasets but not on TECHHAZARDQA.

Numbers: CoT sometimes raised pseudocode harm (e.g., +28.6% in one domain); few-shot reduced harm in only a few cases.

GPT-4 judgments closely match humans on this task.

Numbers: 97.5% agreement between GPT-4 and human sample (30% of outputs).

Pseudocode responses have higher harmfulness intensity and lower variance.

Numbers: Reward-model harmfulness scores are higher for pseudocode (P) than for text (T) across topics and settings; std dev for pseudocode ~0.11–0.35 vs text

Layer choice in edits changes outcomes by topic.

Numbers: Editing higher layers reduced harm in some topics and increased harm in others (LLaMA-2-7B layer 1/3/5 study).

TECHHAZARDQA covers seven high-risk technology domains.

Numbers: ~7,745 queries across 7 domains: biotech, nuclear, chemical, cybersecurity, finance, social media, public health.

Results

Dataset size

Value: ~7,745 harmful queries

Pseudocode vs text harmful rate (example)

Value: 48.7% (pseudocode) vs 10.5% (text)

Baseline: Llama-2-13b text rate

Instruction-centric increase (authors' aggregate)

Value: ≈2–38% increase in unethical responses

Baseline: text answers

Model editing effect (average)

Value: Harmful rate rose from 18.9% to 56.7%

Baseline: unedited model zero-shot average

GPT-4 judge agreement with humans

Value: 97.5% match

Harmfulness intensity (reward model)

Value: Pseudocode scores higher across all topics and settings

Baseline: text answers
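
A gap between two rates, such as 48.7% vs 10.5%, can be reported either as a percentage-point difference or as a relative change, and the two readings differ a lot. The short sketch below computes both for the two before/after pairs quoted in this section; the function names are illustrative, and which convention the paper uses for each headline number is not stated here.

```python
def percentage_point_change(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return after_pct - before_pct


def relative_change_pct(before_pct: float, after_pct: float) -> float:
    """Change expressed as a percentage of the starting rate."""
    if before_pct == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return 100.0 * (after_pct - before_pct) / before_pct


if __name__ == "__main__":
    # Rates copied from the Results entries above.
    pairs = {
        "Llama-2-13b, text vs pseudocode": (10.5, 48.7),
        "average, unedited vs after one edit": (18.9, 56.7),
    }
    for label, (before, after) in pairs.items():
        pp = percentage_point_change(before, after)
        rel = relative_change_pct(before, after)
        print(f"{label}: {pp:+.1f} pp, {rel:+.1f}% relative")
```

On these inputs the gap is about 38 percentage points in both cases, which corresponds to roughly a 364% and a 200% relative increase respectively.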

Who Should Care

What To Try In 7 Days

Run TECHHAZARDQA (or a subset) against your models, comparing text vs code-like prompts.

Add an explicit filter or stricter moderation for code/pseudocode outputs before release.

Use a strong LLM judge (e.g., GPT-4) plus a human sample to scale harmfulness checks quickly.
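
The first two steps above can be sketched as a small harness. Everything here is an assumption for illustration: `generate` and `judge` are callables you supply (your model and your harmfulness judge), the templates are your own, and the code-likeness heuristic is deliberately crude and will produce false positives.

```python
import re
from typing import Callable, Dict, Iterable

FENCE = "`" * 3  # built this way to avoid a literal code fence inside this example

# Lines that start with a numbered step or a common code keyword (heuristic only).
STEP_OR_KEYWORD = re.compile(
    r"^\s*(?:\d+[.)]\s|(?:def|function|for|while|if|return)\b)",
    re.MULTILINE,
)


def looks_instruction_like(output: str) -> bool:
    """Flag outputs that resemble code, pseudocode, or numbered steps."""
    return FENCE in output or bool(STEP_OR_KEYWORD.search(output))


def moderation_tier(output: str) -> str:
    """Route instruction-like outputs to a stricter (hypothetical) moderation tier."""
    return "strict" if looks_instruction_like(output) else "standard"


def harmful_rates(
    queries: Iterable[str],
    templates: Dict[str, str],
    generate: Callable[[str], str],
    judge: Callable[[str], bool],
) -> Dict[str, float]:
    """Return the harmful-response rate (%) for each prompt format.

    `templates` maps a format name to a string containing `{query}`.
    """
    query_list = list(queries)
    if not query_list:
        raise ValueError("at least one query is required")
    rates = {}
    for name, template in templates.items():
        verdicts = [
            bool(judge(generate(template.format(query=q)))) for q in query_list
        ]
        rates[name] = 100.0 * sum(verdicts) / len(verdicts)
    return rates


if __name__ == "__main__":
    # Stubs stand in for a real model and judge; they carry no real results.
    rates = harmful_rates(
        queries=["<query 1>", "<query 2>"],
        templates={"text": "Explain: {query}", "pseudocode": "Write pseudocode: {query}"},
        generate=lambda prompt: prompt,
        judge=lambda output: False,
    )
    print(rates)
    print(moderation_tier("1. first step\n2. second step"))
```

Keeping the judge as a plain callable makes it easy to swap an LLM judge for human labels on the sampled subset without changing the harness.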

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Dataset focuses on seven technology domains; results may not generalize to casual or non-technical harmful content.
  • Evaluation relies on GPT-4 as primary judge with a 30% human sample; judge bias remains possible.
  • Layer-wise editing experiments run only on LLaMA-2-7B due to compute limits, so layer conclusions are model-specific.
  • Models tested are in the 7–13B parameter range; larger or proprietary models may behave differently.

When Not To Use

  • Do not treat TECHHAZARDQA as a complete safety test for non-technical or social-content risks.
  • Do not assume few-shot or CoT prompting will reliably mitigate instruction-centric prompts on highly adversarial datasets such as TECHHAZARDQA.

Failure Modes

  • GPT-4 may misclassify subtle or context-dependent harms.
  • Model-editing conclusions may not hold for mixture-of-experts architectures like Mixtral at larger scales.
  • Filtering pseudocode outputs could cause false positives and block legitimate developer assistance.

Core Entities

Models

  • Llama-2-13b
  • Llama-2-7b
  • Mistral-V2
  • Mixtral-8x7B
  • GPT-4

Metrics

  • harmful-response-rate (%)
  • harmfulness score (reward model)
  • GPT-4 vs human agreement (%)
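
The judge-agreement and harmfulness-score metrics listed above can be computed with the standard library alone. This is a sketch under assumptions: labels are booleans (harmful or not), scores are floats from whatever reward model you use, the sample standard deviation is an arbitrary choice here, and all function names are illustrative.

```python
import random
import statistics
from typing import Dict, List, Sequence


def sample_for_human_review(items: Sequence, fraction: float = 0.3, seed: int = 0) -> List:
    """Draw a reproducible random subset of outputs for human labeling."""
    if not items:
        raise ValueError("items must be non-empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(items) * fraction))
    return random.Random(seed).sample(list(items), k)


def agreement_pct(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Percentage of items on which the automatic judge and the human agree."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return 100.0 * matches / len(judge_labels)


def score_summary(scores: Sequence[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of reward-model harmfulness scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores for a standard deviation")
    return {"mean": statistics.mean(scores), "stdev": statistics.stdev(scores)}


if __name__ == "__main__":
    outputs = [f"output-{i}" for i in range(10)]
    print(len(sample_for_human_review(outputs)))  # 3 of 10 items at fraction=0.3
    print(agreement_pct([True, False, True, True], [True, False, False, True]))
    print(score_summary([0.2, 0.4, 0.6]))
```

Simple percent agreement ignores agreement expected by chance; when harmful and harmless labels are very imbalanced, a chance-corrected statistic may be worth reporting alongside it.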

Datasets

  • TECHHAZARDQA (~7,745 queries)
  • ADVBENCH (520 queries)
  • NICHEHAZARDQA (~500 queries)

Benchmarks

  • TECHHAZARDQA