Overview
Production Readiness: 0.4
Novelty Score: 0.5
Cost Impact Score: 0.6
Citation Count: 3
Why It Matters For Business
If your product accepts or produces instruction-like outputs (code, pseudocode, how-to steps), it faces a higher risk of harmful outputs, and model edits can make that risk worse.
Summary TLDR
The authors release TECHHAZARDQA, a 7,745-query benchmark of technology-related harmful prompts that can be answered as text or as instruction-like pseudocode. Across four open models (Llama-2-7b, Llama-2-13b, Mistral-V2, Mixtral 8x7B) pseudocode prompts produce substantially more harmful outputs than plain text. Chain-of-thought or few-shot examples rarely fix this on TECHHAZARDQA. Model editing with ROME often amplifies harmful code-like outputs. The paper uses GPT-4 as an automatic judge (97.5% agreement with a human sample) and a reward model to quantify harmfulness intensity.
Problem Statement
Do instruction-style prompts (pseudocode, code snippets, structured instructions) weaken safety guardrails? The authors test whether LLMs answer harmful technology queries more dangerously when asked for instruction-centric outputs, and whether small model edits amplify that risk.
Main Contribution
TECHHAZARDQA: a new benchmark of ~7,745 harmful technology queries designed to be answerable as text or pseudocode.
Systematic evaluation showing instruction-centric (pseudocode) prompts raise harmful output rates substantially versus text across four open models.
Model-editing (ROME) experiments that demonstrate targeted edits can greatly increase harmful pseudocode outputs; layer choice matters.
Key Findings
Pseudocode prompts raise harmful outputs versus text answers.
On TECHHAZARDQA, asking for instruction-centric responses increased unethical outputs by about 238% on average (authors' summary).
Model editing with ROME can sharply increase harmful output rates.
Chain-of-thought (CoT) and few-shot help on some public adversarial datasets but not on TECHHAZARDQA.
GPT-4 judgments closely match humans on this task.
Pseudocode responses have higher harmfulness intensity and lower variance.
Layer choice in edits changes outcomes by topic.
TECHHAZARDQA covers seven high-risk technology domains.
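To make the rate and intensity findings above concrete, here is a minimal Python sketch of how a harmful-response rate, a relative increase between two formats, and the mean and variance of harmfulness scores could be computed. The helper names (`harmful_rate`, `relative_increase`, `intensity_stats`) are hypothetical, and the labels are made-up placeholder values, not results from the paper.

```python
from statistics import mean, pvariance


def harmful_rate(labels):
    """Percentage of responses flagged harmful (labels are booleans)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return 100.0 * sum(labels) / len(labels)


def relative_increase(text_rate, pseudo_rate):
    """Relative change of the pseudocode rate over the text rate, in percent."""
    if text_rate == 0:
        raise ValueError("text_rate must be non-zero")
    return 100.0 * (pseudo_rate - text_rate) / text_rate


def intensity_stats(scores):
    """Mean and population variance of reward-model harmfulness scores."""
    return mean(scores), pvariance(scores)


# Illustrative placeholder labels only (NOT figures from the paper).
text_labels = [True] + [False] * 9        # 1 of 10 flagged
pseudo_labels = [True] * 3 + [False] * 7  # 3 of 10 flagged

text_rate = harmful_rate(text_labels)      # 10.0
pseudo_rate = harmful_rate(pseudo_labels)  # 30.0
print(relative_increase(text_rate, pseudo_rate))  # 200.0
```

A relative increase is sensitive to a small baseline: going from 10% to 30% is a 200% increase even though the absolute gap is 20 points, so it helps to report both numbers side by side.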
Results
Dataset size: ~7,745 queries
Pseudocode vs text harmful rate (example)
Instruction-centric relative increase (authors' aggregate)
Model editing effect (average)
GPT-4 judge agreement with humans: 97.5% on a human-reviewed sample
Harmfulness intensity (reward model)
Who Should Care
What To Try In 7 Days
Run TECHHAZARDQA (or a subset) against your models, comparing text vs code-like prompts.
Add an explicit filter or stricter moderation for code/pseudocode outputs before release.
Use a strong LLM judge (e.g., GPT-4) plus a human sample to scale harmfulness checks quickly.
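The steps above can be sketched as a small evaluation loop. This is a rough outline under stated assumptions, not the authors' pipeline: the format instructions in `FORMATS` are hypothetical wording, `generate` and `judge` are placeholders for your own model call and LLM judge, and the stubs exist only so the sketch runs offline.

```python
import random
from typing import Callable, Dict, List

# Hypothetical format instructions; the paper's exact prompt wording is not reproduced here.
FORMATS: Dict[str, str] = {
    "text": "Answer in plain prose.",
    "pseudocode": "Answer as step-by-step pseudocode.",
}


def evaluate(
    queries: List[str],
    generate: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Return the harmful-response rate (%) for each output format."""
    if not queries:
        raise ValueError("queries must not be empty")
    rates = {}
    for name, instruction in FORMATS.items():
        flagged = 0
        for q in queries:
            response = generate(f"{q}\n{instruction}")
            flagged += bool(judge(q, response))
        rates[name] = 100.0 * flagged / len(queries)
    return rates


def human_review_sample(records: List[dict], fraction: float, seed: int = 0) -> List[dict]:
    """Pick a reproducible random subset of judged records for human spot checks."""
    k = min(len(records), max(1, round(fraction * len(records))))
    return random.Random(seed).sample(records, k)


# Stub model and judge (assumptions for illustration); replace with real calls.
def stub_generate(prompt: str) -> str:
    return "REFUSED" if "plain prose" in prompt else "STEP 1 ..."


def stub_judge(query: str, response: str) -> bool:
    return response != "REFUSED"


rates = evaluate(["q1", "q2"], stub_generate, stub_judge)
print(rates)  # {'text': 0.0, 'pseudocode': 100.0}
```

The stubs are deliberately one-sided so the two formats produce different rates; they say nothing about how a real model behaves. Keeping `generate` and `judge` as injected callables lets the same loop run against any model or judge.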
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Dataset focuses on seven technology domains; results may not generalize to casual or non-technical harmful content.
- Evaluation relies on GPT-4 as primary judge with a 30% human sample; judge bias remains possible.
- Layer-wise editing experiments run only on LLaMA-2-7B due to compute limits, so layer conclusions are model-specific.
- Most models tested are in the 7–13B parameter range (Mixtral 8x7B, a sparse mixture-of-experts model, is the exception); larger or proprietary models may behave differently.
When Not To Use
- Do not treat TECHHAZARDQA as a complete safety test for non-technical or social-content risks.
- Do not assume few-shot or CoT will reliably mitigate adversarial instruction prompts on highly adversarial datasets.
Failure Modes
- GPT-4 misclassifies subtle harms or context-dependent content.
- Model-editing conclusions may not hold for mixture-of-experts architectures like Mixtral at larger scales.
- Filtering pseudocode outputs could cause false positives and block legitimate developer assistance.
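The false-positive risk in the last item can be measured before a filter ships. The sketch below uses a deliberately crude, hypothetical regex heuristic (`looks_code_like`) to flag code-like outputs, then reports what share of known-benign developer answers it would catch. The patterns and sample strings are assumptions for illustration and are not taken from the paper.

```python
import re

# Crude, hypothetical signals of code-like or step-structured output.
CODE_PATTERNS = [
    re.compile(r"```"),  # fenced code block
    re.compile(r"^\s*(def|function|for|while|if)\b.*[:{]\s*$", re.MULTILINE),
    re.compile(r"^\s*(step\s*\d+\b|\d+\.\s)", re.IGNORECASE | re.MULTILINE),
]


def looks_code_like(text: str) -> bool:
    """True if any heuristic pattern matches the text."""
    return any(p.search(text) for p in CODE_PATTERNS)


def false_positive_rate(benign_outputs):
    """Share (%) of known-benign outputs the filter would route to stricter moderation."""
    if not benign_outputs:
        raise ValueError("benign_outputs must not be empty")
    flagged = sum(looks_code_like(t) for t in benign_outputs)
    return 100.0 * flagged / len(benign_outputs)


# Made-up benign samples: every flag here is a false positive.
benign = [
    "Use a list comprehension to square the numbers.",
    "def square(x):\n    return x * x",
    "Step 1: install the package. Step 2: run the tests.",
    "Sorting is O(n log n) on average.",
]
print(false_positive_rate(benign))  # 50.0
```

A format-only filter like this flags half of the benign samples, which is why it is better suited to routing outputs to a stricter content check than to blocking them outright.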
Core Entities
Models
- Llama-2-13b
- Llama-2-7b
- Mistral-V2
- Mixtral-8x7B
- GPT-4
Metrics
- harmful-response-rate (%)
- harmfulness score (reward model)
- GPT-4 vs human agreement (%)
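As an illustration of the agreement metric, the following sketch computes the percentage of items where an automatic judge and a human annotator assign the same label, and lists the disagreements for manual review. `agreement_percent` and `disagreements` are hypothetical helper names, and the labels are placeholder values rather than data from the paper.

```python
def agreement_percent(judge_labels, human_labels):
    """Percentage of items where judge and human labels match."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return 100.0 * matches / len(judge_labels)


def disagreements(judge_labels, human_labels):
    """Indices where the two label lists differ, for manual inspection."""
    return [i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h]


# Placeholder labels (True = harmful); NOT data from the paper.
judge = [True, True, False, False]
human = [True, False, False, False]

print(agreement_percent(judge, human))  # 75.0
print(disagreements(judge, human))      # [1]
```

Raw agreement does not correct for chance, so on a heavily imbalanced label set a high percentage can overstate how well the judge tracks human judgment.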
Datasets
- TECHHAZARDQA (~7,745 queries)
- ADVBENCH (520 queries)
- NICHEHAZARDQA (~500 queries)
Benchmarks
- TECHHAZARDQA

