A module-oriented survey that maps safety risks, defenses, and benchmarks across input, model, toolchain, and output components

Overview

Decision SnapshotNeeds Validation

The taxonomy and benchmark catalog are practically useful for audits and product planning, but the paper is a survey (no new tooling). Benchmarks and CVE lists must be kept current; deployers should combine these recommendations with live security practices.

Citations13

Evidence Strength0.70

Confidence0.80

Risk Signals8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/3

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, Zhixing Tan, Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li

Links

Abstract / PDF

Why It Matters For Business

Mapping risks to system modules lets teams prioritize fixes (input guards, data curation, toolchain hardening, output filters) and reduce privacy, legal, and outage risks.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Data Scientist CEO

Summary TLDR

This paper is a focused survey that organizes safety and security risks of LLM systems by module (input, language model, toolchain, output). It lists 12 risks and 44 subtopics, maps concrete mitigation patterns (e.g., input guards, differential privacy, RLHF, provenance, output detectors and watermarking), and reviews common benchmarks for robustness, hallucination, toxicity, privacy, and bias. The goal is practical: help engineers locate root causes and pick module-level defenses and assessments.

Problem Statement

LLM safety work is scattered across content-level metrics and ad hoc fixes. Engineers lack a compact, module-oriented view that maps each risk to the system component (input, model, toolchain, output), the standard mitigations, and the benchmarks needed to test them.

Main Contribution

A module-oriented taxonomy that links risks to four system modules: input, language model, toolchain, and output.

A systematic survey of mitigation strategies organized by module (35 sub-techniques across 12 defenses).

Key Findings

LLM risks are multi-source and map cleanly to system modules.

Numberstaxonomy: 4 modules, 12 risks, 44 sub-topics

Practical UseWhen an issue appears (e.g., a data leak), inspect input, model, and toolchain separately to pick targeted defenses instead of only tuning the model.

Evidence RefFigure 3 and text (Section III–IV)

Toxic content exists in large pretraining corpora; simple corpus contamination is measurable.

NumbersLLaMA2 pretraining ~0.2% documents labeled toxic

Practical UseAdd toxicity scanning and targeted filtering during data curation; expect small fractions of toxic documents to still affect outputs.

Evidence RefSection IV, Toxic Training Data (LLaMA2 citation [4])

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
toxic documents in pretraining corpus (LLaMA2)	≈0.2% of documents labeled toxic	—	—	LLaMA2 pretraining corpora (cited)	LLaMA2 dataset analysis reported ~0.2% toxic docs	Section IV.B (Toxic Training Data) citing [4]
adversarial examples catalogued	583,884 adversarial examples in PromptBench	—	—	PromptBench	PromptBench includes 583,884 adversarial examples across granularities	Section VI.A (PromptBench) citing [397]

What To Try In 7 Days

Run a red-team suite of adversarial prompts (e.g., samples from PromptBench) against your chat flow.

Add an input safeguard: format-enforced prompts + a lightweight prompt classifier.

Scan training/fine-tuning corpora for duplicates, PII and toxic text; apply selective sanitization/deduplication.

Reproducibility

Code AvailableNo

Data AvailableNo

Open Source StatusNo

LicenseUnknown

Risks & Boundaries

Limitations

Survey paper — no new technical defenses or experiments.

Benchmarks evolve rapidly; catalog may miss newest datasets or attacks.

When Not To Use

As a substitute for a formal security audit of code, infra, or RLHF pipelines.

To claim that any single mitigation will eliminate risk without empirical validation.

Failure Modes

Taxonomy becomes outdated as new jailbreaks and CVEs appear.

Benchmarks can have blind spots (judge bias, dataset leakage, limited domain coverage).

Core Entities

Models

GPT-4ChatGPTGPT-3.5GPT-3LLaMALLaMA2Flan-T5BLOOMGPT-NeoAlpacaVicunaBard

Metrics

adversarial robustnessout-of-distribution robustnesstruthfulnesshallucination ratetoxicitybiasprivacy leakage

Datasets

PromptBenchAdvGLUEANLIGLUE-XBOSSTruthfulQAHaDesWikibroMed-HALTHaluEvalLevy/HoltREALTOXICITYPROMPTSCommonClaimHateXplainTOXIGENCOLDSafetyPromptsCValuesFaiRLLMBOLDStereoSetHOLISTICBIASCDial-Bias

Benchmarks

PromptBenchAdvGLUETruthfulQAHaDesMed-HALTHaluEvalREALTOXICITYPROMPTSTOXIGENBOSSGLUE-XBOLDStereoSet

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

LLM risks are multi-source and map cleanly to system modules.

Toxic content exists in large pretraining corpora; simple corpus contamination is measurable.

Results

What To Try In 7 Days

Reproducibility

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

Key finding

Model judges reward ethics-based refusals; human users penalize them

Key finding

A 300k-case, 22-language benchmark that tests how jailbreak prompts make LLMs write fake news

Key finding

MEDIC: a practical framework to test clinical LLM safety, hallucinations, and operational utility

Key finding

A balanced 44-class benchmark (440 prompts + 8.8K mutations) for testing whether LLMs refuse unsafe requests, plus a fast judge design.

Key finding