A module-oriented survey that maps safety risks, defenses, and benchmarks across input, model, toolchain, and output components

January 11, 20247 min

Overview

Decision SnapshotNeeds Validation

The taxonomy and benchmark catalog are practically useful for audits and product planning, but the paper is a survey (no new tooling). Benchmarks and CVE lists must be kept current; deployers should combine these recommendations with live security practices.

Citations13

Evidence Strength0.70

Confidence0.80

Risk Signals8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/3

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, Zhixing Tan, Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li

Links

Abstract / PDF

Why It Matters For Business

Mapping risks to system modules lets teams prioritize fixes (input guards, data curation, toolchain hardening, output filters) and reduce privacy, legal, and outage risks.

Who Should Care

Summary TLDR

This paper is a focused survey that organizes safety and security risks of LLM systems by module (input, language model, toolchain, output). It lists 12 risks and 44 subtopics, maps concrete mitigation patterns (e.g., input guards, differential privacy, RLHF, provenance, output detectors and watermarking), and reviews common benchmarks for robustness, hallucination, toxicity, privacy, and bias. The goal is practical: help engineers locate root causes and pick module-level defenses and assessments.

Problem Statement

LLM safety work is scattered across content-level metrics and ad hoc fixes. Engineers lack a compact, module-oriented view that maps each risk to the system component (input, model, toolchain, output), the standard mitigations, and the benchmarks needed to test them.

Main Contribution

A module-oriented taxonomy that links risks to four system modules: input, language model, toolchain, and output.

A systematic survey of mitigation strategies organized by module (35 sub-techniques across 12 defenses).

Key Findings

LLM risks are multi-source and map cleanly to system modules.

Numberstaxonomy: 4 modules, 12 risks, 44 sub-topics

Practical UseWhen an issue appears (e.g., a data leak), inspect input, model, and toolchain separately to pick targeted defenses instead of only tuning the model.

Evidence RefFigure 3 and text (Section III–IV)

Toxic content exists in large pretraining corpora; simple corpus contamination is measurable.

NumbersLLaMA2 pretraining ~0.2% documents labeled toxic

Practical UseAdd toxicity scanning and targeted filtering during data curation; expect small fractions of toxic documents to still affect outputs.

Evidence RefSection IV, Toxic Training Data (LLaMA2 citation [4])

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
toxic documents in pretraining corpus (LLaMA2)≈0.2% of documents labeled toxicLLaMA2 pretraining corpora (cited)LLaMA2 dataset analysis reported ~0.2% toxic docsSection IV.B (Toxic Training Data) citing [4]
adversarial examples catalogued583,884 adversarial examples in PromptBenchPromptBenchPromptBench includes 583,884 adversarial examples across granularitiesSection VI.A (PromptBench) citing [397]

What To Try In 7 Days

Run a red-team suite of adversarial prompts (e.g., samples from PromptBench) against your chat flow.

Add an input safeguard: format-enforced prompts + a lightweight prompt classifier.

Scan training/fine-tuning corpora for duplicates, PII and toxic text; apply selective sanitization/deduplication.

Reproducibility

Code AvailableNo
Data AvailableNo
Open Source StatusNo
LicenseUnknown

Risks & Boundaries

Limitations

Survey paper — no new technical defenses or experiments.

Benchmarks evolve rapidly; catalog may miss newest datasets or attacks.

When Not To Use

As a substitute for a formal security audit of code, infra, or RLHF pipelines.

To claim that any single mitigation will eliminate risk without empirical validation.

Failure Modes

Taxonomy becomes outdated as new jailbreaks and CVEs appear.

Benchmarks can have blind spots (judge bias, dataset leakage, limited domain coverage).

Core Entities

Models

GPT-4ChatGPTGPT-3.5GPT-3LLaMALLaMA2Flan-T5BLOOMGPT-NeoAlpacaVicunaBard

Metrics

adversarial robustnessout-of-distribution robustnesstruthfulnesshallucination ratetoxicitybiasprivacy leakage

Datasets

PromptBenchAdvGLUEANLIGLUE-XBOSSTruthfulQAHaDesWikibroMed-HALTHaluEvalLevy/HoltREALTOXICITYPROMPTSCommonClaimHateXplainTOXIGENCOLDSafetyPromptsCValuesFaiRLLMBOLDStereoSetHOLISTICBIASCDial-Bias

Benchmarks

PromptBenchAdvGLUETruthfulQAHaDesMed-HALTHaluEvalREALTOXICITYPROMPTSTOXIGENBOSSGLUE-XBOLDStereoSet