Overview
The taxonomy and benchmark catalog are practically useful for audits and product planning, but the paper is a survey (no new tooling). Benchmarks and CVE lists must be kept current; deployers should combine these recommendations with live security practices.
Citations13
Evidence Strength0.70
Confidence0.80
Risk Signals8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/3
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
Mapping risks to system modules lets teams prioritize fixes (input guards, data curation, toolchain hardening, output filters) and reduce privacy, legal, and outage risks.
Who Should Care
Summary TLDR
This paper is a focused survey that organizes safety and security risks of LLM systems by module (input, language model, toolchain, output). It lists 12 risks and 44 subtopics, maps concrete mitigation patterns (e.g., input guards, differential privacy, RLHF, provenance, output detectors and watermarking), and reviews common benchmarks for robustness, hallucination, toxicity, privacy, and bias. The goal is practical: help engineers locate root causes and pick module-level defenses and assessments.
Problem Statement
LLM safety work is scattered across content-level metrics and ad hoc fixes. Engineers lack a compact, module-oriented view that maps each risk to the system component (input, model, toolchain, output), the standard mitigations, and the benchmarks needed to test them.
Main Contribution
A module-oriented taxonomy that links risks to four system modules: input, language model, toolchain, and output.
A systematic survey of mitigation strategies organized by module (35 sub-techniques across 12 defenses).
Key Findings
LLM risks are multi-source and map cleanly to system modules.
Toxic content exists in large pretraining corpora; simple corpus contamination is measurable.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| toxic documents in pretraining corpus (LLaMA2) | ≈0.2% of documents labeled toxic | — | — | LLaMA2 pretraining corpora (cited) | LLaMA2 dataset analysis reported ~0.2% toxic docs | Section IV.B (Toxic Training Data) citing [4] |
| adversarial examples catalogued | 583,884 adversarial examples in PromptBench | — | — | PromptBench | PromptBench includes 583,884 adversarial examples across granularities | Section VI.A (PromptBench) citing [397] |
What To Try In 7 Days
Run a red-team suite of adversarial prompts (e.g., samples from PromptBench) against your chat flow.
Add an input safeguard: format-enforced prompts + a lightweight prompt classifier.
Scan training/fine-tuning corpora for duplicates, PII and toxic text; apply selective sanitization/deduplication.
Reproducibility
Risks & Boundaries
Limitations
Survey paper — no new technical defenses or experiments.
Benchmarks evolve rapidly; catalog may miss newest datasets or attacks.
When Not To Use
As a substitute for a formal security audit of code, infra, or RLHF pipelines.
To claim that any single mitigation will eliminate risk without empirical validation.
Failure Modes
Taxonomy becomes outdated as new jailbreaks and CVEs appear.
Benchmarks can have blind spots (judge bias, dataset leakage, limited domain coverage).

