Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.5
Citation Count
13
Why It Matters For Business
Mapping risks to system modules lets teams prioritize fixes (input guards, data curation, toolchain hardening, output filters) and reduce privacy, legal, and outage risks.
Summary TLDR
This paper is a focused survey that organizes safety and security risks of LLM systems by module (input, language model, toolchain, output). It lists 12 risks and 44 subtopics, maps concrete mitigation patterns (e.g., input guards, differential privacy, RLHF, provenance, output detectors and watermarking), and reviews common benchmarks for robustness, hallucination, toxicity, privacy, and bias. The goal is practical: help engineers locate root causes and pick module-level defenses and assessments.
Problem Statement
LLM safety work is scattered across content-level metrics and ad hoc fixes. Engineers lack a compact, module-oriented view that maps each risk to the system component (input, model, toolchain, output), the standard mitigations, and the benchmarks needed to test them.
Main Contribution
A module-oriented taxonomy that links risks to four system modules: input, language model, toolchain, and output.
A systematic survey of mitigation strategies organized by module (35 sub-techniques across 12 defenses).
A review and catalog of prevalent benchmarks and datasets for evaluating robustness, hallucination, toxicity, privacy, and bias.
Key Findings
LLM risks are multi-source and map cleanly to system modules.
Toxic content exists in large pretraining corpora; simple corpus contamination is measurable.
Adversarial and jailbreak prompts are abundant and effective.
Hallucinations remain common and take distinct forms with measurable shares in evaluations.
Toolchain and infrastructure introduce novel attack surfaces (supply chain, CVEs, hardware side-channels).
Results
toxic documents in pretraining corpus (LLaMA2)
adversarial examples catalogued
medical-domain hallucination shares (two types)
Who Should Care
What To Try In 7 Days
Run a red-team suite of adversarial prompts (e.g., samples from PromptBench) against your chat flow.
Add an input safeguard: format-enforced prompts + a lightweight prompt classifier.
Scan training/fine-tuning corpora for duplicates, PII and toxic text; apply selective sanitization/deduplication.
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Survey paper — no new technical defenses or experiments.
- Benchmarks evolve rapidly; catalog may miss newest datasets or attacks.
- High-level mitigations need engineering work to integrate into production.
When Not To Use
- As a substitute for a formal security audit of code, infra, or RLHF pipelines.
- To claim that any single mitigation will eliminate risk without empirical validation.
Failure Modes
- Taxonomy becomes outdated as new jailbreaks and CVEs appear.
- Benchmarks can have blind spots (judge bias, dataset leakage, limited domain coverage).
- Watermarks and detectors can be bypassed by paraphrasing and adversaries.
Core Entities
Models
- GPT-4
- ChatGPT
- GPT-3.5
- GPT-3
- LLaMA
- LLaMA2
- Flan-T5
- BLOOM
- GPT-Neo
- Alpaca
- Vicuna
- Bard
Metrics
- adversarial robustness
- out-of-distribution robustness
- truthfulness
- hallucination rate
- toxicity
- bias
- privacy leakage
Datasets
- PromptBench
- AdvGLUE
- ANLI
- GLUE-X
- BOSS
- TruthfulQA
- HaDes
- Wikibro
- Med-HALT
- HaluEval
- Levy/Holt
- REALTOXICITYPROMPTS
- CommonClaim
- HateXplain
- TOXIGEN
- COLD
- SafetyPrompts
- CValues
- FaiRLLM
- BOLD
- StereoSet
- HOLISTICBIAS
- CDial-Bias
Benchmarks
- PromptBench
- AdvGLUE
- TruthfulQA
- HaDes
- Med-HALT
- HaluEval
- REALTOXICITYPROMPTS
- TOXIGEN
- BOSS
- GLUE-X
- BOLD
- StereoSet

