A module-oriented survey that maps safety risks, defenses, and benchmarks across input, model, toolchain, and output components

January 11, 20247 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.5

Citation Count

13

Authors

Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, Zhixing Tan, Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li

Links

Abstract / PDF

Why It Matters For Business

Mapping risks to system modules lets teams prioritize fixes (input guards, data curation, toolchain hardening, output filters) and reduce privacy, legal, and outage risks.

Summary TLDR

This paper is a focused survey that organizes safety and security risks of LLM systems by module (input, language model, toolchain, output). It lists 12 risks and 44 subtopics, maps concrete mitigation patterns (e.g., input guards, differential privacy, RLHF, provenance, output detectors and watermarking), and reviews common benchmarks for robustness, hallucination, toxicity, privacy, and bias. The goal is practical: help engineers locate root causes and pick module-level defenses and assessments.

Problem Statement

LLM safety work is scattered across content-level metrics and ad hoc fixes. Engineers lack a compact, module-oriented view that maps each risk to the system component (input, model, toolchain, output), the standard mitigations, and the benchmarks needed to test them.

Main Contribution

A module-oriented taxonomy that links risks to four system modules: input, language model, toolchain, and output.

A systematic survey of mitigation strategies organized by module (35 sub-techniques across 12 defenses).

A review and catalog of prevalent benchmarks and datasets for evaluating robustness, hallucination, toxicity, privacy, and bias.

Key Findings

LLM risks are multi-source and map cleanly to system modules.

Numberstaxonomy: 4 modules, 12 risks, 44 sub-topics

Toxic content exists in large pretraining corpora; simple corpus contamination is measurable.

NumbersLLaMA2 pretraining ~0.2% documents labeled toxic

Adversarial and jailbreak prompts are abundant and effective.

NumbersPromptBench collects 583,884 adversarial examples; Prompt injection and jailbreak taxonomies detailed

Hallucinations remain common and take distinct forms with measurable shares in evaluations.

NumbersMedical-domain hallucination splits: GPT-3.5 27%/43%; GPT-4 25%/33%; Bard 8%/44%

Toolchain and infrastructure introduce novel attack surfaces (supply chain, CVEs, hardware side-channels).

NumbersMultiple CVEs referenced across runtime, frameworks, pre-processing (e.g., CVE-2022-48564, CVE-2023-25674)

Results

toxic documents in pretraining corpus (LLaMA2)

Value≈0.2% of documents labeled toxic

adversarial examples catalogued

Value583,884 adversarial examples in PromptBench

medical-domain hallucination shares (two types)

ValueGPT-3.5: 27%/43% ; GPT-4: 25%/33% ; Bard: 8%/44%

Who Should Care

What To Try In 7 Days

Run a red-team suite of adversarial prompts (e.g., samples from PromptBench) against your chat flow.

Add an input safeguard: format-enforced prompts + a lightweight prompt classifier.

Scan training/fine-tuning corpora for duplicates, PII and toxic text; apply selective sanitization/deduplication.

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Survey paper — no new technical defenses or experiments.
  • Benchmarks evolve rapidly; catalog may miss newest datasets or attacks.
  • High-level mitigations need engineering work to integrate into production.

When Not To Use

  • As a substitute for a formal security audit of code, infra, or RLHF pipelines.
  • To claim that any single mitigation will eliminate risk without empirical validation.

Failure Modes

  • Taxonomy becomes outdated as new jailbreaks and CVEs appear.
  • Benchmarks can have blind spots (judge bias, dataset leakage, limited domain coverage).
  • Watermarks and detectors can be bypassed by paraphrasing and adversaries.

Core Entities

Models

  • GPT-4
  • ChatGPT
  • GPT-3.5
  • GPT-3
  • LLaMA
  • LLaMA2
  • Flan-T5
  • BLOOM
  • GPT-Neo
  • Alpaca
  • Vicuna
  • Bard

Metrics

  • adversarial robustness
  • out-of-distribution robustness
  • truthfulness
  • hallucination rate
  • toxicity
  • bias
  • privacy leakage

Datasets

  • PromptBench
  • AdvGLUE
  • ANLI
  • GLUE-X
  • BOSS
  • TruthfulQA
  • HaDes
  • Wikibro
  • Med-HALT
  • HaluEval
  • Levy/Holt
  • REALTOXICITYPROMPTS
  • CommonClaim
  • HateXplain
  • TOXIGEN
  • COLD
  • SafetyPrompts
  • CValues
  • FaiRLLM
  • BOLD
  • StereoSet
  • HOLISTICBIAS
  • CDial-Bias

Benchmarks

  • PromptBench
  • AdvGLUE
  • TruthfulQA
  • HaDes
  • Med-HALT
  • HaluEval
  • REALTOXICITYPROMPTS
  • TOXIGEN
  • BOSS
  • GLUE-X
  • BOLD
  • StereoSet