Alignment and Safety Papers — Parsed & Scored for Practitioners

Llama 2: open release of 7B–70B pretrained models and RLHF‑tuned chat models that are competitive in human evaluations

0.70

0.30

0.60

2,595

Llama 2 provides openly available pretrained and RLHF‑tuned chat models that are competitive with closed models on many human-evaluated tasks, lowering the entry cost for companies that need high-quality chat AI while allowing customization and internal safety tuning.

Key finding

Llama 2 pretrained on ~2 trillion tokens; models range 7B–70B parameters.

Numbers: 2.0T tokens; sizes 7B, 13B, 34B, 70B

Practical review of data, training, and evaluation methods to align LLMs with human preferences

0.60

0.40

0.70

54

Aligning LLMs reduces risky outputs and increases usefulness; using parameter-efficient tuning cuts compute costs and enables faster iteration.

Key finding

Small sets of high-quality instructions can suffice to produce alignment effects.

Numbers: LLaMA needs ~8K instructions (IFS); other work reports ~6K high-quality instructions

Llama Guard — an adaptable LLM filter that flags unsafe user prompts and AI responses

0.70

0.35

0.45

44

Llama Guard is a deployable, customizable safety filter that runs locally, adapts to new policies via prompts or light fine-tuning, and matches or beats common moderation APIs on public and internal tests.

Key finding

High in-policy classification performance on internal test set.

Numbers: AUPRC prompt=0.945; response=0.953 (Table 2)
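
For readers unfamiliar with the metric, AUPRC (area under the precision-recall curve) can be approximated by average precision over a ranked list of classifier scores. The sketch below is a minimal pure-Python version on invented toy data; the function name and the toy labels and scores are illustrative assumptions, not Llama Guard's evaluation code.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for unsafe, 0 for safe.
    scores: the classifier's estimated probability of "unsafe".
    Ties in scores keep their input order (a simplification).
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from most to least confident "unsafe".
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


# Toy data (invented for illustration only).
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auprc = average_precision(toy_labels, toy_scores)
print(round(toy_auprc, 4))
```

A value near 1.0 means unsafe items are ranked almost entirely above safe ones, which is how the 0.945 and 0.953 figures above should be read.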

Practical, end-to-end guide to fine-tuning LLMs: pipelines, PEFT, RAG, alignment and deployment

0.70

0.35

0.70

39

Fine-tuning and RAG let you customize LLM behavior and accuracy while controlling cost; PEFT and quantization let you ship tailored models without enterprise-scale GPU fleets.

Key finding

QLoRA compresses model parameters and enables 4-bit fine-tuning while retaining near-16-bit performance.

Numbers: Reduces to ~5.2 bits/parameter (from 96 bits); ~18x memory reduction
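
The arithmetic behind the ~18x figure can be checked directly from the two per-parameter numbers quoted above. The sketch below also applies them to a 7B-parameter model, which is an illustrative assumption; it counts only the per-parameter footprint and ignores activations and other runtime overhead.

```python
BITS_BEFORE = 96.0  # per-parameter footprint quoted above
BITS_AFTER = 5.2    # per-parameter footprint quoted above


def footprint_gb(num_params, bits_per_param):
    """Memory in gigabytes (1e9 bytes) for a given per-parameter footprint."""
    return num_params * bits_per_param / 8 / 1e9


reduction = BITS_BEFORE / BITS_AFTER  # roughly 18.5x

# Hypothetical model size, chosen only to make the numbers concrete.
params_7b = 7e9
before_gb = footprint_gb(params_7b, BITS_BEFORE)
after_gb = footprint_gb(params_7b, BITS_AFTER)
print(f"{reduction:.1f}x: {before_gb:.1f} GB -> {after_gb:.2f} GB")
```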

Practical survey of methods, attacks, and evaluations for aligning large language models

0.45

0.40

0.50

34

Misaligned LLMs can produce legal, reputational, and safety failures. Alignment methods reduce harmful outputs but need governance, red-teaming, and evaluation to manage adversarial and privacy risks.

Key finding

Reinforcement Learning from Human Feedback (RLHF) is the most common non-recursive oversight method for aligning LLMs.

Decouple helpfulness and harmlessness, then use a Lagrangian Safe-RL step to trade off both during RLHF

0.50

0.60

0.40

20

Safe RLHF lets you improve usefulness without sacrificing safety by separating labels and using a dynamic constraint; this reduces harmful outputs strongly while preserving or increasing helpfulness, lowering moderation load and risk.

Key finding

Iterative Safe RLHF reduced measured harmful responses from Alpaca-7B's 53.08% to 2.45% on the paper's evaluation set.

Numbers: Harmful probability 53.08% → 2.45%
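
The "dynamic constraint" idea can be illustrated with a generic Lagrangian relaxation: the policy maximizes reward minus a penalty on cost, and the penalty weight rises whenever measured cost exceeds a limit. The sketch below is a toy version under stated assumptions; the function names, step size, cost limit, and cost values are hypothetical and this is not the paper's training code.

```python
def update_multiplier(lmbda, mean_cost, cost_limit, step_size):
    """Projected dual-ascent step: raise lambda when the cost limit is exceeded,
    lower it (never below zero) when the policy is safely under the limit."""
    return max(0.0, lmbda + step_size * (mean_cost - cost_limit))


def combined_objective(mean_reward, mean_cost, cost_limit, lmbda):
    """Lagrangian the policy would maximize: helpfulness reward minus
    lambda times the constraint violation."""
    return mean_reward - lmbda * (mean_cost - cost_limit)


# Toy trajectory of batch-average harmfulness costs (invented values).
lmbda = 0.0
history = []
for mean_cost in [0.5, 0.3, -0.2]:
    lmbda = update_multiplier(lmbda, mean_cost, cost_limit=0.0, step_size=1.0)
    history.append(lmbda)
print(history)
```

In this toy run the multiplier grows while the cost is above the limit and relaxes once the cost drops below it, which is the trade-off mechanism the summary describes.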

FaiRLLM: a benchmark showing ChatGPT gives uneven recommendations across user attributes

0.50

0.60

0.25

17

If you use LLMs to generate recommendations, they can favor or disfavor user groups; auditing with a generative-aware fairness test prevents reputational and regulatory risk.

Key finding

ChatGPT shows measurable unfairness on movie recommendations when measured by pairwise ranking agreement (PRAG*@20).

Numbers: Movie PRAG*@20 SNSR up to 0.2191; SNSV up to 0.0828 (Table 1)
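
To make the two reported statistics concrete, the sketch below assumes SNSR is the range and SNSV is the population standard deviation of per-group similarity scores, where each score compares a group's recommendations with those for a neutral prompt. That reading of the definitions is an assumption to verify against the paper, and the group names and scores are invented.

```python
import statistics


def snsr(similarity_by_group):
    """Range of similarity scores across groups (assumed definition of SNSR)."""
    values = list(similarity_by_group.values())
    return max(values) - min(values)


def snsv(similarity_by_group):
    """Population standard deviation across groups (assumed definition of SNSV)."""
    return statistics.pstdev(list(similarity_by_group.values()))


# Invented per-group similarity to the neutral recommendation list.
toy_similarity = {"group_a": 0.6, "group_b": 0.4, "group_c": 0.5}
toy_range = snsr(toy_similarity)
toy_spread = snsv(toy_similarity)
print(round(toy_range, 4), round(toy_spread, 4))
```

Under this reading, larger values of either statistic mean recommendations shift more when only the stated user attribute changes.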

LLMs write biased recommendation letters: women as warm, men as leaders

0.30

0.45

0.40

17

Automatically generated recommendation letters can embed gendered tone and hallucinated details that harm applicants and expose organizations to unfair hiring decisions and reputational or legal risk.

Key finding

Model-generated letters for men score far higher on agency than for women.

Numbers: ChatGPT agency t=10.47, p=1.02e-25 (Table 4)

Chain-of-Utterances prompts reliably jailbreak LLMs; fine-tuning on curated safe conversations reduces harm

0.50

0.60

0.40

16

CoU-style prompts can often bypass deployed guardrails; test and harden public-facing LLMs, or fine-tune smaller models on curated safe conversations to reduce harmful outputs without losing much utility.

Key finding

RED-EVAL frequently jailbreaks widely deployed closed-source APIs.

Numbers: GPT-4 ASR 0.651; ChatGPT ASR 0.728 on tested harmful prompts

Two prompt-based tests uncover widespread implicit stereotypes in value-aligned LLMs that pass standard bias benchmarks

0.60

0.65

0.45

14

Even value-aligned, safety-trained LLMs can hold hidden associations that change outcomes in hiring, recommendations, or role assignments; prompt-based behavioral tests let you find risks without model internals.

Key finding

Prompt-based LLM Implicit Bias finds stereotype associations in 19 of 21 tested stereotype types across models.

Numbers: 19/21 stereotype types

A practical review of where LLM bias comes from, how to test it, and common fixes

0.50

0.30

0.60

13

Biased LLM outputs can cause legal risk, reputational harm, and unfair customer outcomes; fixing bias early reduces downstream remediation cost and regulatory exposure.

Key finding

Toxicity can emerge quickly from benign prompts in generative LLMs.

Numbers: toxicity > 0.5 within <100 generations

WMDP: a public 3,668-question benchmark plus RMU unlearning to measure and remove hazardous LLM knowledge

0.60

0.70

13

WMDP + RMU let providers reduce hazardous knowledge in served models and demonstrate a practical mitigation that preserves most useful capabilities, lowering legal and reputational risk from malicious model use.

Key finding

WMDP is a sizable, vetted public benchmark for hazardous knowledge.

Numbers: 3,668 multiple-choice questions; development cost >$200K

Open benchmark and a tuned LLM (CALM) show GPT-4-level credit scoring but expose measurable bias

0.60

0.50

0.40

12

LLMs can cut prototype time: GPT-4 often matches expert pipelines on some credit tasks and a tuned open model (CALM) can match closed models, but fairness checks are mandatory before any customer-facing use.

Key finding

GPT-4 can reach near-expert accuracy on some credit tasks.

Numbers: Lending Club Acc 0.762 vs SOTA 0.777; Travel Insurance F1 0.897 vs SOTA 0.912

Use small LLM agents to filter and block jailbreak responses from larger models

0.60

0.70

11

AutoDefense offers a plug-in, model-agnostic layer to block harmful outputs without retraining or changing user prompts, reducing legal and reputational risk while keeping product utility.

Key finding

Three-agent AutoDefense with LLaMA-2-13B cuts GPT-3.5 ASR from 55.74% to 7.95% on the DAN jailbreak set.

Numbers: ASR 55.74% → 7.95% (DAN, GPT-3.5 victim)
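
A response-side filter of this kind can be pictured as a wrapper that leaves the user prompt untouched and only inspects the model's reply. The sketch below is a toy illustration: `guarded_reply`, `toy_generate`, and `toy_is_harmful` are hypothetical stand-ins (a keyword check replaces the multi-agent judgment), not AutoDefense's API. It also shows how attack success rate (ASR) and the relative reduction implied by the figures above are computed.

```python
def guarded_reply(prompt, generate, is_harmful, refusal="I can't help with that."):
    """Call the victim model unchanged, then block the reply if it is judged harmful."""
    response = generate(prompt)
    return refusal if is_harmful(response) else response


def attack_success_rate(outcomes):
    """Fraction of attack prompts that produced a harmful reply."""
    return sum(outcomes) / len(outcomes)


def relative_reduction(before, after):
    return (before - after) / before


# Hypothetical stand-ins for the victim model and the defense judgment.
def toy_generate(prompt):
    return "HARMFUL details" if "jailbreak" in prompt else "benign answer"


def toy_is_harmful(response):
    return "HARMFUL" in response


prompts = ["jailbreak: do X", "what is 2+2?", "jailbreak: do Y", "tell me a joke"]
asr_undefended = attack_success_rate(
    [toy_is_harmful(toy_generate(p)) for p in prompts]
)
asr_defended = attack_success_rate(
    [toy_is_harmful(guarded_reply(p, toy_generate, toy_is_harmful)) for p in prompts]
)

# Relative reduction implied by the reported figures (55.74% -> 7.95%).
reported_drop = relative_reduction(55.74, 7.95)
print(asr_undefended, asr_defended, round(reported_drop, 3))
```

The reported numbers correspond to roughly an 86% relative reduction in ASR on that jailbreak set.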

Latent-jailbreak benchmark: test if hidden malicious text breaks model safety or instruction following

0.50

0.60

0.50

11

Hidden malicious instructions can either make models produce unsafe content or cause over-refusal and lost utility; you must measure both safety and instruction-following before deployment to avoid surprise failures.

Key finding

Different models have very different jailbreak success rates on the same prompts.

Numbers: P1 jailbreak success: ChatGLM2-6B 75.3%, BELLE-7B-2M 50.4%, ChatGPT 22.6% (Table 4)

ChiMed‑GPT: a 13B Chinese medical LLM trained with pretraining, SFT and RLHF for safer, better medical answers

0.60

0.45

0.50

10

ChiMed‑GPT is a practical open-source Chinese medical LLM that gives clearer patient-facing answers, handles longer clinical text (4,096 tokens), and lowers risky biased replies — useful for telemedicine, triage bots, and medical content generation.

Key finding

Open-ended QA (BLEU-1): ChiMed‑GPT scored higher than GPT-4 on the tested dataset.

Numbers: BLEU-1 33.14 (ChiMed‑GPT) vs 24.29 (GPT-4)

A formal framework and first quantitative benchmark showing prompt-injection attacks are broadly effective and current defenses fall short

0.50

0.70

0.60

10

Any app that feeds external text into an LLM is attackable: attackers can hide instructions in user-provided data and change app outputs. Benchmarks show attacks work across models and tasks, and many defenses either miss attacks or break normal performance.

Key finding

Prompt injection attacks are broadly effective across tasks and models.

Numbers: Combined Attack ASV=0.62 and MR=0.78 averaged over 10 LLMs and 7×7 task pairs

AgentPoison: a stealthy backdoor that poisons agent memories or RAG to hijack LLM agents

0.70

9

If agents fetch data from third-party or writable corpora, an attacker can inject a few poisoned records to trigger dangerous actions while leaving overall accuracy unchanged, creating a low-noise safety and legal risk.

Key finding

AGENTPOISON forces retrieval of poisoned demonstrations with high probability.

Numbers: Average ASR-r ≈ 81.2% (retrieval success)

Have LLMs judge and train themselves: iterative self-rewards boost instruction-following and the model's own evaluator

0.60

0.70

0.60

9

Self-rewarding training can reduce dependence on large human-preference datasets by letting an LLM generate and score its own training data, lowering labeling cost and enabling iterative improvement—but it needs monitoring for safety and domain gaps.

Key finding

Instruction-following win rate against GPT-4 Turbo (AlpacaEval 2.0) rose across iterations.

Numbers: M1 9.94% → M2 15.38% → M3 20.44%

BIPIA: a large benchmark and practical defenses for indirect prompt injection attacks on LLMs

0.70

0.60

0.50

9

External content can silently hijack LLM outputs. Measure exposure with BIPIA and add simple defenses now; full model fine-tuning yields stronger protection if you control the model.

Key finding

All evaluated LLMs show vulnerability to indirect prompt injection on BIPIA.

Numbers: Average overall ASR = 0.1179 (11.79%) on BIPIA (Table 2)

Teach an LLM to 'forget' bad behaviors using only negative examples and cheap finetuning

0.60

0.80

9

If your priority is to stop a model from producing specific harmful or copyrighted outputs quickly and cheaply, unlearning cuts those outputs dramatically with only finetune-level compute and no costly human-written positive examples.

Key finding

Unlearning can reduce harmful output rates to near zero on evaluated harmful prompts.

Numbers: harmful rate 47% → 1% (OPT-1.3B, Table 3)

GPT-4 agents autonomously exploit sandboxed website vulnerabilities (11/15) and find at least one real XSS

0.20

0.60

8

High-capability LLM agents can automate complex web attacks at lower estimated cost than manual analysts, increasing the risk surface for companies that expose web interfaces.

Key finding

GPT-4 agent succeeded on most sandboxed vulnerabilities.

Numbers: Pass@5 = 73.3%; overall success = 42.7% (Table 2)
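
The two figures measure different things: Pass@5 counts a vulnerability as solved if any of five attempts succeeds (11 of 15 gives 73.3%), while overall success is the fraction of all individual attempts that succeed. The sketch below shows both on invented trial data; the function names and the toy outcomes are illustrative assumptions, not the paper's results.

```python
def pass_at_k(trials_per_task):
    """Share of tasks with at least one successful attempt among the recorded trials."""
    solved = sum(1 for trials in trials_per_task if any(trials))
    return solved / len(trials_per_task)


def overall_success(trials_per_task):
    """Share of all individual attempts that succeeded."""
    flat = [outcome for trials in trials_per_task for outcome in trials]
    return sum(flat) / len(flat)


# Invented outcomes: three tasks, five attempts each.
toy_trials = [
    [True, False, True, False, False],
    [False, False, False, False, False],
    [False, False, False, False, True],
]
toy_pass = pass_at_k(toy_trials)
toy_overall = overall_success(toy_trials)

# The headline figure from the title: 11 of 15 vulnerabilities.
headline_pass_at_5 = 11 / 15
print(round(toy_pass, 3), round(toy_overall, 3), round(headline_pass_at_5, 3))
```

Because a single lucky attempt is enough for Pass@5, it is always at least as high as the per-attempt success rate.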

AnimaLLM: a prototype that scores LLM outputs for truthfulness and how well they consider animals' interests

0.20

0.50

0.20

7

LLMs can embed species and welfare biases that affect products (education, vet advice, policy tools); measuring these biases early helps avoid reputational, legal, or welfare harms.

Key finding

AnimaLLM produced comprehensive score sets for two commercial LLMs.

Numbers: 3,264 S1 and 3,264 S2 scores per model

RAIN: align frozen LLMs at inference by self-evaluation and token rewinding

0.60

0.70

0.60

7

RAIN lets you reduce harmful or untruthful outputs from deployed LLMs without costly retraining or human labels; trade latency for safety and consider using RAIN-generated data to finetune if latency is critical.

Key finding

RAIN raised harmlessness of LLaMA 30B from 82% to 97% on the HH dataset.

Numbers: 82% → 97%