AdversaRiskQA: adversarial factuality benchmark for health, finance, and law

January 21, 20267 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Adam Szelestey, Sofie van Engelen, Tianhao Huang, Justin Snelders, Qintao Zeng, Songgaojun Deng

Links

Abstract / PDF

Why It Matters For Business

AdversaRiskQA helps reveal whether an LLM will accept confidently stated falsehoods in health, finance, or law—areas where mistakes have real consequences. Use it to catch domain-specific weaknesses before deployment.

Summary TLDR

This paper introduces AdversaRiskQA, a new adversarial factuality benchmark for high-risk domains (health, finance, law). Each domain has basic and advanced items. The authors provide two automated evaluators (an LLM-as-judge and a search-augmented agentic factuality checker) and test six models (Qwen3 series, GPT-OSS 20B/120B, GPT-5). On cleaned data, Qwen3-Next-80B scores 94.7% mean accuracy and GPT-5 91.4% (Table 4). Long-form checks on Qwen3-30B show high F1@8 (≈89–92) and a negligible overall drop in correct facts under adversarial injection (mean -0.05). Code and prompts are open-sourced.

Problem Statement

There is no domain-specific, high-quality benchmark that tests how LLMs handle confidently framed, injected misinformation (adversarial factuality) in high-risk areas or how such injections affect long-form factuality.

Main Contribution

AdversaRiskQA: a three-domain (health, finance, law) adversarial QA benchmark with basic and advanced splits.

Two automated evaluators: an LLM-as-judge template and a search-augmented agentic long-form factuality checker adapted from SAFE.

Systematic evaluation of six models across sizes and families, plus manual validation to measure judge reliability.

Open-sourced prompts, datasets, and evaluation code for reproducibility.

Key Findings

Top open model (Qwen3-Next-80B) achieves high accuracy after filtering invalid outputs.

NumbersFiltered mean accuracy = 94.7% (Table 4)

GPT-5 performs strongly across domains.

NumbersFiltered mean accuracy = 91.4% (Table 4)

Adversarial injections have little consistent effect on long-form factuality for the tested model.

NumbersMean change in correct facts = -0.05 per question (Table 7)

Automated judge is largely reliable versus humans.

NumbersLLM judge agrees with human reviewers 90–95% (Section 3.2.2)

Many errors result from invalid or failed generations rather than reasoning mistakes.

Numbers175 unique removed entries due to failures (Appendix A, Table 10)

Results

Accuracy

Value94.7%

Accuracy

Value91.4%

Long-form factuality (Qwen3-30B F1@8)

ValueApprox. 89–92 (varies by domain and difficulty)

Mean change in correct facts due to adversarial injection (Qwen3-30B)

Value-0.05 correct facts per question

BaselineNon-adversarial

LLM judge vs. human agreement

Value90–95%

Who Should Care

What To Try In 7 Days

Run AdversaRiskQA on your model to find domain blind spots and failure modes.

Filter out invalid outputs (null, prompt echo, template leakage) before scoring.

Add an LLM-as-judge plus a small manual review loop to scale factuality checks.

Agent Features

Planning

  • self-decomposition of long answers into facts

Tool Use

  • web_search tool
  • OpenAI Response API

Frameworks

  • SAFE-inspired factuality pipeline

Is Agentic

true

Architectures

  • LLM-as-a-judge
  • search-augmented agentic evaluator

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Datasets are relatively small and English-only, limiting generalization.
  • Long-form factuality check was run on one representative model only.
  • Dataset creation and some evaluation steps used GPT-5, which may bias items or judging.
  • Safety filters and failed generations removed many items; results depend on filtering choices.

When Not To Use

  • If you need multilingual adversarial testing (benchmark is English-only).
  • As the sole safety check for production-critical decisions without expert review.
  • For model families not represented here without additional validation.

Failure Modes

  • Null outputs (empty/invalid responses)
  • Prompt echo (model repeats prompt instead of answering)
  • Template leakage (model outputs system/instruction text)
  • Safety-filter refusals causing truncated or missing answers

Core Entities

Models

  • Qwen3-4B
  • Qwen3-30B
  • Qwen3-Next-80B
  • GPT-OSS-20B
  • GPT-OSS-120B
  • GPT-5
  • GPT-5-mini

Metrics

  • Accuracy
  • F1@K
  • Mean #facts
  • Mean #correct facts

Datasets

  • AdversaRiskQA
  • HealthFC
  • FALQU
  • GPT-5 generated finance facts

Benchmarks

  • AdversaRiskQA