AdversaRiskQA: adversarial factuality benchmark for health, finance, and law

January 21, 20267 min

Overview

Decision SnapshotNeeds Validation

The benchmark and evaluators are practical and reproducible; judge agrees with humans 90–95%, but datasets are small, English-only, and some methods depend on costly web-based evaluation.

Citations0

Evidence Strength0.80

Confidence0.85

Risk Signals11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 70%

Novelty: 60%

Authors

Adam Szelestey, Sofie van Engelen, Tianhao Huang, Justin Snelders, Qintao Zeng, Songgaojun Deng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AdversaRiskQA helps reveal whether an LLM will accept confidently stated falsehoods in health, finance, or law—areas where mistakes have real consequences. Use it to catch domain-specific weaknesses before deployment.

Who Should Care

Summary TLDR

This paper introduces AdversaRiskQA, a new adversarial factuality benchmark for high-risk domains (health, finance, law). Each domain has basic and advanced items. The authors provide two automated evaluators (an LLM-as-judge and a search-augmented agentic factuality checker) and test six models (Qwen3 series, GPT-OSS 20B/120B, GPT-5). On cleaned data, Qwen3-Next-80B scores 94.7% mean accuracy and GPT-5 91.4% (Table 4). Long-form checks on Qwen3-30B show high F1@8 (≈89–92) and a negligible overall drop in correct facts under adversarial injection (mean -0.05). Code and prompts are open-sourced.

Problem Statement

There is no domain-specific, high-quality benchmark that tests how LLMs handle confidently framed, injected misinformation (adversarial factuality) in high-risk areas or how such injections affect long-form factuality.

Main Contribution

AdversaRiskQA: a three-domain (health, finance, law) adversarial QA benchmark with basic and advanced splits.

Two automated evaluators: an LLM-as-judge template and a search-augmented agentic long-form factuality checker adapted from SAFE.

Key Findings

Top open model (Qwen3-Next-80B) achieves high accuracy after filtering invalid outputs.

NumbersFiltered mean accuracy = 94.7% (Table 4)

Practical UseRun adversarial checks and filter invalid responses first; strong open models can reach >90% correction on curated adversarial QA.

Evidence RefTable 4

GPT-5 performs strongly across domains.

NumbersFiltered mean accuracy = 91.4% (Table 4)

Practical UseClosed‑source SOTA models remain reliable defenders against confident misinformation but still need domain checks.

Evidence RefTable 4

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Accuracy94.7%All domains (Table 4)Qwen3-Next-80B filtered mean shown in Table 4Table 4
Accuracy91.4%All domains (Table 4)GPT-5 filtered mean shown in Table 4Table 4

What To Try In 7 Days

Run AdversaRiskQA on your model to find domain blind spots and failure modes.

Filter out invalid outputs (null, prompt echo, template leakage) before scoring.

Add an LLM-as-judge plus a small manual review loop to scale factuality checks.

Agent Features

Planning
self-decomposition of long answers into facts
Tool Use
web_search toolOpenAI Response API
Frameworks
SAFE-inspired factuality pipeline
Is Agentic

Yes

Architectures
LLM-as-a-judgesearch-augmented agentic evaluator

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

Datasets are relatively small and English-only, limiting generalization.

Long-form factuality check was run on one representative model only.

When Not To Use

If you need multilingual adversarial testing (benchmark is English-only).

As the sole safety check for production-critical decisions without expert review.

Failure Modes

Null outputs (empty/invalid responses)

Prompt echo (model repeats prompt instead of answering)

Core Entities

Models

Qwen3-4BQwen3-30BQwen3-Next-80BGPT-OSS-20BGPT-OSS-120BGPT-5GPT-5-mini

Metrics

AccuracyF1@KMean #factsMean #correct facts

Datasets

AdversaRiskQAHealthFCFALQUGPT-5 generated finance facts

Benchmarks

AdversaRiskQA