AdversaRiskQA: adversarial factuality benchmark for health, finance, and law

Overview

Decision SnapshotNeeds Validation

The benchmark and evaluators are practical and reproducible; judge agrees with humans 90–95%, but datasets are small, English-only, and some methods depend on costly web-based evaluation.

Citations0

Evidence Strength0.80

Confidence0.85

Risk Signals11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 70%

Novelty: 60%

Authors

Adam Szelestey, Sofie van Engelen, Tianhao Huang, Justin Snelders, Qintao Zeng, Songgaojun Deng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AdversaRiskQA helps reveal whether an LLM will accept confidently stated falsehoods in health, finance, or law—areas where mistakes have real consequences. Use it to catch domain-specific weaknesses before deployment.

Who Should Care

ML Engineer Product Manager CTO Data Scientist

Summary TLDR

This paper introduces AdversaRiskQA, a new adversarial factuality benchmark for high-risk domains (health, finance, law). Each domain has basic and advanced items. The authors provide two automated evaluators (an LLM-as-judge and a search-augmented agentic factuality checker) and test six models (Qwen3 series, GPT-OSS 20B/120B, GPT-5). On cleaned data, Qwen3-Next-80B scores 94.7% mean accuracy and GPT-5 91.4% (Table 4). Long-form checks on Qwen3-30B show high F1@8 (≈89–92) and a negligible overall drop in correct facts under adversarial injection (mean -0.05). Code and prompts are open-sourced.

Problem Statement

There is no domain-specific, high-quality benchmark that tests how LLMs handle confidently framed, injected misinformation (adversarial factuality) in high-risk areas or how such injections affect long-form factuality.

Main Contribution

AdversaRiskQA: a three-domain (health, finance, law) adversarial QA benchmark with basic and advanced splits.

Two automated evaluators: an LLM-as-judge template and a search-augmented agentic long-form factuality checker adapted from SAFE.

Key Findings

Top open model (Qwen3-Next-80B) achieves high accuracy after filtering invalid outputs.

NumbersFiltered mean accuracy = 94.7% (Table 4)

Practical UseRun adversarial checks and filter invalid responses first; strong open models can reach >90% correction on curated adversarial QA.

Evidence RefTable 4

GPT-5 performs strongly across domains.

NumbersFiltered mean accuracy = 91.4% (Table 4)

Practical UseClosed‑source SOTA models remain reliable defenders against confident misinformation but still need domain checks.

Evidence RefTable 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	94.7%	—	—	All domains (Table 4)	Qwen3-Next-80B filtered mean shown in Table 4	Table 4
Accuracy	91.4%	—	—	All domains (Table 4)	GPT-5 filtered mean shown in Table 4	Table 4

What To Try In 7 Days

Run AdversaRiskQA on your model to find domain blind spots and failure modes.

Filter out invalid outputs (null, prompt echo, template leakage) before scoring.

Add an LLM-as-judge plus a small manual review loop to scale factuality checks.

Agent Features

Planning

self-decomposition of long answers into facts

Tool Use

web_search toolOpenAI Response API

Frameworks

SAFE-inspired factuality pipeline

Is Agentic

Yes

Architectures

LLM-as-a-judgesearch-augmented agentic evaluator

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://anonymous.4open.science/r/AdversaRiskQA-8E57

Data URLs

https://anonymous.4open.science/r/AdversaRiskQA-8E57

Risks & Boundaries

Limitations

Datasets are relatively small and English-only, limiting generalization.

Long-form factuality check was run on one representative model only.

When Not To Use

If you need multilingual adversarial testing (benchmark is English-only).

As the sole safety check for production-critical decisions without expert review.

Failure Modes

Null outputs (empty/invalid responses)

Prompt echo (model repeats prompt instead of answering)

Core Entities

Models

Qwen3-4BQwen3-30BQwen3-Next-80BGPT-OSS-20BGPT-OSS-120BGPT-5GPT-5-mini

Metrics

AccuracyF1@KMean #factsMean #correct facts

Datasets

AdversaRiskQAHealthFCFALQUGPT-5 generated finance facts

Benchmarks

AdversaRiskQA

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Top open model (Qwen3-Next-80B) achieves high accuracy after filtering invalid outputs.

GPT-5 performs strongly across domains.

Results

What To Try In 7 Days

Agent Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

Key finding

PanCanBench: 282 real patient questions + 3,130 expert rubrics to test LLM clinical completeness and factuality

Key finding

A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Key finding

Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

Key finding

MEMERAG: a native multilingual benchmark to evaluate RAG outputs and LLM-based evaluators

Key finding