Overview
The benchmark and evaluators are practical and reproducible; judge agrees with humans 90–95%, but datasets are small, English-only, and some methods depend on costly web-based evaluation.
Citations0
Evidence Strength0.80
Confidence0.85
Risk Signals11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
AdversaRiskQA helps reveal whether an LLM will accept confidently stated falsehoods in health, finance, or law—areas where mistakes have real consequences. Use it to catch domain-specific weaknesses before deployment.
Who Should Care
Summary TLDR
This paper introduces AdversaRiskQA, a new adversarial factuality benchmark for high-risk domains (health, finance, law). Each domain has basic and advanced items. The authors provide two automated evaluators (an LLM-as-judge and a search-augmented agentic factuality checker) and test six models (Qwen3 series, GPT-OSS 20B/120B, GPT-5). On cleaned data, Qwen3-Next-80B scores 94.7% mean accuracy and GPT-5 91.4% (Table 4). Long-form checks on Qwen3-30B show high F1@8 (≈89–92) and a negligible overall drop in correct facts under adversarial injection (mean -0.05). Code and prompts are open-sourced.
Problem Statement
There is no domain-specific, high-quality benchmark that tests how LLMs handle confidently framed, injected misinformation (adversarial factuality) in high-risk areas or how such injections affect long-form factuality.
Main Contribution
AdversaRiskQA: a three-domain (health, finance, law) adversarial QA benchmark with basic and advanced splits.
Two automated evaluators: an LLM-as-judge template and a search-augmented agentic long-form factuality checker adapted from SAFE.
Key Findings
Top open model (Qwen3-Next-80B) achieves high accuracy after filtering invalid outputs.
GPT-5 performs strongly across domains.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 94.7% | — | — | All domains (Table 4) | Qwen3-Next-80B filtered mean shown in Table 4 | Table 4 |
| Accuracy | 91.4% | — | — | All domains (Table 4) | GPT-5 filtered mean shown in Table 4 | Table 4 |
What To Try In 7 Days
Run AdversaRiskQA on your model to find domain blind spots and failure modes.
Filter out invalid outputs (null, prompt echo, template leakage) before scoring.
Add an LLM-as-judge plus a small manual review loop to scale factuality checks.
Agent Features
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Reproducibility
Risks & Boundaries
Limitations
Datasets are relatively small and English-only, limiting generalization.
Long-form factuality check was run on one representative model only.
When Not To Use
If you need multilingual adversarial testing (benchmark is English-only).
As the sole safety check for production-critical decisions without expert review.
Failure Modes
Null outputs (empty/invalid responses)
Prompt echo (model repeats prompt instead of answering)

