Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
AdversaRiskQA helps reveal whether an LLM will accept confidently stated falsehoods in health, finance, or law—areas where mistakes have real consequences. Use it to catch domain-specific weaknesses before deployment.
Summary TLDR
This paper introduces AdversaRiskQA, a new adversarial factuality benchmark for high-risk domains (health, finance, law). Each domain has basic and advanced items. The authors provide two automated evaluators (an LLM-as-judge and a search-augmented agentic factuality checker) and test six models (Qwen3 series, GPT-OSS 20B/120B, GPT-5). On cleaned data, Qwen3-Next-80B scores 94.7% mean accuracy and GPT-5 91.4% (Table 4). Long-form checks on Qwen3-30B show high F1@8 (≈89–92) and a negligible overall drop in correct facts under adversarial injection (mean -0.05). Code and prompts are open-sourced.
Problem Statement
There is no domain-specific, high-quality benchmark that tests how LLMs handle confidently framed, injected misinformation (adversarial factuality) in high-risk areas or how such injections affect long-form factuality.
Main Contribution
AdversaRiskQA: a three-domain (health, finance, law) adversarial QA benchmark with basic and advanced splits.
Two automated evaluators: an LLM-as-judge template and a search-augmented agentic long-form factuality checker adapted from SAFE.
Systematic evaluation of six models across sizes and families, plus manual validation to measure judge reliability.
Open-sourced prompts, datasets, and evaluation code for reproducibility.
Key Findings
Top open model (Qwen3-Next-80B) achieves high accuracy after filtering invalid outputs.
GPT-5 performs strongly across domains.
Adversarial injections have little consistent effect on long-form factuality for the tested model.
Automated judge is largely reliable versus humans.
Many errors result from invalid or failed generations rather than reasoning mistakes.
Results
Accuracy
Accuracy
Long-form factuality (Qwen3-30B F1@8)
Mean change in correct facts due to adversarial injection (Qwen3-30B)
LLM judge vs. human agreement
Who Should Care
What To Try In 7 Days
Run AdversaRiskQA on your model to find domain blind spots and failure modes.
Filter out invalid outputs (null, prompt echo, template leakage) before scoring.
Add an LLM-as-judge plus a small manual review loop to scale factuality checks.
Agent Features
Planning
- self-decomposition of long answers into facts
Tool Use
- web_search tool
- OpenAI Response API
Frameworks
- SAFE-inspired factuality pipeline
Is Agentic
true
Architectures
- LLM-as-a-judge
- search-augmented agentic evaluator
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Datasets are relatively small and English-only, limiting generalization.
- Long-form factuality check was run on one representative model only.
- Dataset creation and some evaluation steps used GPT-5, which may bias items or judging.
- Safety filters and failed generations removed many items; results depend on filtering choices.
When Not To Use
- If you need multilingual adversarial testing (benchmark is English-only).
- As the sole safety check for production-critical decisions without expert review.
- For model families not represented here without additional validation.
Failure Modes
- Null outputs (empty/invalid responses)
- Prompt echo (model repeats prompt instead of answering)
- Template leakage (model outputs system/instruction text)
- Safety-filter refusals causing truncated or missing answers
Core Entities
Models
- Qwen3-4B
- Qwen3-30B
- Qwen3-Next-80B
- GPT-OSS-20B
- GPT-OSS-120B
- GPT-5
- GPT-5-mini
Metrics
- Accuracy
- F1@K
- Mean #facts
- Mean #correct facts
Datasets
- AdversaRiskQA
- HealthFC
- FALQU
- GPT-5 generated finance facts
Benchmarks
- AdversaRiskQA

