UDA: a 2,965-document benchmark to stress-test RAG on messy PDFs, tables and numeric queries

Overview

Decision Snapshot: Ready For Pilot

Scores reflect a ready, practical benchmark and clear empirical signals, but real deployment needs per-domain tuning and compute for long-context/LLM evaluations.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.87

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 65%

Novelty: 55%

Authors

Yulong Hui, Yao Lu, Huanchen Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product answers questions over real PDFs or financial reports, parsing and retrieval choices can change accuracy by tens of points; invest in indexing and retrieval before scaling model size.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors introduce UDA, a realistic benchmark of 2,965 unstructured documents and 29,590 expert Q&A pairs across finance, academic papers and web/wiki pages. UDA tests end-to-end Retrieval-Augmented Generation (RAG) pipelines on messy PDFs and HTML with raw text, tables and numeric reasoning. Main findings: good parsing and retrieval matter a lot for numeric tasks; GPT-4-style models lead overall; raw-text extraction is often competitive with vision parsing; Chain-of-Thought (CoT) helps arithmetic QA; RAG beats naive long-context LLMs on financial reasoning.

Problem Statement

Existing RAG and document-QA benchmarks often use clean, pre-segmented inputs. Real-world documents (PDFs, HTML) are long, irregular and contain tables. There is no large, end-to-end benchmark that measures parsing, retrieval, and generation together on real unstructured documents.

Main Contribution

UDA dataset: 2,965 raw documents + 29,590 expert Q&A pairs across finance, papers and world-knowledge.

A modular benchmark that measures parsing, indexing, retrieval and generation choices in RAG pipelines.

Key Findings

UDA contains 2,965 raw documents and 29,590 expert-annotated Q&A pairs.

Numbers: 2,965 documents; 29,590 Q&A (paper §3, Table 2)

Practical Use: Use UDA to test end-to-end pipelines on realistic, unparsed docs instead of toy, pre-split datasets.

Evidence Ref: Section 3, Table 2

GPT-4-Omni parsing achieves the highest table-based QA scores; raw-text extraction is competitive.

Numbers: FinHybrid EM: GPT-4-Omni 72.4 vs Raw Text 68.0 (Table 5)

Practical Use: Start with raw-text extraction for speed; upgrade to GPT-4-Omni or validated parsing only if raw text misses key structure.

Evidence Ref: Section 4.1, Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
FinHybrid EM (GPT-4-Omni parsing)	72.4	Raw-text 68.0	+4.4	FinHybrid (table parsing, Table 5)	GPT-4-Omni parsing yields 72.4 EM vs raw-text 68.0 (Table 5)	Table 5
Retrieval evidence @1 LCS (FinHybrid)	BM25 65.6	OpenAI embedding 57.2	+8.4	FinHybrid retrieval (Table 6)	BM25 @1 LCS 65.6 vs OpenAI 57.2 (Table 6)	Table 6

What To Try In 7 Days

Run raw-text extraction + OpenAI text-embedding-3-large + top-5 retrieval; measure EM/F1 on a small sample.

Compare BM25 vs dense embeddings on your finance docs; pick BM25 if keyword/date matching wins.

Enable Chain-of-Thought prompts for numeric QA and compare with basic prompts.
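The basic-versus-CoT comparison in the last item can be set up with two prompt variants over the same retrieved chunks. The following is a minimal sketch: the prompt wording, the `Answer:` convention, and the helper names `build_prompt` and `extract_answer` are illustrative assumptions, not the templates used by the UDA authors.

```python
def build_prompt(question: str, chunks: list[str], chain_of_thought: bool = False) -> str:
    """Assemble a RAG prompt from retrieved chunks (hypothetical template)."""
    # Number the chunks so the model can refer back to them.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    if chain_of_thought:
        instruction = (
            "Work through the calculation step by step, citing the numbers "
            "you use, then give the final result on a last line starting "
            "with 'Answer:'."
        )
    else:
        instruction = "Reply with the final answer only."
    return f"Context:\n{context}\n\nQuestion: {question}\n\n{instruction}"


def extract_answer(completion: str) -> str:
    """Return the text after the last 'Answer:' line, or the whole completion."""
    for line in reversed(completion.strip().splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return completion.strip()
```

Send both variants to the same model, run `extract_answer` on each completion, and score the two sets of answers with the same metric so that only the prompt differs between runs.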

Optimization Features

Token Efficiency

Chunking into 3000-character segments with 10% overlap
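One way to read this setting is as a sliding character window of 3000 with a stride of 2700, so that consecutive chunks share 300 characters. The sketch below assumes exactly that interpretation; the function name `chunk_text` and its parameters are illustrative, and the authors' own splitter may handle boundaries differently.

```python
def chunk_text(text: str, chunk_size: int = 3000, overlap_ratio: float = 0.10) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # 2700 characters with the defaults
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

A 6000-character document yields three chunks under these defaults: two full windows and a shorter tail.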

System Optimization

Choose sparse vs dense retrieval depending on domain (BM25 for precise finance keywords)
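To make the sparse side of that choice concrete, here is a small standard-library sketch of Okapi BM25 ranking over text chunks. It is not the retriever used in the paper: the tokenizer, the `k1` and `b` defaults, and the function names are assumptions chosen for illustration, and a production system would normally rely on a search library instead.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (simplistic)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, chunks: list[str], top_k: int = 5,
              k1: float = 1.5, b: float = 0.75) -> list[tuple[int, float]]:
    """Return (chunk_index, score) pairs for the top_k chunks by BM25 score."""
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: in how many chunks each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((idx, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

Exact tokens such as years, ticker symbols, and line-item names contribute directly to the score here, which is one plausible reason keyword matching does well on finance queries.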

Inference Optimization

Use raw-text extraction to reduce heavy CV preprocessing

Limit retrieved chunks (top-5) to balance context and noise

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/qinchuanhui/UDA-Benchmark

Data URLs

https://github.com/qinchuanhui/UDA-Benchmark

Risks & Boundaries

Limitations

No direct, standardized evaluation of parsed content quality; downstream Q&A used as proxy (Section 6).

Noise sensitivity and hallucination are not analyzed in depth and are left to future work (Section 6).

When Not To Use

If you need rigorous per-token parsing quality metrics: this benchmark assesses end-to-end QA, not parser error breakdown.

If your documents regularly exceed model context limits and you have no good retrieval index: long-context LLMs may fail.

Failure Modes

CV-based table parsing fails on irregular or non-standard table layouts, producing bad downstream answers (Section 4.1).

Long-context LLMs can miss precise numeric facts in very long documents and underperform RAG on arithmetic tasks (Section 4.3).

Core Entities

Models

GPT-4-Turbo, GPT-4-Omni, GPT-3.5, Llama-3-8B, Llama-3-70B, Qwen-1.5-32B, Qwen-1.5-7B, Mixtral-8x7B, Mistral-7B, CodeLlama-7B, CodeLlama-13B

Metrics

Exact-Match (1% tolerance), Span-level F1, Numeracy-focused F1, Longest Common Subsequence (LCS) ratio, LLM-based evaluator (0–4 normalized)
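Two of these metrics are simple enough to sketch directly. The code below assumes that the 1% tolerance is relative to the gold value and that the LCS ratio is computed over whitespace tokens and normalized by the length of the gold evidence; both are assumptions for illustration, and the benchmark's own scripts may normalize differently.

```python
def numeric_exact_match(predicted: float, gold: float, rel_tol: float = 0.01) -> bool:
    """True if the prediction is within rel_tol of the gold value."""
    if gold == 0:
        return predicted == 0
    return abs(predicted - gold) <= rel_tol * abs(gold)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(retrieved: str, evidence: str) -> float:
    """Share of gold-evidence tokens recovered, in order, by the retrieved text."""
    gold_tokens = evidence.lower().split()
    if not gold_tokens:
        return 0.0
    return lcs_length(retrieved.lower().split(), gold_tokens) / len(gold_tokens)
```

Under this definition a retrieved chunk that contains the full evidence sentence in order scores 1.0, while a chunk sharing no tokens with it scores 0.0.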

Datasets

FinHybrid, TatHybrid, PaperTab, PaperText, FetaTab, NqText, UDA (aggregate)

Benchmarks

UDA

Context Entities

Models

Azure OpenAI GPT4-Turbo-1106-Preview (128k), Qwen-1.5-7B-32k
