UDA: a 2,965-document benchmark to stress-test RAG on messy PDFs, tables and numeric queries

June 21, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Scores reflect a ready, practical benchmark and clear empirical signals, but real deployment needs per-domain tuning and compute for long-context/LLM evaluations.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.87

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 65%

Novelty: 55%

Authors

Yulong Hui, Yao Lu, Huanchen Zhang

Why It Matters For Business

If your product answers questions over real PDFs or financial reports, parsing and retrieval choices can change accuracy by tens of points; invest in indexing and retrieval before scaling model size.

Summary TLDR

The authors introduce UDA, a realistic benchmark of 2,965 unstructured documents and 29,590 expert Q&A pairs across finance, academic papers and web/wiki pages. UDA tests end-to-end Retrieval-Augmented Generation (RAG) pipelines on messy PDFs and HTML with raw text, tables and numeric reasoning. Main findings: good parsing and retrieval matter a lot for numeric tasks; GPT-4-style models lead overall; raw-text extraction is often competitive with vision parsing; Chain-of-Thought (CoT) helps arithmetic QA; RAG beats naive long-context LLMs on financial reasoning.

Problem Statement

Existing RAG and document-QA benchmarks often use clean, pre-segmented inputs. Real-world documents (PDFs, HTML) are long, irregular and contain tables. There is no large, end-to-end benchmark that measures parsing, retrieval, and generation together on real unstructured documents.

Main Contribution

UDA dataset: 2,965 raw documents + 29,590 expert Q&A pairs across finance, papers and world-knowledge.

A modular benchmark that measures parsing, indexing, retrieval and generation choices in RAG pipelines.

Key Findings

UDA contains 2,965 raw documents and 29,590 expert-annotated Q&A pairs.

Numbers: 2,965 documents; 29,590 Q&A pairs (paper §3, Table 2)

Practical Use: Use UDA to test end-to-end pipelines on realistic, unparsed documents instead of toy, pre-split datasets.

Evidence Ref: Section 3, Table 2

GPT-4-Omni parsing achieves the highest table-based QA scores; raw-text extraction is competitive.

Numbers: FinHybrid EM: GPT-4-Omni 72.4 vs Raw Text 68.0 (Table 5)

Practical Use: Start with raw-text extraction for speed; upgrade to GPT-4-Omni or validated parsing only if raw text misses key structure.

Evidence Ref: Section 4.1, Table 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
FinHybrid EM (GPT-4-Turbo with parsing) | 72.4 | Raw-text 68.0 | +4.4 | FinHybrid (table parsing, Table 5) | GPT-4-Omni parsing yields 72.4 EM vs raw-text 68.0 (Table 5) | Table 5
Retrieval evidence @1 LCS (FinHybrid) | BM25 65.6 | OpenAI embedding 57.2 | +8.4 | FinHybrid retrieval (Table 6) | BM25 @1 LCS 65.6 vs OpenAI 57.2 (Table 6) | Table 6

What To Try In 7 Days

Run raw-text extraction + OpenAI text-embedding-3-large + top-5 retrieval; measure EM/F1 on a small sample.

Compare BM25 vs dense embeddings on your finance docs; pick BM25 if keyword/date matching wins.

Enable Chain-of-Thought prompts for numeric QA and compare with basic prompts.
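
The BM25 side of the second experiment can be sketched with the standard library alone. This is a minimal Okapi BM25 scorer, not the implementation used in the paper: the class name `BM25`, the regex tokenizer, the defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are all illustrative assumptions, and the sample documents are made up.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase alphanumeric runs; real pipelines may differ.
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Minimal Okapi BM25 over a small in-memory list of chunks (sketch)."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(toks) for toks in self.doc_tokens]
        self.doc_lens = [len(toks) for toks in self.doc_tokens]
        # Fall back to 1.0 so empty corpora do not divide by zero.
        self.avg_len = (sum(self.doc_lens) / len(docs)) if docs else 1.0
        self.avg_len = self.avg_len or 1.0
        # Document frequency: number of chunks containing each term.
        df = Counter()
        for toks in self.doc_tokens:
            df.update(set(toks))
        n = len(docs)
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, i: int) -> float:
        freqs = self.doc_freqs[i]
        # Length normalization term shared by all query terms for chunk i.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[i] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def top_k(self, query: str, k: int = 5) -> list[int]:
        # Indices of the k highest-scoring chunks, best first.
        order = sorted(range(len(self.doc_tokens)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


if __name__ == "__main__":
    chunks = [
        "Net revenue in fiscal 2019 was 4.2 billion dollars.",
        "The board approved a new dividend policy.",
        "Operating expenses rose in fiscal 2018.",
    ]
    index = BM25(chunks)
    print(index.top_k("net revenue fiscal 2019", k=2))
```

Running the same queries through a dense-embedding retriever and comparing which chunks land in the top 5 gives a quick read on whether keyword and date matching dominates in your documents.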

Optimization Features

Token Efficiency
Chunking into 3000-character segments with 10% overlap
System Optimization
Choose sparse vs dense retrieval depending on domain (BM25 for precise finance keywords)
Inference Optimization
Use raw-text extraction to reduce heavy CV preprocessing
Limit retrieved chunks (top-5) to balance context and noise
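
The chunking setting above (3000-character segments with 10% overlap) can be illustrated with a short character-based splitter. This is a sketch under assumptions: the function name `chunk_text` is hypothetical, and the benchmark's own splitter may handle boundaries such as sentences or table rows differently.

```python
def chunk_text(text: str, chunk_size: int = 3000, overlap_ratio: float = 0.10) -> list[str]:
    """Split text into fixed-size character chunks that overlap by a ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)  # 300 characters at the defaults
    step = chunk_size - overlap                # each chunk starts 2700 characters later
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample = "x" * 7000
    print([len(c) for c in chunk_text(sample)])
```

With the defaults, consecutive chunks share 300 characters, so a figure that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.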

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

No direct, standardized evaluation of parsed content quality; downstream Q&A accuracy is used as a proxy (Section 6).

Noise sensitivity and hallucination are not analyzed in depth and are left to future work (Section 6).

When Not To Use

If you need rigorous per-token parsing quality metrics: the benchmark assesses end-to-end QA and does not break down parser errors.

If your documents regularly exceed model context limits and you have no good retrieval index: long-context LLMs may fail on them.

Failure Modes

CV-based table parsing fails on irregular or non-standard table layouts, producing bad downstream answers (Section 4.1).

Long-context LLMs can miss precise numeric facts in very long documents and underperform RAG on arithmetic tasks (Section 4.3).

Core Entities

Models

GPT-4-Turbo, GPT-4-Omni, GPT-3.5, Llama-3-8B, Llama-3-70B, Qwen-1.5-32B, Qwen-1.5-7B, Mixtral-8x7B, Mistral-7B, CodeLlama-7B, CodeLlama-13B

Metrics

Exact-Match (1% tolerance), Span-level F1, Numeracy-focused F1, Longest Common Subsequence (LCS) ratio, LLM-based evaluator (0–4 normalized)
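
Two of these metrics are easy to approximate when scoring a small sample of your own. The sketch below is one plausible reading, not the benchmark's scoring code: the function names, the number-parsing rules, the use of whitespace tokens for LCS, and the choice to normalize LCS by the evidence length are all assumptions.

```python
from typing import Optional


def parse_number(text: str) -> Optional[float]:
    # Assumed cleanup: drop thousands separators, a leading "$", a trailing "%".
    cleaned = text.strip().replace(",", "").lstrip("$").rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None


def numeric_exact_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Exact match where numeric answers may differ by up to rel_tol (1%)."""
    p, g = parse_number(pred), parse_number(gold)
    if p is None or g is None:
        # Non-numeric answers fall back to case-insensitive string equality.
        return pred.strip().lower() == gold.strip().lower()
    return abs(p - g) <= rel_tol * abs(g)


def lcs_length(a: list[str], b: list[str]) -> int:
    # Classic dynamic program, keeping only the previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(retrieved: str, evidence: str) -> float:
    """Share of evidence tokens recovered, in order, from the retrieved chunk."""
    ev = evidence.split()
    if not ev:
        return 0.0
    return lcs_length(retrieved.split(), ev) / len(ev)


if __name__ == "__main__":
    print(numeric_exact_match("72.4", "72.0"))
    print(lcs_ratio("total revenue was 4.2 billion in 2019", "revenue was 4.2 billion"))
```

A relative tolerance keeps rounding differences in computed ratios from counting as errors, while the LCS ratio rewards retrieved chunks that contain the annotated evidence in order even when surrounded by other text.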

Datasets

FinHybrid, TatHybrid, PaperTab, PaperText, FetaTab, NqText, UDA (aggregate)

Benchmarks

UDA

Context Entities

Models

Azure OpenAI GPT4-Turbo-1106-Preview (128k), Qwen-1.5-7B-32k