UDA: a 2,965-document benchmark to stress-test RAG on messy PDFs, tables and numeric queries

June 21, 2024 · 8 min

Overview

Production Readiness

0.65

Novelty Score

0.55

Cost Impact Score

0.45

Citation Count

1

Authors

Yulong Hui, Yao Lu, Huanchen Zhang

Links

Abstract / PDF

Why It Matters For Business

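If your product answers questions over real PDFs or financial reports, parsing and retrieval choices can change accuracy by more than 20 points (on FinHybrid, GPT-4 scores 45.9 with OpenAI-embedding retrieval versus 69.4 with human-annotated evidence). Invest in indexing and retrieval before scaling model size.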

Summary TLDR

The authors introduce UDA, a realistic benchmark of 2,965 unstructured documents and 29,590 expert Q&A pairs across finance, academic papers and web/wiki pages. UDA tests end-to-end Retrieval-Augmented Generation (RAG) pipelines on messy PDFs and HTML with raw text, tables and numeric reasoning. Main findings: good parsing and retrieval matter a lot for numeric tasks; GPT-4-style models lead overall; raw-text extraction is often competitive with vision parsing; Chain-of-Thought (CoT) helps arithmetic QA; RAG beats naive long-context LLMs on financial reasoning.

Problem Statement

Existing RAG and document-QA benchmarks often use clean, pre-segmented inputs. Real-world documents (PDFs, HTML) are long, irregular and contain tables. There is no large, end-to-end benchmark that measures parsing, retrieval, and generation together on real unstructured documents.

Main Contribution

UDA dataset: 2,965 raw documents + 29,590 expert Q&A pairs across finance, papers and world-knowledge.

A modular benchmark that measures parsing, indexing, retrieval and generation choices in RAG pipelines.

Empirical findings: parsing and retrieval choices strongly affect numeric/document QA; CoT helps arithmetic; RAG often outperforms long-context LLMs on numeric tasks.

Open resources: benchmark suite and code released on GitHub for reproducible evaluation.

Key Findings

UDA contains 2,965 raw documents and 29,590 expert-annotated Q&A pairs.

Numbers: 2,965 documents; 29,590 Q&A (paper §3, Table 2)

GPT-4-Omni parsing achieves the highest table-based QA scores; raw-text extraction is competitive.

Numbers: FinHybrid EM: GPT-4-Omni 72.4 vs Raw Text 68.0 (Table 5)

Retrieval model choice matters and depends on domain; BM25 beats dense embeddings for finance at top-1 evidence.

Numbers: FinHybrid @1: BM25 65.6 vs OpenAI 57.2 (Table 6)

Providing accurate context dramatically improves numeric QA; human-annotated evidence raises scores substantially.

Numbers: GPT-4 FinHybrid: OpenAI retrieval 45.9 → human-annotated 69.4 (51% relative improvement, Table 7)

RAG outperforms long-context LLMs on numeric/financial tasks; long-context can be comparable on free-form knowledge queries.

Numbers: Qwen-1.5-7B FinHybrid: RAG 21 vs Long Context 3; GPT-4 FinHybrid: RAG 43.4 vs Long Context 37.4 (Table 8)

Chain-of-Thought (CoT) improves numerical reasoning for several LLMs.

Numbers: Llama-3-8B FinHybrid: base 21.3 → CoT 37.9 (+16.6 points, Table 9)
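
The findings above mix two kinds of gain: a relative improvement (the 51% for human-annotated evidence) and an absolute gain in points (the +16.6 for CoT). A minimal sketch of both calculations, using only the figures quoted above; the function names are illustrative and not from the paper:

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0


def absolute_gain(before: float, after: float) -> float:
    """Absolute change in score points."""
    return after - before


# GPT-4 on FinHybrid: OpenAI retrieval vs human-annotated evidence (Table 7).
retrieval_gain = relative_improvement(45.9, 69.4)

# Llama-3-8B on FinHybrid: base prompt vs Chain-of-Thought (Table 9).
cot_gain = absolute_gain(21.3, 37.9)

print(f"{retrieval_gain:.1f}% relative, {cot_gain:.1f} points absolute")
```

Keeping the two apart matters when comparing rows: a jump of 16.6 points from a base of 21.3 is a much larger relative change than the same jump from a base of 60.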

Results

FinHybrid EM (GPT-4-Turbo with parsing)

Value: GPT-4-Omni parsing 72.4

Baseline: Raw-text 68.0

Retrieval evidence @1 LCS (FinHybrid)

Value: BM25 65.6

Baseline: OpenAI embedding 57.2

End-to-end FinHybrid (GPT-4-Turbo)

Value: RAG 45.9

Baseline: NoRAG 0.4

CoT improvement (Llama-3-8B on FinHybrid)

Value: base 21.3 → CoT 37.9

Baseline: base 21.3

Who Should Care

What To Try In 7 Days

Run raw-text extraction + OpenAI text-embedding-3-large + top-5 retrieval; measure EM/F1 on a small sample.

Compare BM25 vs dense embeddings on your finance docs; pick BM25 if keyword/date matching wins.

Enable Chain-of-Thought prompts for numeric QA and compare with basic prompts.
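
For the prompt comparison in the last step, one possible setup is a pair of templates that differ only in the reasoning instruction. This is a sketch under assumptions: the template wording, the `Answer:` marker and both function names are hypothetical and are not the prompts used in the paper.

```python
# Hypothetical templates; the paper's actual prompts may differ.
BASIC_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer with the final value only."
)

COT_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Think step by step: list the relevant figures, show the arithmetic, "
    "then give the result on a last line starting with 'Answer:'."
)


def build_prompt(question, chunks, use_cot):
    """Fill the basic or the CoT template with the retrieved chunks."""
    context = "\n---\n".join(chunks)
    template = COT_TEMPLATE if use_cot else BASIC_TEMPLATE
    return template.format(context=context, question=question)


def extract_answer(completion):
    """Return the text after the last 'Answer:' marker, else the whole completion."""
    marker = "Answer:"
    idx = completion.rfind(marker)
    if idx == -1:
        return completion.strip()
    return completion[idx + len(marker):].strip()
```

Running both variants on the same questions and the same retrieved chunks keeps the prompt as the only variable, so any score difference can be attributed to the CoT instruction.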

Optimization Features

Token Efficiency

  • Chunking into 3000-character segments with 10% overlap
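
The chunking setting above can be sketched as a fixed-size sliding window. This is a minimal illustration, assuming the 10% overlap means each chunk repeats the last 300 characters of the previous one; `chunk_text` and its parameters are illustrative names, not the benchmark's code.

```python
def chunk_text(text, chunk_size=3000, overlap_ratio=0.10):
    """Split `text` into character windows of `chunk_size` with a fixed overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 given the checks above

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, a 7,000-character document yields three chunks starting at offsets 0, 2,700 and 5,400.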

System Optimization

  • Choose sparse vs dense retrieval depending on domain (BM25 for precise finance keywords)
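
To make the sparse option concrete, here is a small self-contained BM25 scorer. It is a sketch, not the benchmark's retriever: the tokenizer is a naive lowercase alphanumeric split, `k1` and `b` use common default values, and the IDF is the non-negative variant `ln(1 + (N - n + 0.5) / (n + 0.5))`. All of these are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = set(tokenize(query))  # each query term counted once
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

Because the score depends on exact term overlap, tokens such as fiscal years or line-item names contribute directly, which is one plausible reason a sparse retriever can do well on finance documents.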

Inference Optimization

  • Use raw-text extraction to reduce heavy CV preprocessing
  • Limit retrieved chunks (top-5) to balance context and noise
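
Limiting retrieval to the top-5 chunks can be sketched with the standard library. The helper names and the `---` separator are hypothetical; the scores can come from any retriever, sparse or dense.

```python
import heapq


def top_k_chunks(chunks, scores, k=5):
    """Return up to k (score, chunk) pairs, highest score first."""
    if len(chunks) != len(scores):
        raise ValueError("chunks and scores must have the same length")
    # Rank indices by score so that chunk text is never compared directly.
    ranked = heapq.nlargest(k, zip(scores, range(len(chunks))), key=lambda p: p[0])
    return [(score, chunks[i]) for score, i in ranked]


def build_context(chunks, scores, k=5, separator="\n---\n"):
    """Join the k best chunks into a single context string for the LLM."""
    return separator.join(chunk for _, chunk in top_k_chunks(chunks, scores, k))
```

Raising `k` adds recall but also adds unrelated text to the prompt, so it is worth sweeping a few values on a held-out sample rather than fixing it in advance.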

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No direct, standardized evaluation of parsed content quality; downstream Q&A used as proxy (Section 6).
  • Noise sensitivity and hallucination are not analyzed in depth and will be future work (Section 6).
  • Long-context experiments are limited by inference cost and used a 600-doc subset (Section 4.3 and B.4).

When Not To Use

  • If you need rigorous per-token parsing quality metrics; this benchmark assesses end-to-end QA and does not break down parser errors.
  • If your documents regularly exceed model context limits and you have no good retrieval index (long-context LLMs may fail).
  • If you cannot afford the compute to evaluate long-context or GPT-4-Omni style parsing.

Failure Modes

  • CV-based table parsing fails on irregular or non-standard table layouts, producing bad downstream answers (Section 4.1).
  • Long-context LLMs can miss precise numeric facts in very long documents and underperform RAG on arithmetic tasks (Section 4.3).
  • Poor retrieval leads to large accuracy drops for numeric questions; accurate evidence selection is critical (Section 4.2).

Core Entities

Models

  • GPT-4-Turbo
  • GPT-4-Omni
  • GPT-3.5
  • Llama-3-8B
  • Llama-3-70B
  • Qwen-1.5-32B
  • Qwen-1.5-7B
  • Mixtral-8x7B
  • Mistral-7B
  • CodeLlama-7B
  • CodeLlama-13B

Metrics

  • Exact-Match (1% tolerance)
  • Span-level F1
  • Numeracy-focused F1
  • Longest Common Subsequence (LCS) ratio
  • LLM-based evaluator (0–4 normalized)
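
Three of the metrics above can be sketched in a few lines. These are assumptions about plausible definitions, not the paper's exact evaluation code: the tolerance is taken relative to the gold value, the LCS ratio is normalized by the reference length, and tokens are split on whitespace.

```python
from collections import Counter


def numeric_exact_match(pred, gold, rel_tol=0.01):
    """Exact match with a relative tolerance (1% of the gold value by default)."""
    if gold == 0:
        return pred == 0
    return abs(pred - gold) <= rel_tol * abs(gold)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(retrieved, reference):
    """Share of the reference evidence tokens recovered, in order, by the retrieved text."""
    ref = reference.split()
    if not ref:
        return 0.0
    return lcs_length(retrieved.split(), ref) / len(ref)


def token_f1(pred, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

The tolerance in the first function is what lets an answer such as 100.5 count as correct against a gold value of 100, while 102 does not.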

Datasets

  • FinHybrid
  • TatHybrid
  • PaperTab
  • PaperText
  • FetaTab
  • NqText
  • UDA (aggregate)

Benchmarks

  • UDA

Context Entities

Models

  • Azure OpenAI GPT4-Turbo-1106-Preview (128k)
  • Qwen-1.5-7B-32k