Overview
The scores reflect a practical, ready-to-use benchmark with clear empirical signals, but real deployments will still need per-domain tuning and enough compute for long-context and LLM evaluations.
Citations: 1
Evidence Strength: 0.85
Confidence: 0.87
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 45%
Production readiness: 65%
Novelty: 55%
Why It Matters For Business
If your product answers questions over real PDFs or financial reports, parsing and retrieval choices can change accuracy by tens of points; invest in indexing and retrieval before scaling model size.
Summary TLDR
The authors introduce UDA, a realistic benchmark of 2,965 unstructured documents and 29,590 expert Q&A pairs across finance, academic papers, and web/wiki pages. UDA tests end-to-end Retrieval-Augmented Generation (RAG) pipelines on messy PDFs and HTML that mix raw text, tables, and numeric reasoning. Main findings: parsing and retrieval quality strongly affect numeric tasks; GPT-4-style models lead overall; raw-text extraction is often competitive with vision-based parsing; Chain-of-Thought (CoT) prompting helps arithmetic QA; and RAG beats naive long-context LLMs on financial reasoning.
Problem Statement
Existing RAG and document-QA benchmarks often use clean, pre-segmented inputs. Real-world documents (PDFs, HTML) are long, irregular and contain tables. There is no large, end-to-end benchmark that measures parsing, retrieval, and generation together on real unstructured documents.
Main Contribution
UDA dataset: 2,965 raw documents + 29,590 expert Q&A pairs across finance, papers and world-knowledge.
A modular benchmark that measures parsing, indexing, retrieval and generation choices in RAG pipelines.
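
To make the "modular" idea concrete, here is a minimal sketch of a RAG pipeline whose parsing, chunking, retrieval, and generation stages can be swapped independently. It is not the UDA codebase: the class `RagPipeline` and the helper functions are hypothetical names chosen for illustration, and the generator is a placeholder standing in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RagPipeline:
    """Hypothetical container for four swappable stages (not the UDA API)."""
    parse: Callable[[str], str]                            # raw document -> plain text
    chunk: Callable[[str], List[str]]                      # plain text -> chunks
    retrieve: Callable[[str, List[str], int], List[str]]   # question, chunks, k -> top-k
    generate: Callable[[str, List[str]], str]              # question, context -> answer

    def answer(self, raw_document: str, question: str, top_k: int = 5) -> str:
        text = self.parse(raw_document)
        chunks = self.chunk(text)
        context = self.retrieve(question, chunks, top_k)
        return self.generate(question, context)


def split_on_blank_lines(text: str) -> List[str]:
    # Simplest possible chunker: one chunk per paragraph.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def keyword_overlap_retrieve(question: str, chunks: List[str], k: int) -> List[str]:
    # Toy retriever: rank chunks by the number of shared lowercase words.
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def echo_generate(question: str, context: List[str]) -> str:
    # Placeholder for an LLM call: returns the best-ranked chunk verbatim.
    return context[0] if context else ""


pipeline = RagPipeline(
    parse=str.strip,
    chunk=split_on_blank_lines,
    retrieve=keyword_overlap_retrieve,
    generate=echo_generate,
)
```

Because each stage is a plain function, comparing two parsers or two retrievers means changing one argument and rerunning the same questions, which is the kind of controlled comparison the benchmark is built around.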
Key Findings
UDA contains 2,965 raw documents and 29,590 expert-annotated Q&A pairs.
GPT-4-Omni parsing achieves the highest table-based QA scores; raw-text extraction is competitive.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| FinHybrid EM (GPT-4-Turbo with parsing) | 72.4 | Raw-text 68.0 | +4.4 | FinHybrid (table parsing, Table 5) | GPT-4-Omni parsing yields 72.4 EM vs raw-text 68.0 (Table 5) | Table 5 |
| Evidence retrieval @1 LCS, BM25 (FinHybrid) | 65.6 | OpenAI embedding 57.2 | +8.4 | FinHybrid retrieval (Table 6) | BM25 @1 LCS 65.6 vs OpenAI 57.2 (Table 6) | Table 6 |
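
The retrieval row above reports an LCS-based evidence score at rank 1. The sketch below shows one plausible way to compute such a score: the token-level longest common subsequence between a retrieved chunk and the gold evidence, divided by the evidence length. The exact definition used in the paper may differ; the normalization, the whitespace tokenization, and the choice to take the best chunk among the top k are all assumptions made here for illustration.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_recall(retrieved_chunk: str, gold_evidence: str) -> float:
    # Assumption: lowercase whitespace tokens, normalized by evidence length.
    gold = gold_evidence.lower().split()
    if not gold:
        return 0.0
    return lcs_length(retrieved_chunk.lower().split(), gold) / len(gold)


def evidence_score_at_k(ranked_chunks: List[str], gold_evidence: str, k: int) -> float:
    # Assumption: score the best-matching chunk among the top k.
    return max((lcs_recall(c, gold_evidence) for c in ranked_chunks[:k]), default=0.0)
```

A score of 1.0 means every evidence token appears, in order, somewhere in a retrieved chunk; partial scores indicate that the chunk boundary or the ranking cut off part of the evidence.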
What To Try In 7 Days
Run raw-text extraction + OpenAI text-embedding-3-large + top-5 retrieval; measure EM/F1 on a small sample.
Compare BM25 vs dense embeddings on your finance docs; pick BM25 if keyword/date matching wins.
Enable Chain-of-Thought prompts for numeric QA and compare with basic prompts.
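
For the retrieval comparison and the EM/F1 measurement suggested above, a small dependency-free harness can be enough to get a first signal. The sketch below implements a standard BM25 ranking and token-level exact match and F1. The tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and all function names are assumptions for illustration; the benchmark's own scoring scripts may normalize answers differently.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Keep decimals such as "12.5" as one token; otherwise split on letters.
    return re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower())


def bm25_rank(query: str, chunks: List[str], k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return chunk indices sorted from most to least relevant under BM25."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def exact_match(prediction: str, gold: str) -> float:
    return float(tokenize(prediction) == tokenize(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred, ref = tokenize(prediction), tokenize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

To compare against dense embeddings, rank the same chunks with your embedding model, feed the top 5 from each ranking to the same generator, and average `exact_match` and `token_f1` over a small labeled sample.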
Risks & Boundaries
Limitations
There is no direct, standardized evaluation of parsed content quality; downstream Q&A accuracy is used as a proxy (Section 6).
Noise sensitivity and hallucination are not analyzed in depth and are left to future work (Section 6).
When Not To Use
If you need rigorous per-token parsing quality metrics: this benchmark assesses end-to-end QA, not a breakdown of parser errors.
If your documents regularly exceed model context limits and you lack good retrieval indexes: long-context LLMs may fail.
Failure Modes
CV-based table parsing fails on irregular or non-standard table layouts, producing bad downstream answers (Section 4.1).
Long-context LLMs can miss precise numeric facts in very long documents and underperform RAG on arithmetic tasks (Section 4.3).

