Overview
Production Readiness
0.65
Novelty Score
0.55
Cost Impact Score
0.45
Citation Count
1
Why It Matters For Business
If your product answers questions over real PDFs or financial reports, parsing and retrieval choices can change accuracy by tens of points; invest in indexing and retrieval before scaling model size.
Summary TLDR
The authors introduce UDA, a realistic benchmark of 2,965 unstructured documents and 29,590 expert Q&A pairs across finance, academic papers and web/wiki pages. UDA tests end-to-end Retrieval-Augmented Generation (RAG) pipelines on messy PDFs and HTML with raw text, tables and numeric reasoning. Main findings: good parsing and retrieval matter a lot for numeric tasks; GPT-4-style models lead overall; raw-text extraction is often competitive with vision parsing; Chain-of-Thought (CoT) helps arithmetic QA; RAG beats naive long-context LLMs on financial reasoning.
Problem Statement
Existing RAG and document-QA benchmarks often use clean, pre-segmented inputs. Real-world documents (PDFs, HTML) are long, irregular and contain tables. There is no large, end-to-end benchmark that measures parsing, retrieval, and generation together on real unstructured documents.
Main Contribution
UDA dataset: 2,965 raw documents + 29,590 expert Q&A pairs across finance, papers and world-knowledge.
A modular benchmark that measures parsing, indexing, retrieval and generation choices in RAG pipelines.
Empirical findings: parsing and retrieval choices strongly affect numeric/document QA; CoT helps arithmetic; RAG often outperforms long-context LLMs on numeric tasks.
Open resources: benchmark suite and code released on GitHub for reproducible evaluation.
Key Findings
UDA contains 2,965 raw documents and 29,590 expert-annotated Q&A pairs.
GPT-4-Omni parsing achieves the highest table-based QA scores; raw-text extraction is competitive.
Retrieval model choice matters and depends on domain; BM25 beats dense embeddings for finance at top-1 evidence.
Providing accurate context dramatically improves numeric QA; human-annotated evidence raises scores substantially.
RAG outperforms long-context LLMs on numeric/financial tasks; long-context can be comparable on free-form knowledge queries.
Chain-of-Thought (CoT) improves numerical reasoning for several LLMs.
Results
FinHybrid EM (GPT-4-Turbo with parsing)
Retrieval evidence @1 LCS (FinHybrid)
End-to-end FinHybrid (GPT-4-Turbo)
CoT improvement (Llama-3-8B on FinHybrid)
Who Should Care
What To Try In 7 Days
Run raw-text extraction + OpenAI text-embedding-3-large + top-5 retrieval; measure EM/F1 on a small sample.
Compare BM25 vs dense embeddings on your finance docs; pick BM25 if keyword/date matching wins.
Enable Chain-of-Thought prompts for numeric QA and compare with basic prompts.
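The basic versus Chain-of-Thought comparison suggested above can be set up with two prompt variants over the same retrieved chunks. The following is a minimal sketch only: the prompt wording, the `Answer:` convention, and the helper names `build_prompt` and `extract_answer` are illustrative assumptions, not the prompts used by the UDA authors.

```python
import re
from typing import List, Optional


def build_prompt(question: str, chunks: List[str], use_cot: bool = False) -> str:
    """Assemble a QA prompt from retrieved chunks (hypothetical wording)."""
    # Number the chunks so the model can refer back to them.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    if use_cot:
        instruction = (
            "Work through the calculation step by step, then finish with "
            "a line of the form 'Answer: <value>'."
        )
    else:
        instruction = "Reply with a single line of the form 'Answer: <value>'."
    return f"Context:\n{context}\n\nQuestion: {question}\n{instruction}"


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None
```

Keeping the final-answer format identical in both variants means the same scoring code can be applied to each, so any difference in EM/F1 comes from the reasoning instruction rather than from answer parsing.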
Optimization Features
Token Efficiency
- Chunking into 3000-character segments with 10% overlap
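The chunking setting above can be sketched as a fixed-size character window that advances by 90% of its length. This is an illustrative implementation under that reading; the function name `chunk_text` and its parameters are assumptions, and the benchmark's own code may split text differently (for example, on sentence or paragraph boundaries).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 3000, overlap_ratio: float = 0.10) -> List[str]:
    """Split text into character windows that overlap by a fixed fraction."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)  # 300 characters by default
    step = chunk_size - overlap                # 2700 characters by default
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, consecutive chunks share 300 characters, so a table row or sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.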
System Optimization
- Choose sparse vs dense retrieval depending on domain (BM25 for precise finance keywords)
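To try the sparse side of this comparison without extra dependencies, a small Okapi BM25 scorer is enough. The sketch below is a generic textbook formulation with whitespace tokenization; the function names and the `k1`, `b` defaults are assumptions, not the configuration used in the paper.

```python
import math
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (deliberately simple)."""
    return text.lower().split()


def bm25_rank(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75,
              top_k: int = 5) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs for the top_k documents by BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    query_terms = tokenize(query)
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append((i, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

Running the same queries through this scorer and through a dense embedding index, then comparing which one places the gold evidence at rank 1, gives a quick domain-specific signal before committing to either approach.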
Inference Optimization
- Use raw-text extraction to reduce heavy CV preprocessing
- Limit retrieved chunks (top-5) to balance context and noise
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- No direct, standardized evaluation of parsed content quality; downstream Q&A used as proxy (Section 6).
- Noise sensitivity and hallucination are not analyzed in depth and are left to future work (Section 6).
- Long-context experiments were limited by inference cost and used a 600-document subset (Section 4.3 and B.4).
When Not To Use
- If you need rigorous per-token parsing quality metrics: this benchmark assesses end-to-end QA, not a breakdown of parser errors.
- If your documents regularly exceed model context limits and you have no good retrieval index: long-context LLMs may fail.
- If you cannot afford the compute to evaluate long-context or GPT-4-Omni style parsing.
Failure Modes
- CV-based table parsing fails on irregular or non-standard table layouts, producing bad downstream answers (Section 4.1).
- Long-context LLMs can miss precise numeric facts in very long documents and underperform RAG on arithmetic tasks (Section 4.3).
- Poor retrieval leads to large accuracy drops for numeric questions; accurate evidence selection is critical (Section 4.2).
Core Entities
Models
- GPT-4-Turbo
- GPT-4-Omni
- GPT-3.5
- Llama-3-8B
- Llama-3-70B
- Qwen-1.5-32B
- Qwen-1.5-7B
- Mixtral-8x7B
- Mistral-7B
- CodeLlama-7B
- CodeLlama-13B
Metrics
- Exact-Match (1% tolerance)
- Span-level F1
- Numeracy-focused F1
- Longest Common Subsequence (LCS) ratio
- LLM-based evaluator (0–4 normalized)
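Two of the metrics above are easy to prototype locally. The sketch below assumes that the 1% tolerance is a relative tolerance on numeric answers and that the LCS ratio is computed over whitespace tokens and normalized by the length of the gold evidence; both are assumptions made for illustration, and the benchmark's released code should be consulted for the exact definitions.

```python
from typing import List, Optional


def parse_number(s: str) -> Optional[float]:
    """Parse a numeric answer, ignoring commas, '$' and a trailing '%'."""
    cleaned = s.strip().replace(",", "").replace("$", "").rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None


def exact_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Numeric answers match within rel_tol; other answers compare as text."""
    p, g = parse_number(pred), parse_number(gold)
    if p is not None and g is not None:
        if g == 0:
            return p == 0
        return abs(p - g) <= rel_tol * abs(g)
    return pred.strip().lower() == gold.strip().lower()


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(retrieved: str, evidence: str) -> float:
    """Fraction of gold-evidence tokens recovered, in order, by the retrieved text."""
    ev = evidence.split()
    if not ev:
        return 0.0
    return lcs_length(retrieved.split(), ev) / len(ev)
```

Under these assumptions, a prediction of "1,005" counts as a match for a gold answer of "1000", while "1020" does not.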
Datasets
- FinHybrid
- TatHybrid
- PaperTab
- PaperText
- FetaTab
- NqText
- UDA (aggregate)
Benchmarks
- UDA
Context Entities
Models
- Azure OpenAI GPT4-Turbo-1106-Preview (128k)
- Qwen-1.5-7B-32k

