Combine multiple OCR engines + two LLMs and pick the best JSON by majority voting to boost invoice OCR accuracy and speed.

December 23, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

Results are promising but based on a small (100-image) invoice dataset and free-tier API constraints; verify on larger, real-world data before production rollout.

Citations: 1

Evidence Strength: 0.55

Confidence: 0.60

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/2

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Osama Abdellatif, Ahmed Ayman, Ali Hamdi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Combining multiple OCR engines with LLM-based JSON conversion and majority voting can cut extraction errors and improve throughput for invoice automation, reducing manual fixes and speeding up batch processing.

Who Should Care

Summary TLDR

LMV-RPA is a production-style pipeline that runs four OCR engines (PaddleOCR, Tesseract, EasyOCR, DocTR), sends each OCR output to two Large Language Models (LLMs) for conversion to JSON, and selects the final structured result by majority voting. On a 100-image invoice test set, the authors report 99% extraction accuracy versus a 94% baseline, and an average runtime of 121.27s versus 212.33s for UiPath and 217.80s for Automation Anywhere. The study is promising but uses a small, invoice-focused dataset and constrained free-tier APIs, so validate on larger data before deploying at scale.

Problem Statement

Standard OCR in RPA struggles with ambiguous characters and complex layouts. Single-engine OCR often misreads noisy or varied invoices and manual fixes are expensive. The paper seeks a reliable, automated pipeline that both improves accuracy of extracted fields and outputs structured JSON for downstream RPA tasks.

Main Contribution

LMV-RPA pipeline: multi-OCR (PaddleOCR, Tesseract, EasyOCR, DocTR) + two LLMs to convert each OCR output to JSON, then majority-vote across JSON outputs.

Empirical comparison showing higher extraction accuracy (reported 99% vs 94%) on a 100-image invoice dataset.
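
The voting step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each OCR+LLM combination yields one JSON object and that votes are cast on the whole document after key order is normalized. The field names (`invoice_no`, `total`) and the sample values are hypothetical.

```python
import json
from collections import Counter


def canonical(doc: dict) -> str:
    """Serialize with sorted keys so documents with equal content compare equal."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":"))


def majority_vote(candidates: list[dict]) -> tuple[dict, int]:
    """Return the most frequent JSON document and the number of votes it received."""
    if not candidates:
        raise ValueError("no candidates to vote on")
    counts = Counter(canonical(c) for c in candidates)
    winner, votes = counts.most_common(1)[0]
    return json.loads(winner), votes


# Hypothetical outputs: three agree (key order differs), one has an O/0 misread.
candidates = [
    {"invoice_no": "INV-001", "total": "120.50"},
    {"total": "120.50", "invoice_no": "INV-001"},
    {"invoice_no": "INV-001", "total": "120.50"},
    {"invoice_no": "INV-OO1", "total": "120.50"},
]
best, votes = majority_vote(candidates)
print(best, votes)
```

Whole-document voting is strict: a single differing field makes a candidate a separate option. Voting per field is a common alternative when candidates rarely match exactly.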

Key Findings

LMV-RPA achieved higher extraction accuracy than the baseline.

Numbers: 99% vs 94% (reported on 100 invoices)

Practical Use: Use multi-engine OCR plus LLM-based JSON conversion and voting to reduce field extraction errors in invoice pipelines.

Evidence Ref: Table 2 (Accuracy comparison)

LMV-RPA ran faster, on average, than two commercial RPA platforms in authors' tests.

Numbers: Average runtime 121.27s vs UiPath 212.33s and Automation Anywhere 217.80s

Practical Use: The pipeline can lower end-to-end processing time under similar hardware and dataset constraints, helping throughput for batch invoice jobs.

Evidence Ref: Table 1 (Average Run Time)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 99% | Traditional (1 OCR + 1-layer LLM) | +5 percentage points | 100-image invoice test set | Table 2 reports 99% for LMV-RPA vs 94% for traditional model | Table 2 |
| Average runtime | 121.27 sec | UiPath 212.33 sec; Automation Anywhere 217.80 sec | ~43–44% faster than UiPath/Automation Anywhere | Benchmark tasks (same hardware, paper's test) | Table 1 reports runtimes per platform | Table 1 |
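
The runtime delta can be rechecked from the reported averages. The short calculation below uses only the numbers quoted above; `percent_less_time` is a helper name introduced here for illustration.

```python
def percent_less_time(ours: float, baseline: float) -> float:
    """Relative reduction in runtime versus a baseline, in percent."""
    return (baseline - ours) / baseline * 100


lmv_rpa = 121.27  # seconds, reported average
baselines = {"UiPath": 212.33, "Automation Anywhere": 217.80}

for name, seconds in baselines.items():
    print(f"{name}: {percent_less_time(lmv_rpa, seconds):.1f}% less time")
```

This gives roughly 42.9% against UiPath and 44.3% against Automation Anywhere, which matches the rounded 43–44% range.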

What To Try In 7 Days

Run LMV-RPA on a small sample of your invoices (50–200) and compare field-level accuracy to your current tool.

Log per-file runtimes to check throughput under your hardware and adjust polling interval.

Test failure cases: low-quality scans, unusual layouts, and non-invoice documents to map limitations.
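
A small harness is enough for the first two checks. The sketch below is one possible way to score field-level accuracy against hand-labeled ground truth and to time each file. The exact-match rule after whitespace and case normalization is an assumption (the paper's scoring rule is not described here), and the field names and sample values are hypothetical.

```python
import time


def normalize(value) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences do not count as errors."""
    return " ".join(str(value).split()).lower()


def field_accuracy(predicted: list[dict], truth: list[dict]) -> float:
    """Fraction of ground-truth fields whose predicted value matches after normalization."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must cover the same invoices")
    correct = total = 0
    for pred, gold in zip(predicted, truth):
        for key, expected in gold.items():
            total += 1
            if key in pred and normalize(pred[key]) == normalize(expected):
                correct += 1
    return correct / total if total else 0.0


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds) for per-file runtime logs."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical labels and outputs for two invoices.
truth = [{"invoice_no": "INV-001", "total": "120.50"},
         {"invoice_no": "INV-002", "total": "89.00"}]
current_tool = [{"invoice_no": "INV-001", "total": "120.50"},
                {"invoice_no": "INV-0O2", "total": "89.00"}]
candidate = [{"invoice_no": "INV-001", "total": "120.50"},
             {"invoice_no": "INV-002", "total": "89.00"}]

print(field_accuracy(current_tool, truth), field_accuracy(candidate, truth))
```

Scoring both tools against the same labeled sample keeps the comparison fair. Wrapping each per-file call in `timed` gives the runtime log for the throughput check.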

Agent Features

Memory
Short-term file tracking (seen vs new files)
Planning
Simple event-driven monitoring (detect new files and process)
Tool Use
Uses multiple OCR engines and LLMs as tools
Frameworks
Custom RPA pipeline (LMV-RPA)
Is Agentic

Yes

Architectures
Pipeline: multi-OCR → 2 LLMs → majority voting
Continuous directory-watching loop
Collaboration
Voting across independent OCR+LLM outputs
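
The directory-watching behavior with seen-vs-new file tracking could look like the sketch below. It is a plain polling loop under stated assumptions: the accepted suffixes, the `poll_seconds` default, and the `max_polls` stop condition are illustrative choices, not values from the paper, and `process` stands in for the OCR → LLM → voting pipeline.

```python
import tempfile
import time
from pathlib import Path
from typing import Callable, Optional

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".pdf"}  # assumed input types


def find_new_files(folder: Path, seen: set) -> list:
    """Return files not processed before and remember them (short-term memory)."""
    new = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES and p.name not in seen
    )
    seen.update(p.name for p in new)
    return new


def watch(folder: Path, process: Callable[[Path], None],
          poll_seconds: float = 5.0, max_polls: Optional[int] = None) -> None:
    """Poll the folder and hand each new file to `process`; runs forever if max_polls is None."""
    seen: set = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in find_new_files(folder, seen):
            process(path)
        polls += 1
        time.sleep(poll_seconds)


# Demo with a temporary folder and a stand-in processing step.
processed = []
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "invoice_001.png").touch()
    (folder / "notes.txt").touch()  # ignored: not an assumed input type
    watch(folder, lambda p: processed.append(p.name), poll_seconds=0, max_polls=2)
print(processed)
```

The second poll finds nothing new, so each file is processed once. A production loop would also need to persist `seen` across restarts and to handle files that are still being written.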

Optimization Features

System Optimization
Asynchronous multi-engine processing to shorten end-to-end time
Inference Optimization
Parallel OCR engines to improve robustness
Majority voting reduces need for heavier single-model correction
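
Running the engines concurrently is what shortens wall-clock time. The sketch below uses a thread pool as one way to do this; the source only says "asynchronous", so the threading choice is an assumption. The engine callables are stand-ins, not real PaddleOCR, Tesseract, EasyOCR, or DocTR calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def run_engines(image_path: str,
                engines: dict[str, Callable[[str], str]]) -> dict[str, Optional[str]]:
    """Run every OCR callable on the same image concurrently.

    A failed engine is recorded as None so the others can still vote.
    """
    if not engines:
        return {}
    results: dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, image_path) for name, fn in engines.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                results[name] = None
    return results


def broken_engine(path: str) -> str:
    """Stand-in for an engine that crashes on this input."""
    raise RuntimeError("engine unavailable")


# Stand-in engines; replace with wrappers around real OCR libraries.
engines = {
    "engine_a": lambda path: f"text from A for {path}",
    "engine_b": broken_engine,
}
results = run_engines("invoice_001.png", engines)
print(results)
```

Threads suit engines that spend their time in native code or network calls. CPU-bound pure-Python work would need processes instead.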

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Code URLs

Repo

Data URLs

data set

Risks & Boundaries

Limitations

Small dataset (100 images) focused on invoices only.

Authors used free-tier OCR/APIs and added a 5s delay, which may alter runtime behavior.

When Not To Use

For non-invoice or very different document types without retesting.

When strict, provable privacy constraints forbid sending text to external LLMs.

Failure Modes

Majority voting can reinforce a common OCR misread if all engines err the same way.

LLMs may alter or hallucinate critical field text when converting to JSON.
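
Two cheap guards can flag some of these cases for human review. The sketch below is an illustration, not part of the paper: it marks results with weak consensus and field values that do not appear in any raw OCR text, which suggests the LLM altered them. The `min_agreement` threshold and the field names are assumptions. Neither check catches a misread shared by every engine.

```python
def is_grounded(value: str, ocr_texts: list[str]) -> bool:
    """True if the value appears in at least one OCR text, ignoring case and whitespace."""
    needle = "".join(value.split()).lower()
    if not needle:
        return False
    return any(needle in "".join(text.split()).lower() for text in ocr_texts)


def review_flags(fields: dict, ocr_texts: list[str], votes: int,
                 n_candidates: int, min_agreement: float = 0.5) -> list[str]:
    """Return reasons this result should go to a human reviewer."""
    flags = []
    if n_candidates == 0 or votes / n_candidates <= min_agreement:
        flags.append("low_consensus")
    for key, value in fields.items():
        if not is_grounded(str(value), ocr_texts):
            flags.append(f"ungrounded:{key}")
    return flags


# Hypothetical case: the winning JSON carries a total that no OCR engine produced.
ocr_texts = ["Invoice No: INV-001 Total 120.50",
             "Invoice No: INV-OO1 Total 120.50"]
fields = {"invoice_no": "INV-001", "total": "125.00"}
flags = review_flags(fields, ocr_texts, votes=5, n_candidates=8)
print(flags)
```

The grounding check is deliberately strict. Fields the LLM legitimately reformats, such as dates, would need a normalization step before comparison.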

Core Entities

Models

LLaMA 3, Gemini-1.5-pro, PaddleOCR, Tesseract, EasyOCR, DocTR

Metrics

Accuracy, average runtime (seconds)

Datasets

100-image invoice dataset (Kaggle/Roboflow/Kozlowski samples)

Kaggle invoice-like images

Roboflow invoice_data

Kozlowski Samples of Electronic Invoices