Database RAG Papers — Parsed & Scored for Practitioners

A knowledge graph layer more than triples GPT-4 accuracy for enterprise QA (16.7% → 54.2%)

0.30

0.45

0.50

7

Adding a knowledge graph layer (ontology + mappings) substantially improves LLM answer accuracy on enterprise SQL: expect major gains for normalized schemas and KPI-style questions.

Key finding

Knowledge-graph context raised GPT-4 execution accuracy from 16.7% to 54.2%.

Numbers: SQL 16.7% → SPARQL 54.2% (Table 1)

Reorder table rows and fields to boost LLM prompt-cache reuse and cut latency/costs

0.80

0.60

0.80

6

If you run LLMs over tables in batches, reordering rows and fields can cut inference time and API bills materially by increasing prompt-cache reuse; it is a low-cost software change that often outperforms adding hardware.

Key finding

GGR reduces end-to-end LLM query latency by 1.5–3.4× vs. caching without reordering (Cache Original) on evaluated queries.

Numbers: 1.5–3.4× speedup (Sec 6.2; Fig 3/4)
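
The reordering idea can be illustrated with a minimal sketch. This is not the paper's GGR algorithm; it is a simple stand-in heuristic that puts low-cardinality fields first and sorts rows so that consecutive serialized rows share long prefixes. The function names (`reorder_for_prefix_reuse`, `serialize`, `shared_prefix_chars`) and the sample table are hypothetical, and character-level prefix overlap is only a rough proxy for what a real token-level prompt cache would reuse.

```python
def reorder_for_prefix_reuse(rows, fields):
    """Put fields with few distinct values first, then sort rows on that order."""
    # Fewer distinct values in a leading field means more rows share a prefix.
    ordered_fields = sorted(fields, key=lambda f: len({str(r[f]) for r in rows}))
    # Sorting rows on the reordered fields groups identical prefixes together.
    ordered_rows = sorted(rows, key=lambda r: tuple(str(r[f]) for f in ordered_fields))
    return ordered_fields, ordered_rows


def serialize(row, fields):
    """Render one row as the text that would be placed in the prompt."""
    return ", ".join(f"{f}={row[f]}" for f in fields)


def shared_prefix_chars(lines):
    """Sum of characters each line shares as a prefix with the line before it."""
    total = 0
    for prev, cur in zip(lines, lines[1:]):
        n = 0
        for a, b in zip(prev, cur):
            if a != b:
                break
            n += 1
        total += n
    return total


# Hypothetical sample table, for illustration only.
rows = [
    {"id": 1, "review": "great", "category": "books"},
    {"id": 2, "review": "bad", "category": "toys"},
    {"id": 3, "review": "great", "category": "books"},
    {"id": 4, "review": "ok", "category": "toys"},
]
fields = ["id", "review", "category"]

ordered_fields, ordered_rows = reorder_for_prefix_reuse(rows, fields)
before = shared_prefix_chars([serialize(r, fields) for r in rows])
after = shared_prefix_chars([serialize(r, ordered_fields) for r in ordered_rows])
print(ordered_fields, before, after)
```

Comparing `before` and `after` on your own tables gives a quick, rough signal of whether reordering is likely to help before you measure actual cache hit rates.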

Survey of LLM-based text-to-SQL with a focus on Retrieval-Augmented and Graph RAG solutions

0.60

0.70

0.50

2

RAG and especially Graph RAG let teams query unfamiliar databases without costly fine-tuning, but they increase system complexity and latency.

Key finding

RAG improves cross-domain generalization and zero-/few-shot performance by fetching schema-specific context.

Generate short, unique text 'knowledge clues' with an LLM and use them to look up documents for multi-modal queries.

0.60

0.70

2

You can replace multiple modality-specific retrievers with one LLM-based generative retriever that scales to millions of documents, improves precision, and needs only light fine-tuning, lowering engineering and data costs.

Key finding

GeMKR raises P@5 on OKVQA-GS112K to 49.1, beating ReViz-ICT (41.7).

Numbers: P@5: 49.1 vs 41.7 (Table 1)

GeneAgent: an LLM agent that queries biology databases to verify and improve gene‑set function explanations

0.70

0.60

2

GeneAgent reduces false functional claims by checking LLM outputs against curated biology databases, cutting manual validation time and producing more trustworthy gene‑set summaries for research pipelines.

Key finding

GeneAgent's generated names overlap more with reference names than GPT-4's, measured by n‑gram and longest-common-subsequence (ROUGE) scores on the evaluated datasets.

Numbers: ROUGE-1/ROUGE-L: 23.9% → 31.0% (MsigDB); ROUGE-2: 7.4% → 15.5%

Contrato360 2.0 — agent-orchestrated RAG + text-to-SQL Q&A for contract management

0.70

0.40

0.60

0

A small engineering effort that wires RAG, text-to-SQL, and lightweight agents gives contract teams fast, accurate answers across PDFs and contract databases without retraining LLMs, cutting manual search time.

Key finding

Direct document lookups returned correct answers on the evaluated benchmark questions.

Numbers: Table 1: direct questions show 10/10 correct for listed items

DB-GPT: open-source Python platform for flexible, private LLM-powered data interaction with multi-agent workflows

0.80

0.60

0.70

0

DB-GPT bundles LLMs, private model hosting, multi-agent workflows and RAG so teams can let non-experts query and analyze sensitive data without sending it to external APIs.

Key finding

DB-GPT provides an end-to-end stack combining multi-agent workflows, RAG, AWEL and private model management.

Fine-tune LLMs with execution plans and RL to rewrite SQL that runs faster while staying correct.

0.70

0.65

0.70

0

E3-Rewrite can cut query costs by reducing runtime for heavy analytical queries while keeping results correct. That directly lowers compute bills and improves interactive analytics latency for complex queries.

Key finding

E3-Rewrite sharply reduces average query latency on TPC-H.

Numbers: Avg latency 78.81 s → 29.67 s (Original vs E3-Rewrite Qwen, Table 1)

Route simple queries straight to fast tools; use memory + planner only for complex job-career requests to cut latency and improve accuracy.

0.70

0.45

0.60

0

You can keep advanced agentic reasoning for hard requests while giving fast answers for common lookups. That reduces user wait time and session rounds, which likely raises engagement and lowers operational cost.

Key finding

AdaptJobRec cuts average response latency by about half compared to a RAG baseline among pilot users.

Numbers: Latency 498 ms vs RAG 1065 ms (≈53% lower)
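
The routing pattern described in this card can be sketched as follows. This is a minimal illustration, not AdaptJobRec's actual complexity classifier: the word-count threshold, the `COMPLEX_MARKERS` keyword list, and the `fast_tool` / `planner` callables are all assumptions made for the example.

```python
from typing import Callable, Tuple

# Hypothetical phrases that suggest a request needs multi-step reasoning.
COMPLEX_MARKERS = ("plan", "compare", "why", "career path", "should i")


def is_simple(query: str, max_words: int = 8) -> bool:
    """Treat short queries without reasoning markers as simple lookups."""
    text = query.lower()
    short_enough = len(text.split()) <= max_words
    has_marker = any(marker in text for marker in COMPLEX_MARKERS)
    return short_enough and not has_marker


def route(
    query: str,
    fast_tool: Callable[[str], str],
    planner: Callable[[str], str],
) -> Tuple[str, str]:
    """Send simple queries to the fast tool and everything else to the planner."""
    if is_simple(query):
        return "fast", fast_tool(query)
    return "planner", planner(query)


# Stand-in handlers; a real system would call a search tool or an agent here.
def fast_search(q: str) -> str:
    return f"search results for: {q}"


def memory_planner(q: str) -> str:
    return f"multi-step plan for: {q}"


print(route("data analyst jobs in Austin", fast_search, memory_planner))
print(route("How should I plan a career path from sales to product management?",
            fast_search, memory_planner))
```

A keyword heuristic like this is cheap but brittle; a small learned classifier is the more likely choice once you have labeled traffic.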

AP-SQL: combine a small fine-tuned schema filter, example retrieval, and thought-style prompts to run Text-to-SQL with lower cost

0.60

0.50

0.70

0

AP-SQL offers a practical way to run reliable Text-to-SQL with smaller models and lower inference cost by pruning schema context, reusing examples, and using structured prompts.

Key finding

AP-SQL gives consistent EX and TS gains on Spider across evaluated LLMs.

Numbers: GPT-4o: EX 89.7% vs E-SQL 88.6% (+1.1); TS 82.6% vs 79.4% (+3.2)

CarbonChat: LLM system for corporate carbon-emissions analysis using hybrid RAG and Text2SQL

0.60

0.50

0.60

0

CarbonChat automates extraction and structured analysis of long sustainability reports and policy texts, cutting manual effort and providing traceable, SQL-queryable answers for decision-makers.

Key finding

Self-Prompting RAG improves text-generation metrics vs standard RAG.

Numbers: Qwen-Max ROUGE-1 0.592 vs 0.529; BERTScore F1 0.906 vs 0.831

A large benchmark and finer evaluation method for generating grounded insights that pull evidence from multiple tables

0.50

0.60

0.30

0

If your product needs explainable insights from multiple tables, this benchmark and evaluator help measure real-world retrieval+analysis performance and reveal where current LLMs fail to ground facts.

Key finding

The benchmark is large and multi-table: examples draw on 2.88 gold tables on average.

Numbers: 18,532 test examples; 19,563 unique tables; avg gold tables / example = 2.88

DQABench: a 200k QA benchmark and modular testbed to measure LLMs on real database questions

0.60

0

If you build DB assistants, measure three things separately: core LLM skill, retrieval quality, and tool invocation. Improving retrieval and tool-format handling yields bigger gains than switching LLMs alone.

Key finding

Large models and DB-specialized training improve DB QA quality.

Numbers: Baichuan2-cpt-sft avg WinRate gain +0.44 (ZH) / +0.35 (EN) vs vanilla Baichuan2

One-click evaluation, automated ensembles, and LLM-powered Q&A for time series forecasting

0.60

0.50

0.60

0

EasyTime speeds method evaluation and selection by reusing a large benchmark and automating ensembles, reducing experiment time and guesswork for forecasting projects.

Key finding

TFB, the benchmark EasyTime reuses, covers a broad range of datasets and ships with precomputed results.

Numbers: 25 multivariate datasets; 8,068 univariate datasets; 8,000+ series with results

Use LLM-hallucinated mini-schemas to retrieve small, high-recall DB schema subsets for Text-to-SQL

0.60

0.65

0.60

0

If your product queries very large databases, CRUSH reduces token costs and increases correct SQL generation by selecting a smaller, higher-quality schema subset to send to an LLM.

Key finding

CRUSH improves column recall at moderate budget on SpiderUnion.

Numbers: r@10 = 0.83 (CRUSH) vs 0.77 (best baseline)
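
The retrieve-by-hallucinated-schema idea can be sketched as below. This is a simplified illustration under stated assumptions, not CRUSH itself: the hallucinated mini-schema is hard-coded rather than produced by an LLM, token-overlap (Jaccard) similarity stands in for the retrieval model, and a plain top-k cut stands in for the paper's selection step. All names and the sample schema are hypothetical.

```python
import re


def tokens(name):
    """Split an identifier such as 'singer.singer_name' into lowercase tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_schema_subset(hallucinated, real_columns, budget):
    """Return up to `budget` real columns that best match any hallucinated column."""
    scored = []
    for col in real_columns:
        best = max(
            (jaccard(tokens(col), tokens(h)) for h in hallucinated),
            default=0.0,
        )
        scored.append((best, col))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [col for score, col in scored[:budget] if score > 0]


# Hypothetical mini-schema an LLM might guess for "names of singers and concert years".
hallucinated = ["singer.name", "concert.year"]
real_columns = [
    "singer.singer_name",
    "singer.age",
    "concert.concert_year",
    "stadium.capacity",
    "orders.total",
]
print(retrieve_schema_subset(hallucinated, real_columns, budget=2))
```

Only the selected subset would then be sent to the LLM as schema context, which is where the token savings come from.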

SWAN: the first benchmark and baselines for mixing SQL databases with LLMs

0.35

0.60

0.50

0

Combining SQL and LLMs can answer questions that databases alone cannot, but current methods are error-prone and costly; invest in verification, caching, and prompt design before production use.

Key finding

SWAN created 120 beyond-database questions across 4 curated databases.

Numbers: 120 questions; 4 databases

cTBLS: rank table cells with dense encoders and prompt GPT-3.5 with top-k cells to ground chat replies

0.60

0.50

0.60

0

Feeding a small set of ranked table cells to an LLM yields more accurate and preferred conversational answers and cuts errors from wrong-source retrievals; that improves user trust while keeping LLM API calls limited.

Key finding

Dense Table Retrieval (DTR) improves table retrieval vs BM25.

Numbers: MRR@10: 0.491 → 0.846; Top-1 Acc: 0.345 → 0.777
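
For readers who want to reproduce this kind of comparison on their own retriever, the two metrics reported here follow their standard definitions and can be computed as below. The function names and the toy rankings are illustrative and are not taken from the paper's code.

```python
def mrr_at_k(ranked_lists, gold, k=10):
    """Mean reciprocal rank of the gold item within the top-k results."""
    total = 0.0
    for ranking, target in zip(ranked_lists, gold):
        top = ranking[:k]
        if target in top:
            # Rank is 1-based, so the first position contributes 1.0.
            total += 1.0 / (top.index(target) + 1)
    return total / len(ranked_lists)


def top1_accuracy(ranked_lists, gold):
    """Fraction of queries whose first result is the gold item."""
    hits = sum(1 for ranking, target in zip(ranked_lists, gold)
               if ranking and ranking[0] == target)
    return hits / len(ranked_lists)


# Toy example: three queries, each with a ranked list of table ids.
rankings = [["t1", "t2"], ["t3", "t4"], ["t5", "t6"]]
gold = ["t1", "t4", "t9"]
print(mrr_at_k(rankings, gold), top1_accuracy(rankings, gold))
```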

Agentic AI pipelines that generate test scenarios and search software project documents

0.60

0.45

0.60

0

Agentic pipelines can automate repetitive SE tasks (test-scenario creation and document search), cut manual labor, and speed onboarding; the systems are deployed internally but lack formal benchmarks.

Key finding

The test scenario generator is implemented as a six-agent star topology: a supervisor coordinating specialized workers.

Numbers: 6 agents; star topology described in Sec. 3.1

A database-native substrate that makes scientific pipelines safe for AI agents

0.80

0.60

0.70

0

DataJoint reduces risk when automating scientific workflows by making data provenance and computation transactional and machine-readable, lowering costly errors and rework.

Key finding

DataJoint 2.0 unifies data structure, stored objects, and computation under a single queryable schema.

Combine private knowledge across silos by sharing compact, masked parametric adapters instead of raw documents

0.70

0.80

0

FedMosaic lets companies aggregate private knowledge across departments or partners without moving raw documents, cutting network and storage costs dramatically while improving answer accuracy on evaluated QA tasks.

Key finding

FedMosaic outperforms state-of-the-art baselines on average across four datasets.

Numbers: Avg +10.9% F1 across four datasets

Split TableQA into a Data Leader plus Database and Knowledge-Graph teams to cut hallucinations and boost multi-hop answers

0.60

0.50

0

DataFactory trades higher query cost for much better accuracy and explainability on complex table queries, making it useful for teams that need reliable multi-hop analytics and traceable evidence from enterprise tables.

Key finding

Multi-agent DataFactory significantly improves accuracy on standard TableQA benchmarks versus baselines.

Numbers: TabFact avg 84.0% (↑20.2% over baselines)

FinS-Pilot: a 316-query, user-driven benchmark that tests real-time financial RAG with live API data

0.70

0.50

0.60

0

Financial assistants must combine live market APIs with text retrieval: without live data, numeric answers are wrong, and web search improves the quality of content answers.

Key finding

Dataset composition: 316 real user queries covering both time-sensitive numbers and content questions.

Numbers: 316 queries (104 numerical, 212 content).

GraphRAG (Neo4j + Llama‑3) retrieves reported drug side effects with near‑perfect accuracy

0.70

0.50

0.60

0

Graph‑backed retrieval plus a small LLM turns a curated safety database into an almost error‑free lookup service for side‑effect presence, cutting clinician search time and reducing misinformation risk.

Key finding

GraphRAG (Neo4j graph + Llama‑3 8B) achieved near‑perfect retrieval accuracy.

Numbers: Accuracy=0.9999; F1=0.9999; Precision=0.9998; Sensitivity=0.9999; Specificity=0.9998
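
The five reported metrics all derive from a binary confusion matrix (side effect present vs. absent). A minimal sketch using the standard definitions follows; the counts in the example are made up for illustration and are not the paper's data.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""
    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)   # also called recall
    specificity = safe_div(tn, tn + fp)
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Made-up counts, for illustration only.
print(binary_metrics(tp=90, fp=10, tn=80, fn=20))
```

When sensitivity and specificity are both this close to 1, it is worth checking how the negative examples were constructed, since easy negatives inflate every metric in the list.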

A realistic benchmark and frozen-web environment for testing web research agents

0.50

0.40

0.35

0

If your product uses web-research agents, use Deep Research Bench + RetroSearch to track real-world research skills over time and avoid live-web drift; current agents are useful but not yet human-level on hard research tasks.

Key finding

The best ReAct agent (o3) reached a mean score of 0.51.

Numbers: Best ReAct score = 0.51 (o3)